Calculation of the representation factor and the associated probability
Two groups of genes are compared and found to have x genes in common. A representation factor and the probability of finding an overlap of x genes are calculated.
Representation factor
The representation factor is the number of overlapping genes divided by the expected number of overlapping genes drawn from two independent groups. A representation factor > 1 indicates more overlap than expected of two independent groups, a representation factor < 1 indicates less overlap than expected, and a representation factor of 1 indicates that the two groups overlap by the number of genes expected for independent groups of genes.
x = # of genes in common between two groups.
n = # of genes in group 1.
D = # of genes in group 2.
N = total genes, in this case the 17611 genes with good spots on the Kim lab full genome chips.
C(a,b) is the number of combinations of a things taken b at a time.
The representation factor = x / expected # of genes.
Expected # of genes = (n * D) / N
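As a minimal sketch of the two formulas above, the calculation can be written in Python as follows. The function name and the example group sizes (x = 40, n = 500, D = 300) are made up for illustration; only N = 17611 comes from the text.

```python
def representation_factor(x, n, D, N):
    """Observed overlap divided by the overlap expected for independent groups."""
    expected = (n * D) / N  # expected # of genes in common
    return x / expected

# Hypothetical group sizes; N is the chip size quoted in the text.
N = 17611
rf = representation_factor(x=40, n=500, D=300, N=N)
print(f"representation factor = {rf:.2f}")  # > 1 means more overlap than expected
```

With these illustrative numbers the expected overlap is about 8.5 genes, so an observed overlap of 40 gives a representation factor well above 1.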
Probability
Exact hypergeometric probability
The probability of finding x overlapping genes can be calculated using the hypergeometric probability formula:
C(D, x) * C(N-D, n-x) / C(N,n)
If x is less than the expected number of overlapping genes, the probability of finding x or fewer genes is:
Prob = sum (i=0 to i=x) [ C(D, i) * C(N-D, n-i) / C(N,n) ]
If x is greater than the expected number of overlapping genes, the probability of finding x or more genes is:
Prob = 1- sum (i=0 to i=(x-1)) [ C(D, i) * C(N-D, n-i) / C(N,n) ]
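The two tail sums above can be sketched in Python with exact integer binomial coefficients from math.comb. This is an illustration, not the code behind this page: the function name is hypothetical, the case where x equals the expected overlap is not covered by the text and is assigned to the "x or fewer" branch here as an assumption, and the upper tail is formed as (C(N,n) - sum) / C(N,n), which is algebraically the same as 1 - sum but avoids rounding loss for very small probabilities.

```python
from math import comb

def overlap_probability(x, n, D, N):
    """Tail probability of an overlap of x genes under the hypergeometric model."""
    expected = n * D / N
    total = comb(N, n)  # C(N, n), an exact integer
    if x <= expected:   # assumption: equality is treated as the lower tail
        # P(x or fewer) = sum over i = 0..x
        numerator = sum(comb(D, i) * comb(N - D, n - i) for i in range(0, x + 1))
        return numerator / total
    # P(x or more) = 1 - sum over i = 0..x-1, kept in exact integers
    numerator = sum(comb(D, i) * comb(N - D, n - i) for i in range(0, x))
    return (total - numerator) / total

# Small hypothetical example: N = 10 genes, groups of 3 and 4, 2 in common.
p_upper = overlap_probability(2, 3, 4, 10)  # x above the expected 1.2
p_lower = overlap_probability(0, 3, 4, 10)  # x below the expected 1.2
```

For this small example the upper tail is (36 + 4) / 120 = 1/3 and the lower tail is 20 / 120 = 1/6, which can be checked by hand.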
C(a,b) is calculated using:
C(a,b) = a! / ((a - b)! * b!)
C(a,b) = exp(gammln(a + 1)-gammln(1+(a - b))-gammln(b + 1))
gamma(a + 1) = a!
and gammln(a + 1) is an approximation of ln(a!),
calculated using code from Chapter 6.1 of
Numerical Recipes in C: The Art of Scientific Computing (ISBN 0-521-43108-5)
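The log-gamma form of C(a,b) can be sketched in Python using math.lgamma, which is used here as a stand-in for the Numerical Recipes gammln routine (an assumption; the two are separate implementations of the same function). The helper names are hypothetical. Because C(N,n) itself can exceed floating-point range for N = 17611, the sketch adds and subtracts the logarithms first and exponentiates only the final ratio.

```python
from math import exp, lgamma

def log_comb(a, b):
    """ln C(a,b) = gammln(a + 1) - gammln(1 + (a - b)) - gammln(b + 1)."""
    return lgamma(a + 1) - lgamma(1 + (a - b)) - lgamma(b + 1)

def comb_via_lgamma(a, b):
    """C(a,b) as a float; only safe when the result fits in a double."""
    return exp(log_comb(a, b))

def hypergeom_term(i, n, D, N):
    """C(D,i) * C(N-D,n-i) / C(N,n), combined in log space before exp()."""
    return exp(log_comb(D, i) + log_comb(N - D, n - i) - log_comb(N, n))

c_small = comb_via_lgamma(10, 3)       # close to 120
term = hypergeom_term(2, 3, 4, 10)     # close to 36 / 120 = 0.3
```

Calling comb_via_lgamma(17611, 500) directly would overflow, which is why hypergeom_term works with the logarithms.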
Normal approximation
The exact hypergeometric probability is difficult to calculate, so a normal approximation is used when:
p +/- 2*sqrt(p*q/n) is > 0 and n * 10 < N.
where p = D / N and q = 1 - p.
The normal approximation is:
Z = abs (((x-.5) - n * p) / sqrt(n * p * q)).
Prob = P { Z } where Z is a standard normal variate from N(0,1).
P { Z } = 1 - ((erff(Z / sqrt(2)) + 1) / 2)
where erff is the error function calculated using code from Chapter 6.2 of
Numerical Recipes in C: The Art of Scientific Computing (ISBN 0-521-43108-5)
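The normal approximation above can be sketched in Python with math.erf standing in for the Numerical Recipes erff routine (an assumption). The function names are hypothetical. Since p + 2*sqrt(p*q/n) is always positive, the applicability check below only tests the lower bound, which is one reading of the condition as stated.

```python
from math import erf, sqrt

def normal_approx_applicable(n, D, N):
    """Check p - 2*sqrt(p*q/n) > 0 and n * 10 < N, with p = D / N, q = 1 - p."""
    p = D / N
    q = 1 - p
    return p - 2 * sqrt(p * q / n) > 0 and n * 10 < N

def normal_approx_probability(x, n, D, N):
    """Prob = 1 - ((erf(Z / sqrt(2)) + 1) / 2), Z = |((x - .5) - n*p) / sqrt(n*p*q)|."""
    p = D / N
    q = 1 - p
    z = abs(((x - 0.5) - n * p) / sqrt(n * p * q))
    return 1 - ((erf(z / sqrt(2)) + 1) / 2)

# Hypothetical group sizes; N is the chip size quoted in the text.
usable = normal_approx_applicable(n=1000, D=3000, N=17611)
prob = normal_approx_probability(x=220, n=1000, D=3000, N=17611)
```

Because Z is an absolute value, the result is a one-sided tail probability and never exceeds 0.5.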
Please send comments or questions regarding this web page to Jim Lund (jlund256 at gmail dot com)