Archive | March 2014

P-value for Intersection of Multiple Circle Venn Diagram

I was making a lot of venn diagrams in the past week and encountered to this question: How likely can the number of the intersection (middle part) of venn diagram occur by chance? Let’s say you have 4 DNA microarray samples (A, B, C and CT). In this dataset, you have 7000 genes  detected in all samples, and 200, 250, and 300 genes were up-regulated in A, B and C sample, respectively. If 3 genes are in common for all three samples, what is the likelihood to have this situation by chance? In other words, what is the chance to have at least 3 genes in common in this dataset.

Detected Genes: 7000
Up-regulated in A: 200
Up-regulated in B: 250
Up-regulated in C: 300
Common in A, B & C: 3

If venn diagram is drawn for this, it look like this:

venn

To answer this question we can
1) Derive statistical formula to calculate likelihood
2) Perform a simulation for a number of times to derive p-value

For me, 2) seems much easier to implement than 1), so I am going to describe how I did to simulate the situation

Here is the R code

## calculate p-value for three circle venn diagram 
## intersection
drawtest3<-function(N,m,a,b,c){
#N=7000 # total number of genes for selection
#m=3 # number of overlaps in all three sets
#a=200 # number of up-regulated genes in sample A
#b=250 # number of up-regulated genes in sample B
#c=300 # number of up-regualted genes in sample C

#create a dataframe to store frequency table
d<-data.frame(0:min(a,b,c),rep(0,min(a,b,c)+1))

for (i in 1:10000){
A=sample(1:N,size=a,replace=FALSE)
B=sample(1:N,size=b,replace=FALSE)
C=sample(1:N,size=c,replace=FALSE)
e<-table((C %in% A)&(C %in% B))["TRUE"]
# if there is no intersection, put 0 instead of NA
if(!complete.cases(e)){
e<-0
}
# add one to the counter for corresponding occurence
d[e+1,2]<-d[e+1,2]+1
}
colnames(d)<-c("Intersect","p-value")
# convert counts to ratios
d[,2]<-d[,2]/10000
# calculate p-value
p<-sum(d[(m+1):(min(a,b,c)+1),2])
#print(d)
return(p)
}

There are 5 arguments in this function to call correctly. In the example above,
>drawtest3(7000,3,200,250,300)
[1] 0.0027

If you want to see actual distribution of p-values, remove the # at the bottom of the code.

Intersect p-value
1 0 0.7255
2 1 0.2374
3 2 0.0344
4 3 0.0027
5 4 0.0000
6 5 0.0000
7 6 0.0000
8 7 0.0000
9 8 0.0000
.....
.....

As you can see, most of the cases you get no intersection (0.7255).  Having 3 in common for all three samples are very unlikely event (0.0027) by chance.  More than 3 genes in common actually never occurred in 10000 trials in this case. To make sure this code is correct, let’t try again in a small scale.

There are 5 balls numbered from 1 to 5 in a box. Three people A, B and C will draw 4 balls from the box. What is the chance that everybody gets the same ball. In this case, the arguments for the function will be

N=5
m=4
A=4
B=4
C=4

To calculate this probability, it is easier to think which ball A didn’t pick. Since A will pick 4 out of 5 balls, the possibilities that A will not pick one ball are 5.

A's pick     Remainder
{1,2,3,4}       5
{1,2,3,5}       4
{1,2,4,5}       3
{1,3,4,5}       2
{2,3,4,5}       1

Now B and C have to do the same as A. To not to pick 5 for both B and C is 1/5×1/5=1/25
So the probability that all of them have the same will be
1/5 x 1/5 x 1/5 x 5= 1/25 or 0.04

>drawtest3(5,4,4,4,4)
 Intersect p-value
1 0 0.0000
2 1 0.0000
3 2 0.4884
4 3 0.4703
5 4 0.0413
[1] 0.0413

My result was a little off from the theoretical 0.04. You may get a closer value if you draw more, but it may take longer to compute. The result also says you always have at least 2 balls in common. To get reproducible results, please set a seed before you run this code.

To this end, when you have multiple comparisons, having common genes across many samples are a very unlikely event by chance. Therefore, you may change the condition by relaxing to “present at least in 2 samples” instead of present in all samples. In any case, it is worthwhile to run this kind of simulation before you decide what condition should be applied for gene selection.

%d bloggers like this: