P-value for Intersection of Multiple Circle Venn Diagram

I was making a lot of venn diagrams in the past week and encountered to this question: How likely can the number of the intersection (middle part) of venn diagram occur by chance? Let’s say you have 4 DNA microarray samples (A, B, C and CT). In this dataset, you have 7000 genes  detected in all samples, and 200, 250, and 300 genes were up-regulated in A, B and C sample, respectively. If 3 genes are in common for all three samples, what is the likelihood to have this situation by chance? In other words, what is the chance to have at least 3 genes in common in this dataset.

Detected Genes: 7000
Up-regulated in A: 200
Up-regulated in B: 250
Up-regulated in C: 300
Common in A, B & C: 3

If venn diagram is drawn for this, it look like this:

venn

To answer this question we can
1) Derive statistical formula to calculate likelihood
2) Perform a simulation for a number of times to derive p-value

For me, 2) seems much easier to implement than 1), so I am going to describe how I did to simulate the situation

Here is the R code

## calculate p-value for three circle venn diagram 
## intersection
drawtest3<-function(N,m,a,b,c){
#N=7000 # total number of genes for selection
#m=3 # number of overlaps in all three sets
#a=200 # number of up-regulated genes in sample A
#b=250 # number of up-regulated genes in sample B
#c=300 # number of up-regualted genes in sample C

#create a dataframe to store frequency table
d<-data.frame(0:min(a,b,c),rep(0,min(a,b,c)+1))

for (i in 1:10000){
A=sample(1:N,size=a,replace=FALSE)
B=sample(1:N,size=b,replace=FALSE)
C=sample(1:N,size=c,replace=FALSE)
e<-table((C %in% A)&(C %in% B))["TRUE"]
# if there is no intersection, put 0 instead of NA
if(!complete.cases(e)){
e<-0
}
# add one to the counter for corresponding occurence
d[e+1,2]<-d[e+1,2]+1
}
colnames(d)<-c("Intersect","p-value")
# convert counts to ratios
d[,2]<-d[,2]/10000
# calculate p-value
p<-sum(d[(m+1):(min(a,b,c)+1),2])
#print(d)
return(p)
}

There are 5 arguments in this function to call correctly. In the example above,
>drawtest3(7000,3,200,250,300)
[1] 0.0027

If you want to see actual distribution of p-values, remove the # at the bottom of the code.

Intersect p-value
1 0 0.7255
2 1 0.2374
3 2 0.0344
4 3 0.0027
5 4 0.0000
6 5 0.0000
7 6 0.0000
8 7 0.0000
9 8 0.0000
.....
.....

As you can see, most of the cases you get no intersection (0.7255).  Having 3 in common for all three samples are very unlikely event (0.0027) by chance.  More than 3 genes in common actually never occurred in 10000 trials in this case. To make sure this code is correct, let’t try again in a small scale.

There are 5 balls numbered from 1 to 5 in a box. Three people A, B and C will draw 4 balls from the box. What is the chance that everybody gets the same ball. In this case, the arguments for the function will be

N=5
m=4
A=4
B=4
C=4

To calculate this probability, it is easier to think which ball A didn’t pick. Since A will pick 4 out of 5 balls, the possibilities that A will not pick one ball are 5.

A's pick     Remainder
{1,2,3,4}       5
{1,2,3,5}       4
{1,2,4,5}       3
{1,3,4,5}       2
{2,3,4,5}       1

Now B and C have to do the same as A. To not to pick 5 for both B and C is 1/5×1/5=1/25
So the probability that all of them have the same will be
1/5 x 1/5 x 1/5 x 5= 1/25 or 0.04

>drawtest3(5,4,4,4,4)
 Intersect p-value
1 0 0.0000
2 1 0.0000
3 2 0.4884
4 3 0.4703
5 4 0.0413
[1] 0.0413

My result was a little off from the theoretical 0.04. You may get a closer value if you draw more, but it may take longer to compute. The result also says you always have at least 2 balls in common. To get reproducible results, please set a seed before you run this code.

To this end, when you have multiple comparisons, having common genes across many samples are a very unlikely event by chance. Therefore, you may change the condition by relaxing to “present at least in 2 samples” instead of present in all samples. In any case, it is worthwhile to run this kind of simulation before you decide what condition should be applied for gene selection.

About bioinfomagician

Bioinformatic Scientist @ UCLA

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: