# P-value for Intersection of Multiple Circle Venn Diagram

I was making a lot of venn diagrams in the past week and encountered to this question: How likely can the number of the intersection (middle part) of venn diagram occur by chance? Let’s say you have 4 DNA microarray samples (A, B, C and CT). In this dataset, you have 7000 genes detected in all samples, and 200, 250, and 300 genes were up-regulated in A, B and C sample, respectively. If 3 genes are in common for all three samples, what is the likelihood to have this situation by chance? In other words, what is the chance to have at least 3 genes in common in this dataset.

Detected Genes: 7000

Up-regulated in A: 200

Up-regulated in B: 250

Up-regulated in C: 300

Common in A, B & C: 3

If venn diagram is drawn for this, it look like this:

To answer this question we can

1) Derive statistical formula to calculate likelihood

2) Perform a simulation for a number of times to derive p-value

For me, 2) seems much easier to implement than 1), so I am going to describe how I did to simulate the situation

Here is the R code

## calculate p-value for three circle venn diagram ## intersection drawtest3<-function(N,m,a,b,c){ #N=7000 # total number of genes for selection #m=3 # number of overlaps in all three sets #a=200 # number of up-regulated genes in sample A #b=250 # number of up-regulated genes in sample B #c=300 # number of up-regualted genes in sample C #create a dataframe to store frequency table d<-data.frame(0:min(a,b,c),rep(0,min(a,b,c)+1)) for (i in 1:10000){ A=sample(1:N,size=a,replace=FALSE) B=sample(1:N,size=b,replace=FALSE) C=sample(1:N,size=c,replace=FALSE) e<-table((C %in% A)&(C %in% B))["TRUE"] # if there is no intersection, put 0 instead of NA if(!complete.cases(e)){ e<-0 } # add one to the counter for corresponding occurence d[e+1,2]<-d[e+1,2]+1 } colnames(d)<-c("Intersect","p-value") # convert counts to ratios d[,2]<-d[,2]/10000 # calculate p-value p<-sum(d[(m+1):(min(a,b,c)+1),2]) #print(d) return(p) }

There are 5 arguments in this function to call correctly. In the example above,

>drawtest3(7000,3,200,250,300)

[1] 0.0027

If you want to see actual distribution of p-values, remove the # at the bottom of the code.

```
Intersect p-value
1 0 0.7255
2 1 0.2374
3 2 0.0344
4 3 0.0027
5 4 0.0000
6 5 0.0000
7 6 0.0000
8 7 0.0000
9 8 0.0000
.....
.....
```

As you can see, most of the cases you get no intersection (0.7255). Having 3 in common for all three samples are very unlikely event (0.0027) by chance. More than 3 genes in common actually never occurred in 10000 trials in this case. To make sure this code is correct, let’t try again in a small scale.

There are 5 balls numbered from 1 to 5 in a box. Three people A, B and C will draw 4 balls from the box. What is the chance that everybody gets the same ball. In this case, the arguments for the function will be

N=5

m=4

A=4

B=4

C=4

To calculate this probability, it is easier to think which ball A didn’t pick. Since A will pick 4 out of 5 balls, the possibilities that A will not pick one ball are 5.

A's pick Remainder {1,2,3,4} 5 {1,2,3,5} 4 {1,2,4,5} 3 {1,3,4,5} 2 {2,3,4,5} 1

Now B and C have to do the same as A. To not to pick 5 for both B and C is 1/5×1/5=1/25

So the probability that all of them have the same will be

1/5 x 1/5 x 1/5 x 5= 1/25 or 0.04

```
>drawtest3(5,4,4,4,4)
Intersect p-value
1 0 0.0000
2 1 0.0000
3 2 0.4884
4 3 0.4703
5 4 0.0413
[1] 0.0413
```

My result was a little off from the theoretical 0.04. You may get a closer value if you draw more, but it may take longer to compute. The result also says you always have at least 2 balls in common. To get reproducible results, please set a seed before you run this code.

To this end, when you have multiple comparisons, having common genes across many samples are a very unlikely event by chance. Therefore, you may change the condition by relaxing to “present at least in 2 samples” instead of present in all samples. In any case, it is worthwhile to run this kind of simulation before you decide what condition should be applied for gene selection.

## Recent Comments