P-value for Intersection of Multiple Circle Venn Diagram

I was making a lot of Venn diagrams in the past week and ran into this question: how likely is it that the number of genes in the intersection (the middle part) of a Venn diagram occurs by chance? Let's say you have 4 DNA microarray samples (A, B, C and CT). In this dataset, 7000 genes were detected in all samples, and 200, 250, and 300 genes were up-regulated in samples A, B and C, respectively. If 3 genes are common to all three samples, how likely is this situation to arise by chance? In other words, what is the chance of having at least 3 genes in common in this dataset?

Detected Genes: 7000
Up-regulated in A: 200
Up-regulated in B: 250
Up-regulated in C: 300
Common in A, B & C: 3

If a Venn diagram is drawn for this, it looks like this:

[Figure: Venn diagram of the genes up-regulated in A, B and C]

To answer this question, we can either
1) derive a statistical formula to calculate the likelihood, or
2) run a simulation many times to derive an empirical p-value.

For me, 2) is much easier to implement than 1), so I will describe how I simulated the situation.
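As an aside (my own addition, not part of the original analysis): for a two-circle Venn diagram the exact tail probability is simply the hypergeometric test, which R's phyper() computes directly; it is the three-set case that makes the simulation attractive. A minimal sketch, reusing the counts for samples A and B with a hypothetical observed overlap x:

## exact p-value for a TWO-set overlap: the chance of sharing at least x genes
## between a set of a genes and a set of b genes, both drawn from N genes
N <- 7000; a <- 200; b <- 250
x <- 10  # hypothetical observed overlap between A and B
phyper(x - 1, a, N - a, b, lower.tail = FALSE)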

Here is the R code

## calculate the empirical p-value for the three-circle Venn diagram intersection
drawtest3 <- function(N, m, a, b, c) {
  # N = 7000  # total number of genes to select from
  # m = 3     # observed number of genes shared by all three sets
  # a = 200   # number of up-regulated genes in sample A
  # b = 250   # number of up-regulated genes in sample B
  # c = 300   # number of up-regulated genes in sample C

  # create a data frame to store the frequency table
  d <- data.frame(0:min(a, b, c), rep(0, min(a, b, c) + 1))

  for (i in 1:10000) {
    A <- sample(1:N, size = a, replace = FALSE)
    B <- sample(1:N, size = b, replace = FALSE)
    C <- sample(1:N, size = c, replace = FALSE)
    e <- table((C %in% A) & (C %in% B))["TRUE"]
    # if there is no intersection, put 0 instead of NA
    if (!complete.cases(e)) {
      e <- 0
    }
    # add one to the counter for the corresponding overlap size
    d[e + 1, 2] <- d[e + 1, 2] + 1
  }
  colnames(d) <- c("Intersect", "p-value")
  # convert counts to ratios
  d[, 2] <- d[, 2] / 10000
  # p-value: probability of an overlap of at least m
  p <- sum(d[(m + 1):(min(a, b, c) + 1), 2])
  # print(d)
  return(p)
}

The function takes 5 arguments. For the example above:
>drawtest3(7000,3,200,250,300)
[1] 0.0027
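As a rough cross-check (my own addition, not part of the original post), the expected three-way overlap under independence is about N x (a/N) x (b/N) x (c/N), roughly 0.31 genes, and a Poisson approximation with that mean gives a tail probability in the same ballpark as the simulation:

# approximate the three-way overlap as Poisson with the expected count as its mean
lambda <- 7000 * (200/7000) * (250/7000) * (300/7000)  # about 0.31
ppois(3 - 1, lambda, lower.tail = FALSE)               # P(overlap >= 3), about 0.004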

If you want to see the full distribution of overlap frequencies, remove the # in front of print(d) near the bottom of the function.

Intersect p-value
1 0 0.7255
2 1 0.2374
3 2 0.0344
4 3 0.0027
5 4 0.0000
6 5 0.0000
7 6 0.0000
8 7 0.0000
9 8 0.0000
.....
.....

As you can see, in most cases you get no intersection at all (0.7255). Having 3 genes in common across all three samples is a very unlikely event by chance (0.0027), and more than 3 genes in common never occurred in 10,000 trials. To make sure the code is correct, let's try it again on a small scale.

There are 5 balls numbered from 1 to 5 in a box. Three people, A, B and C, each draw 4 balls from the box. What is the chance that all three people draw exactly the same 4 balls? In this case, the arguments for the function are

N=5
m=4
a=4
b=4
c=4

To calculate this probability, it is easier to think about which ball A did not pick. Since A picks 4 of the 5 balls, there are 5 possibilities for the ball A leaves behind.

A's pick     Remainder
{1,2,3,4}       5
{1,2,3,5}       4
{1,2,4,5}       3
{1,3,4,5}       2
{2,3,4,5}       1

Now B and C have to leave behind the same ball as A. For a given choice by A (say, ball 5 is left behind), the probability that both B and C also leave ball 5 behind is 1/5 x 1/5 = 1/25. Summing over A's 5 possible choices, the probability that all three people draw the same 4 balls is
1/5 x 1/5 x 1/5 x 5 = 1/25, or 0.04
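A one-line check of this arithmetic in R (my own addition, simply restating the calculation above):

# P(B draws the same 4 balls as A) = 1/choose(5, 4) = 1/5, and likewise for C
(1 / choose(5, 4))^2  # 0.04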

>drawtest3(5,4,4,4,4)
 Intersect p-value
1 0 0.0000
2 1 0.0000
3 2 0.4884
4 3 0.4703
5 4 0.0413
[1] 0.0413

My result (0.0413) is slightly off from the theoretical 0.04. You can get a closer value by running more iterations, but it will take longer to compute. The output also shows that the three people always end up with at least 2 balls in common. To get reproducible results, set a random seed before running the code.
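For example (a minimal sketch; the seed value is arbitrary):

set.seed(1234)            # any fixed seed makes the run reproducible
drawtest3(5, 4, 4, 4, 4)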

In short, when you compare many samples, having genes in common across all of them is a very unlikely event by chance. You may therefore relax the condition to "present in at least 2 samples" instead of "present in all samples". In any case, it is worthwhile to run this kind of simulation before deciding which condition to apply for gene selection.


Transfer Large Files from the Web to a Cluster

When you find large files on the web and need to transfer them to a computer cluster, what do you usually do? You can download the files to your desktop first, then transfer them to the cluster using Globus Online. It works, but it is a cumbersome two-step transfer and requires space on your desktop as well; once transferred, you also need to delete the local copies. In this post, I am going to present two methods to simplify this process.

Mount Your Cluster Account on Mac Desktop

Install FUSE plug-in

We are going to use SSHFS to mount remote drives. To do this, you need both the FUSE plug-in (OSXFUSE) and SSHFS. Both programs can be obtained here. Once they are installed, you will see an icon for OSXFUSE in the System Preferences window.

[Screenshot: OSXFUSE icon in System Preferences]

Create a Directory to Mount the Drive

You can create a directory anywhere you want. Let's create a folder called "Cluster" on your desktop.

[Screenshot: the empty "Cluster" folder on the desktop]

Run a Command in Bash

Open Terminal and type

>sshfs -h

If the plug-in is properly installed, you should see

general options:
-o opt,[opt...] mount options
-h --help print help
-V --version print version
.....
.....

no mount point

Then type the following to mount

>sshfs your_cluster_account@abcd.edu:/path/to/cluster/directory /path/to/Desktop/directory

Make sure you have the right paths for both the cluster and the local directory. If everything is correct, you will be prompted for a password.

If the mount succeeds, the folder you created on the desktop will change its appearance.

[Screenshot: the folder's changed appearance after mounting]

Caution: always mount the drive on an empty folder. Anything already in the folder becomes inaccessible once a drive is mounted on it. For example, if you mount a drive on Macintosh HD, you will likely lose access to most of your files and crash the system.

Try Saving Something

Save the file you want by clicking "Save Link As…", then select the cluster folder on your desktop. The file will be saved directly to your cluster account. The image below shows an example of downloading compressed FASTQ files from the Illumina website.

[Screenshot: saving a file from the Illumina website into the mounted cluster folder]

Again, when you mount the drive, make sure to use an empty directory; if you use a directory that already contains files, you will lose access to them.

Unmount the Drive

If you mounted the drive in the wrong directory, or the mounted drive stops responding, you can unmount it and remount. To unmount:

 >umount /path/to/the/mounted/directory

Alternatively, Use a Web Browser on your Cluster to Save

It is not uncommon these days for Unix and Unix-like clusters to have web browsers installed. I don't usually use them to browse the internet because they are rather slow. However, this approach requires no software installation or disk mounting, and it saves files from the web directly to your cluster account. Some people may prefer this.

Ask your Admin to Find which Browser is Available

There are a few common web browsers for Linux systems. Here is a list of the top 10 browsers. At my institution, I found that Firefox is available on the cluster. To use it, I log in with ssh -X and then type firefox, and a Firefox window pops up. Once the window is open, you can use it just like the browser on your local machine, only slower. Because the browser runs on the cluster, you can save web files directly to your account.

[Screenshot: Firefox running on the cluster via ssh -X]

Both methods work pretty well for transferring large files from the web. In my case, the second method was faster than the first. Try both and decide which you prefer.

Addendum on Mar 10th, 2014

One of my coworkers suggested a Mac application for mounting drives called MacFusion. It is easy to set up. The only things that need to be preinstalled are, again, these two programs:

1) FUSE for Mac OS X
2) SSHFS

Download from here

MacFusion download

[Screenshot: MacFusion setup window]

Setup needs pretty much the same information as above, and most people will have no problem using it if the methods above worked for them.

RNA-seq vs Microarray: what is the take?

When I was young, video tapes came in two formats, VHS and Beta. For computer data storage, there were cassette tapes and floppy disks. Floppies came in various sizes back then; I used at least 8″, 5″, 3.5″ and 3″ disks. As technology matured, many of these faded away, and what remains on the market is the result of survival of the fittest. In transcriptome analysis, DNA microarray has dominated the last decade. Recently, however, NextGen sequencing (NGS) technology has provided a new path for gene expression analysis. In this post, I want to compare gene expression analysis on the two platforms: RNA-seq and DNA microarray.

High Reproducibility for RNA-seq and Microarray

When DNA microarray technology was first introduced, spotted cDNA microarrays were quite variable between arrays, and it was necessary to run technical replicates as well as dye-swap experiments. In more recent years, however, both techniques have become highly reproducible, and several reports show that technical replicates of either method correlate at higher than 0.99. If you start with the same RNA, the results are essentially the same, so there is no need for technical replicates with either RNA-seq or DNA microarray.

[Figure: reproducibility of technical replicates for RNA-seq and microarray]

RNA-seq has a wider dynamic range

In the cell, the dynamic range of mRNA abundance is huge: some mRNAs are present at only a few copies per cell, while the most abundant ones exceed 10,000 copies per cell. Before talking about dynamic range, let's look at how similar the data obtained by RNA-seq and microarray are. The correlation between RNA-seq and microarray is usually pretty good; in my survey of dozens of papers, the R-squared is around 0.8. When log-transformed RNA-seq and microarray data are plotted against each other, however, the points are not uniformly distributed around the trend line (see figure below).
[Figure: log-transformed RNA-seq vs. microarray expression values]

Reference: Zhao et al. (2014), PLOS One

This is due to the difference in dynamic range. The dynamic range of RNA-seq depends on the depth of sequencing, while microarray has a more or less fixed dynamic range. In theory, if you sequence deeply enough, RNA-seq can capture a dynamic range as wide as the actual range of RNA molecule counts in the sample.

The majority of recent RNA-seq papers report 10-50 million mapped reads per sample on average, and this depth of sequencing already gives a wider dynamic range (>10^5) than DNA microarray (10^3-10^4). At the high end, DNA microarray shows saturation, while at the low end it suffers signal loss (a smaller signal than the actual abundance). In the middle range, the two technologies correlate highly with each other. These effects are evident in the figure.

RNA-seq is more sensitive than microarray

There are at least a dozen papers that ran both RNA-seq and microarray and reported that RNA-seq identified significantly more genes than microarray. Illumina states that the sensitivity of the major vendor's human microarray is equivalent to about 2 million mapped reads. Since most recent research papers report >10 million mapped reads per sample on average, RNA-seq should provide much higher sensitivity than microarray.

How accurate are the data?

To assess whether the results obtained from RNA-seq or microarray are accurate, quantitative real-time PCR (q-RT-PCR) is most commonly used. If primers are carefully designed so that they amplify only a specific gene with very high amplification efficiency, q-RT-PCR should provide the most accurate measure of a particular RNA's abundance. In one study, RNA-seq and microarray results were validated by q-RT-PCR for 488 significantly changed genes (>2.0-fold).

                        Correct       Total        % correct
RNA-seq          415             460           90%
Microarray       314             340           93%
Both                 296             312           95%

In both cases, agreement of the gene expression change was greater than 90%, and where the two technologies agreed with each other, the accuracy was 95%. I have seen one other study in which the accuracy was considerably lower than this. However, if the same RNA is saved for q-RT-PCR and the primers are carefully designed, similar accuracy should be attainable. If your q-RT-PCR results do not agree, a careful examination of the PCR primers and of the exon/probe-level RNA-seq/microarray results is required.

While the calculation above does not determine actual accuracy, generally speaking RNA-seq proves more accurate in terms of fold-change values. I would guess there is no significant difference between RNA-seq and microarray for genes with medium to high expression; at the low and high ends of expression, however, RNA-seq is likely more accurate thanks to its wider dynamic range.

Splicing Variant Detection

Let's say you find a potentially very interesting differentially expressed gene by microarray. You go ahead and try cloning this gene for further biochemical study, but you find that it has multiple splice variants. You then examine the microarray probes for this gene and find that they cover only the exons shared among the splice variants.

[Figure: splice variants and the shared exons covered by the microarray probes]

In this case, you need to figure out which form(s) of the gene is (are) actually differentially expressed. While this can be done by q-RT-PCR or northern blot analysis, it takes extra time and effort to confirm. The more probes a microarray contains for the possible variants, the fewer issues there are with identifying and quantifying specific variants.

SNP Detection

RNA-seq is also capable of detecting single nucleotide polymorphisms (SNPs). Although it would be difficult to find de novo SNPs in low-abundance RNAs (the error rate of Illumina's Genome Analyzer is ~1%), RNA-seq can detect single nucleotide changes as well as sequence changes introduced by RNA editing.

Is there a downside of RNA-seq?

RNA-seq is highly reliable and offers higher dynamic range and sensitivity than microarray. In addition, it can detect novel splice variants and mutations. However, RNA-seq is more costly ($300-$1,000/sample) than microarray ($100-$200/sample), due to the extensive bioinformatic analysis required and the use of newer machines.

There are many tools for RNA-seq analysis and not yet a single standard protocol. RNA-seq files are also much bigger than microarray files: a typical uncompressed RNA-seq raw file can easily exceed 5 GB, whereas microarray files are 30-40 times smaller.

Analysis of RNA-seq data requires extensive bioinformatic skills and computing resources (CPU and RAM). The large file sizes also make the data hard to share and costly to store, especially for large datasets.

Bottom line

For a quick-and-easy experiment, microarray can provide reliable and sensitive results. With accurate probe annotations and probe designs to distinguish splice variants and detect non-coding RNAs (e.g. miRNA or lincRNA), microarray analysis can get pretty close to what RNA-seq can offer at a significantly lower cost.

While the cost and the complexity of analysis should improve over time, I am more excited about the advent of third-generation (3-gen) sequencing. With a lower rate of sequencing error (not enzyme based), no need for amplification, and deeper sequencing, the future of RNA-seq is certainly promising.

Determining Unknown Peptide Identification & Modification: Who is the Best?

The Proteome Informatics Research Group (iPRG) held a competition on peptide identification and modification among experts in the proteomics and bioinformatics fields. The results of the 2012 competition were recently published in MCP.

First, the 24 participants were given ~18,000 spectra and a protein sequence database with 42K entries. Decoy sequences, generated by scrambling the amino acid sequences of the true database, were also provided. The spectra were generated from tryptic digests of yeast lysate plus 70 spiked-in controls carrying various modifications.

Each participant was encouraged to use whatever methods they liked to identify as many CID spectra as possible at a <1% false discovery rate (FDR). For modified peptides, participants were required to report the types of modifications and their localizations. However, the possible modifications were not named or given as choices; participants were only told that a wide variety of modifications, both biological and chemical in nature, were present in the samples.
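As a reminder of how such an FDR threshold is typically enforced, here is a minimal sketch of the standard target-decoy estimate (my own illustration, not taken from the paper; psm_scores and is_decoy are hypothetical vectors of PSM scores and decoy flags):

## estimate the FDR at a score threshold as the ratio of decoy hits to target hits
estimate_fdr <- function(psm_scores, is_decoy, threshold) {
  n_target <- sum(psm_scores >= threshold & !is_decoy)
  n_decoy  <- sum(psm_scores >= threshold &  is_decoy)
  if (n_target == 0) return(NA_real_)
  n_decoy / n_target
}

## sweep thresholds and keep the most permissive one with estimated FDR < 1%
pick_threshold <- function(psm_scores, is_decoy, fdr_cutoff = 0.01) {
  thresholds <- sort(unique(psm_scores))
  fdrs <- sapply(thresholds, function(t) estimate_fdr(psm_scores, is_decoy, t))
  thresholds[which(fdrs < fdr_cutoff)[1]]
}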

A summary of the results is shown below. First, there is a wide range in the number of peptide-spectrum matches (PSMs) across research groups: the highest was >7,000 and the lowest was <2,500. A total of 13 different search engines were used across the groups, and some groups used more than one.
[Figure: number of PSMs reported by each participating group]

Many participants used Mascot as their search engine, but their performance varied widely, most likely because of how each group handled variable modifications. There are several approaches to determining post-translational modifications (PTMs). If you allow all possible modifications, the search space becomes too large and you will likely end up with fewer identifications.

A multiple-path search is another way to determine PTMs: first search with a few common modifications, then search the unmatched spectra with less common ones. However, the authors noted that the multiple-path strategy made it hard for participants to determine a proper FDR and required more manual examination. In addition, different search engines handle certain modifications more efficiently than others.

Unfortunately, there is no ground-truth ("correct") answer for the identifications from the yeast lysate tryptic digest. The authors therefore generated "consensus" PSMs from all PSMs that were identified by at least three groups. In the figure below, blue bars indicate the number of consensus PSMs identified by each participant. It is difficult to assess who the best performer is without truly "correct" answers, and the consensus PSMs likely contain more PSMs from search engines with similar algorithms; the gray and yellow bars correspond to participants who used less common search engines. However, it is interesting that the highest number of PSMs was obtained with a single search engine. This means that many factors affect the final results, so optimization is more critical than the choice among different programs.

[Figure: consensus PSMs (blue bars) and spike-control identifications by participant]

I wish they had also added spiked-in controls without modifications to see how each participant optimized their searches. Nevertheless, it is quite interesting to see how each participant performed in identifying these control peptides. Participants 11211 and 58409 were top performers in total PSMs, but they did not do well in identifying the spiked controls. It seems that localization of modifications is still a difficult task.

In any case, even the best performer could not identify ~1,500 consensus PSMs (roughly 20% of the original spectra), and the authors note that there is quite a bit of room for each group to improve its approach.

The original data, including the spectrum and database files, can be downloaded here (use the login and password provided in the text). Why not try your own search and see if you can beat these expert participants?

Making the Best of PubMed Search

The PubMed website has been constantly improved over time, and many features have been added since I started using it more than 15 years ago. In this post, I want to discuss how to improve your PubMed searches so that you find what you are looking for.

As of December 2013, PubMed contains more than 23 million citations for the biomedical literature. Without proper search terms, you will be overwhelmed quickly. PubMed has extensive tutorial sections, but you may not have time to watch all the videos and read the documents. Many parts of PubMed are quite intuitive, so I am not going to discuss features people can easily figure out on their own.

Use Special Characters Properly to Enhance Your Search

PubMed search supports several special characters that make your search more specific: "" (double quotation marks), AND/OR/NOT (Boolean operators), * (asterisk), parentheses ( ), and square brackets [ ].

Put Words between Double Quotation Marks to Search for Phrases

As in a Google search, if you place words between double quotation marks, PubMed will search for articles that contain that specific phrase. For example, "lung cancer" will not match "lung and breast cancer", which would be matched without the quotation marks. PubMed search is NOT case sensitive, so it does not matter whether any part of the word is capitalized.

Boolean Operators To Retrieve Intersections & Unions

AND/OR/NOT are recognized as Boolean operators whether or not they are capitalized. If you want the intersection of two terms, use AND. OR gives the union, so you will get articles that contain either term or both. If you get too many hits, NOT is useful for refining your search.

[Screenshot: example searches with Boolean operators]

Using Asterisk For Ambiguous Search

The asterisk (*) is called a truncation symbol in PubMed search. It works like a wildcard in a regular expression and is useful when you want to perform a broader, fuzzier search. For example, ferment* will match fermented, fermentation, fermenting, and any other word that starts with ferment.

Parentheses with Boolean Operators For Complex Searches

Words in parentheses are treated as a set. By combining parentheses and Boolean operators, you can perform more complex searches.

Example: vaccines that may cause either red skin or muscle pain

Answer: vaccine AND (red skin OR muscle pain)

It is up to you whether to add double quotation marks to make the terms more specific.
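If you ever want to run the same Boolean queries from a script, the rentrez R package (an assumption on my part that it is installed and that its entrez_search() interface behaves as I recall; it wraps the NCBI E-utilities) accepts the same syntax:

library(rentrez)

# the same Boolean/parenthesis syntax works in the term argument
res <- entrez_search(db = "pubmed",
                     term = "vaccine AND (red skin OR muscle pain)",
                     retmax = 20)
res$count  # number of matching citations
res$ids    # PubMed IDs of the first 20 hits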

On the Advanced Search page, you can combine your previous searches from the history using Boolean operators. This is useful if you are building complicated searches or want to exclude references you have already seen.

[Screenshot: combining previous searches on the Advanced Search page]

Square Brackets To Specify Search Field

You probably know that PubMed's interpretation of your search term is shown in the Search Details box on the right. Notice that every word is followed by square brackets containing the field information. Try memorizing the common field names so that you can do advanced searches without using the Advanced Search page. Examples: [Author], [Title], [Text], [ad], [Journal]

Finding the Right Author with a Common Name

It often happens that after going to a meeting I have met a person and want to find his/her paper(s) on PubMed, but the name is very common. If you know only the initial of the first name, a PubMed search will return a huge number of hits.

Use the [ad] Option to Search

Even if the author you are looking for has a common last name, knowing part of the name of his/her institutional affiliation will significantly narrow the search when added with the [ad] field.

Use the Journal Name

If you know the name of the journal in which the author published, you can add the [Journal] field to the author name in your search.

Finding the Correct Abbreviation of Journal Title

Journal abbreviations are commonly used in citations in CVs and many other places, and you can also use them in PubMed searches. However, you need to use the exact abbreviation for each journal, otherwise the search will not return the correct results. First, go to the PubMed main page and click "Journals in NCBI Database" in the "more resources" column on the right. If you cannot find it, click here.

Next, type the journal title in the search field. If the exact title appears while you are typing, select it and press Enter. Below, you will see the two abbreviation types for The Journal of Biological Chemistry (commonly known as JBC). You cannot use the term "JBC" to search for JBC articles; you need to use one of the abbreviations in the red rectangle, and either of them will work fine in a PubMed search.

[Screenshot: the two journal title abbreviations for The Journal of Biological Chemistry]

Want to Find a Trend in a Particular Research Area

[Figure: PubMed results-by-year trend display]

PubMed shows the results-by-year trend only if the search term returns more than 10,000 citations; with fewer hits, the trend is not shown. Fortunately, there is a website that does the same job without this limitation. Click the following URL and type in your search term (it takes a little while to show the results).

http://dan.corlan.net/medline-trend.html

[Screenshot: results from medline-trend.html]

Copy the results and paste them into a text editor (e.g., Word or Notepad), then save them as a text file (.txt).

Run Excel and open the text file you just saved. When you open the document, Excel will ask whether you want to separate the fields by certain characters. Select "Delimited" and click Next.

Then select the space delimiter (see below).

[Screenshot: Excel Text Import Wizard with the space delimiter selected]

Now each value is placed in its own cell, so you can manipulate the data and create a trend graph.
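If you would rather skip Excel, here is a minimal R sketch along the same lines (my own addition; the file name trend.txt and the column layout, year first and article count second, are assumptions you should check against what the site actually returns):

# read the whitespace-delimited results saved from the trend page
trend <- read.table("trend.txt", header = FALSE)
colnames(trend)[1:2] <- c("year", "articles")  # assumed column order

# plot the number of articles per year
plot(trend$year, trend$articles, type = "b",
     xlab = "Year", ylab = "Articles per year")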

Be Greedy When You Find a Good Reference

If you find a good hit with a PubMed search, try clicking its related citations; you may well find more articles you like.

[Screenshot: the related citations section]

Mobile App for PubMed Search

I found this app pretty easy to use and convenient for mobile users. You can comment on articles and save them on your mobile device.

[Screenshot: PubMed mobile app]
