When I was young, videotapes came in two formats: VHS and Betamax. For computer data storage, there were cassette tapes and floppy disks. Floppies came in various sizes back then; I used at least 8″, 5.25″, 3.5″, and 3″ disks. As technology matured, many of these formats faded away, and the ones that remain on the market are the result of survival of the fittest. In transcriptome analysis, DNA microarray has dominated the last decade. Recently, however, next-generation sequencing (NGS) technology has provided a new path for gene expression analysis. In this post, I want to compare gene expression analysis on two platforms: RNA-seq and DNA microarray.
High Reproducibility for RNA-seq and Microarray
When DNA microarray technology was first introduced, spotted cDNA microarrays were quite variable between arrays, and it was necessary to run technical replicates as well as dye-swap experiments. In more recent years, however, both techniques have become highly reproducible, and several reports say the technical replicates of the two methods have a correlation higher than 0.99. If you start with the same RNA, the results are essentially the same. There is no need for technical replicates for either RNA-seq or DNA microarray.
RNA-seq has a wider dynamic range
In the cell, the dynamic range of mRNA abundance is huge: some mRNAs are present at only a few copies per cell, while the most abundant exceed 10,000 copies per cell. Before talking about dynamic range, though, let’s look at how similar the data obtained by RNA-seq and microarray are. The correlation between RNA-seq and microarray is usually pretty good; in my survey of dozens of papers, R-squared values are around 0.8. When log-transformed RNA-seq and microarray data are plotted against each other, however, the points are not uniformly distributed around the trend line (see figure below).
Reference: Zhao et al. (2014), PLOS One
This is due to the difference in dynamic range. The dynamic range of RNA-seq depends on the depth of sequencing, while microarray has a more or less fixed dynamic range. In theory, then, if you sequence deeply enough, RNA-seq can achieve a dynamic range matching the actual number of RNA molecules in the sample.
The majority of recent RNA-seq papers report 10–50 million mapped reads on average, and this depth of sequencing already gives a wider dynamic range (>10^5) than DNA microarray (10^3–10^4). At the high end, DNA microarray shows saturation, while at the low end it suffers signal loss (a smaller signal than the actual abundance). In the middle range, the two technologies correlate highly with each other. These effects are evident in the figure.
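The depth argument can be made concrete with a toy calculation in Python. The 500,000-molecule mRNA pool per cell and the 10-read quantification threshold below are my own illustrative assumptions, not figures from the text:

```python
# Illustrative model (assumptions, not data from the post): a cell with a
# pool of ~500,000 mRNA molecules, sampled uniformly by sequencing.
TOTAL_MRNA_PER_CELL = 500_000  # assumed transcriptome pool size

def expected_reads(copies_per_cell, depth):
    """Expected mapped reads for one transcript at a given total depth."""
    return depth * copies_per_cell / TOTAL_MRNA_PER_CELL

def detectable(copies_per_cell, depth, min_reads=10):
    """Call a transcript quantifiable if it expects >= min_reads reads."""
    return expected_reads(copies_per_cell, depth) >= min_reads

abundances = [1, 10, 100, 1_000, 10_000]   # copies per cell
for depth in (2_000_000, 30_000_000):      # microarray-equivalent vs. typical RNA-seq
    quantifiable = [a for a in abundances if detectable(a, depth)]
    print(f"{depth:>10} reads: quantifiable down to {min(quantifiable)} copies/cell")
```

Under these assumptions, a single-copy transcript falls below the threshold at ~2 million reads but is comfortably covered at 30 million, which is why deeper sequencing directly widens the usable dynamic range.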
RNA-seq is more sensitive than microarray
At least a dozen papers have run both RNA-seq and microarray on the same samples and reported that RNA-seq identified significantly more genes than microarray. Illumina says the sensitivity of the major vendor's (human) microarray is equivalent to 2 million mapped reads. Since most recent research papers report >10 million mapped reads per sample on average, RNA-seq should provide much higher sensitivity than microarray.
How accurate are the data?
To assess whether the results obtained from RNA-seq or microarray are accurate, quantitative real-time PCR (qRT-PCR) is most commonly used. If primers are carefully designed so that they amplify only a specific gene with very high amplification efficiency, qRT-PCR should provide the most accurate measure of a particular RNA's abundance. In one study, RNA-seq and microarray results were validated by qRT-PCR for 488 significantly changed genes (>2.0-fold).
             Correct   Total   % correct
RNA-seq        415      460       90%
Microarray     314      340       93%
Both           296      312       95%
For both platforms, agreement with qRT-PCR on the gene expression changes was greater than 90%. Where the two technologies agreed with each other, the accuracy was 95%. In one other study I saw, the accuracy was considerably lower than this. However, if an aliquot of the same RNA is saved for qRT-PCR and the primers are carefully designed, similar accuracy should be achievable. If your qRT-PCR results don't agree, careful examination of the PCR primers and of the exon/probe-level RNA-seq/microarray results is required.
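As a quick sanity check, the percentages in the table follow directly from the raw counts (the counts are as given above; the exact decimals differ slightly from the rounded percentages):

```python
# Recompute the agreement rates from the raw counts in the table above.
validation = {
    "RNA-seq":    (415, 460),
    "Microarray": (314, 340),
    "Both":       (296, 312),
}
for platform, (correct, total) in validation.items():
    print(f"{platform:10s} {correct}/{total} = {100 * correct / total:.1f}%")
```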
While the calculation above does not determine the actual accuracy, generally speaking RNA-seq proves more accurate in terms of fold-change values. My guess is that there is no significant difference between RNA-seq and microarray for genes with medium to high expression; however, the low and high extremes of gene expression are likely measured more accurately by RNA-seq because of its wider dynamic range.
Splicing Variant Detection
Let’s say you find a differentially expressed gene on a microarray that looks potentially very interesting. You go ahead and try to clone this gene for further biochemical study, only to find that it has multiple splice variants. You then examine the probes for this gene on the microarray and find that they only cover exons shared among the splice variants.
In this case, you need to figure out which form(s) of the gene is (are) actually differentially expressed. While this can be done by qRT-PCR or northern blot analysis, confirming it takes extra time and effort. The more probes a microarray carries for the possible variants, the fewer problems there are in identifying and quantifying specific variants.
RNA-seq, by contrast, reads across exon-exon junctions, so splice variants can often be distinguished directly from the sequence data. RNA-seq is also capable of detecting single nucleotide polymorphisms (SNPs). Although it would be difficult to call de novo SNPs in low-abundance RNAs (the error rate of Illumina's Genome Analyzer is ~1%), RNA-seq can detect single-nucleotide changes as well as sequence changes introduced by RNA editing.
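To see why a ~1% error rate matters mainly for low-abundance RNAs, here is a back-of-the-envelope binomial calculation. The coverage numbers and the simple binomial error model are my own illustrative assumptions, not an analysis from the post:

```python
from math import comb

# With per-base error rate p, what is the chance that sequencing error
# alone produces k or more mismatching reads at a position covered by
# n reads? (Simple binomial model; illustrative only.)
def p_errors_at_least(n, k, error_rate=0.01):
    return sum(comb(n, i) * error_rate**i * (1 - error_rate)**(n - i)
               for i in range(k, n + 1))

# A single mismatch at 3x coverage arises by chance ~3% of the time,
# so one supporting read in a low-abundance RNA is not convincing:
print(p_errors_at_least(3, 1))
# At 30x coverage, requiring 3+ supporting reads makes chance hits rare:
print(p_errors_at_least(30, 3))
```

The asymmetry is the point: abundant transcripts accumulate enough coverage to separate real variants from noise, while rare transcripts do not.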
Is there a downside of RNA-seq?
RNA-seq is highly reliable and offers wider dynamic range and higher sensitivity than microarray. In addition, it can detect novel splice variants and mutations. However, RNA-seq is more costly ($300–$1,000/sample) than microarray ($100–$200/sample), owing to the extensive bioinformatic analysis required and the newer machines it runs on.
There are many tools for RNA-seq analysis, and there is not yet one standard protocol. RNA-seq files are also much bigger than microarray files: a typical uncompressed RNA-seq raw file can easily exceed 5 GB, while microarray files are 30–40 times smaller.
Analyzing RNA-seq data requires extensive bioinformatic skills and computing resources (CPU and RAM). The large file sizes can make data hard to share and costly to store, especially for large datasets.
For a quick-and-easy experiment, microarray can provide reliable and sensitive results. With accurate probe annotations and probe designs that distinguish splice variants and detect non-coding RNAs (e.g., miRNA or lincRNA), microarray analysis can come pretty close to what RNA-seq offers, at a significantly lower cost.
While the cost and the complexity of analysis should improve over time, I am more excited about the advent of third-generation (3-gen) sequencing. With a lower rate of sequencing error (not being enzyme-based), no need for amplification, and deeper sequencing, the future of RNA-seq is certainly promising.
The Proteome Informatics Research Group (iPRG) held a competition in peptide identification and modification analysis among experts in the proteomics and bioinformatics field. The results of the 2012 competition were recently published in MCP.
First, the 24 participants were given ~18,000 spectra and a protein sequence database with 42K entries. Decoy sequences, generated by scrambling the amino acid sequences of the true database, were also provided. The spectra came from tryptic digests of yeast lysate plus 70 spiked-in controls carrying various modifications.
Each participant was encouraged to use whatever methods they liked to identify as many CID spectra as possible at a <1% false discovery rate (FDR). For modified peptides, participants were required to report the types of modifications and their localizations. However, the possible modifications were not named or provided as choices; participants were only told that a wide variety of modifications, both biological and chemical in nature, were present in the samples.
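For readers unfamiliar with how a <1% FDR cutoff is enforced using a decoy database, here is a minimal target-decoy filtering sketch in Python. The scoring scheme and example PSMs are hypothetical, and real pipelines add tie-breaking and q-value smoothing:

```python
# Target-decoy FDR filtering: rank all PSMs by score; at each cutoff,
# estimate FDR ~= decoy_hits / target_hits, and keep the largest
# score-ranked prefix of target PSMs whose estimated FDR stays in bounds.

def filter_at_fdr(psms, max_fdr=0.01):
    """psms: list of (score, is_decoy) pairs, higher score = better match.
    Returns the scores of target PSMs passing the FDR threshold."""
    targets = decoys = 0
    accepted, kept = [], []
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            accepted.append(score)
        if decoys / max(targets, 1) <= max_fdr:
            kept = list(accepted)
    return kept

# toy example with a loose 50% threshold so the effect is visible
example = [(9.1, False), (8.7, False), (8.2, True), (5.0, True), (7.9, False)]
print(filter_at_fdr(example, max_fdr=0.5))   # prints [9.1, 8.7, 7.9]
```

At a strict threshold, the first decoy hit near the top of the ranking truncates the accepted list early, which is exactly why deep, well-calibrated scoring matters for PSM counts.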
A summary of the results is shown below. First, the number of peptide-spectrum matches (PSMs) varied widely among research groups: the highest was >7,000 PSMs and the lowest <2,500. In total, 13 different search engines were used across the groups, and some groups used multiple search engines.
Many participants used Mascot as their search engine, yet their performances varied widely, most likely because of how each group handled variable modifications. There are several approaches to determining post-translational modifications (PTMs). If you allow all possible modifications, the search space becomes too large and you will likely end up with fewer identifications.
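The search-space explosion is easy to quantify: if each modifiable residue can independently carry one of m alternative modifications (or none), a peptide with s candidate sites has (m + 1)^s forms to score. A small sketch with purely illustrative numbers:

```python
# Combinatorics of variable modifications: each candidate site is either
# unmodified or carries one of `mods_per_site` modifications, and sites
# are treated as independent. Numbers below are hypothetical.

def n_peptide_forms(sites, mods_per_site=1):
    """Number of modified forms of one peptide under independent sites."""
    return (mods_per_site + 1) ** sites

print(n_peptide_forms(4))                    # 16 forms instead of 1
print(n_peptide_forms(4, mods_per_site=3))   # 256 forms
print(n_peptide_forms(8, mods_per_site=3))   # 65536 forms for one peptide
```

Multiplied over tens of thousands of database peptides, this is why unrestricted modification searches dilute scores and yield fewer confident identifications.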
A multi-pass search is another way to determine PTMs: first search with a few common modifications, then search the unmatched spectra with less common ones. However, the authors noted that the multi-pass strategy made it hard for participants to determine a proper FDR and required more manual examination. Different search engines also handle certain modifications more efficiently than others.
Unfortunately, there is no known correct answer for the identifications from the yeast lysate tryptic digests. The authors therefore generated “consensus” PSMs from all PSMs that were identified by at least three groups; blue bars indicate the number of consensus PSMs identified by each participant. It is clearly difficult to assess who the best performer is without truly “correct” answers, and the consensus set likely over-represents PSMs from search engines with similar algorithms. The gray and yellow bars mark participants who used less common search engines. Interestingly, though, the highest number of PSMs was obtained with a single search engine. This suggests that many factors affect the final results, and that optimization matters more than which program is used.
I wish they had also included spiked controls without modifications, to see how each participant optimized their searches. Nevertheless, it is quite interesting to see how each participant performed in identifying these control peptides. Participants 11211 and 58409 were top performers in total PSMs, but they did not do well in identifying the spiked controls. Localization of modifications, it seems, is still a difficult task.
In any case, even the best performer missed ~1,500 consensus PSMs (roughly 20% of the original spectra), and the authors note that there is quite a bit of room for each group to improve its approach.
The original data, including the spectrum and database files, can be downloaded here (use the login and password provided in the text). Why not try your own search and see if you can beat these expert participants?