Determining Unknown Peptide Identification & Modification: Who Is the Best?
The Proteome Informatics Research Group (iPRG) held a competition in peptide identification and modification analysis among experts in the proteomics and bioinformatics fields. The results of the 2012 competition were recently published in Molecular & Cellular Proteomics (MCP).
First, 24 participants were given ~18,000 spectra and protein sequence databases with 42K entries. Decoy sequences, generated by scrambling the amino acid sequences of the true databases, were also provided. The spectra were generated from tryptic digests of yeast lysate plus 70 spike controls carrying various modifications.
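To make the decoy idea concrete, here is a minimal sketch of building scrambled decoys. It assumes a plain random shuffle of each protein sequence; the exact scrambling procedure used for the study is not described here, and the function names and the `DECOY_` prefix are illustrative only.

```python
import random


def scramble_decoy(sequence, seed=0):
    """Return a shuffled copy of a protein sequence.

    The decoy keeps the same length and amino acid composition as the
    target, but the residue order is randomized.
    """
    rng = random.Random(seed)  # seeded so the decoy database is reproducible
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def build_decoy_database(targets, prefix="DECOY_"):
    """Map each target accession to a prefixed, scrambled decoy entry.

    `targets` is a dict of accession -> sequence (an assumed layout).
    """
    return {
        prefix + accession: scramble_decoy(sequence, seed=index)
        for index, (accession, sequence) in enumerate(targets.items())
    }
```

Searching targets and decoys together lets any match to a `DECOY_` entry be counted as a known false hit.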
Each participant was encouraged to use whatever methods they liked to identify as many CID spectra as possible at a <1% false discovery rate (FDR). For modified peptides, participants were required to report the types of modifications and their localizations. However, the possible modifications were not named or offered as choices; participants were only told that a wide variety of modifications, both biological and chemical in nature, were present in the samples.
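As a rough illustration of what "<1% FDR" means in a target-decoy search, the sketch below estimates FDR as decoy hits divided by target hits above a score cutoff. This is one common estimate, not necessarily the one any participant used, and the `(score, is_decoy)` input layout and function name are assumptions.

```python
def score_cutoff_at_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    `psms` is a list of (score, is_decoy) pairs, higher score = better.
    FDR at a cutoff is estimated as decoys / targets among PSMs scoring
    at or above it. Returns None if no cutoff satisfies the limit.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for i, (score, is_decoy) in enumerate(ranked):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Only evaluate once every PSM sharing this score has been counted.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie and targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff
```

Every target PSM scoring at or above the returned cutoff would then be reported; a looser `max_fdr` admits more PSMs at the cost of more expected false matches.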
The summary of the results is shown below. First, the number of peptide-spectrum matches (PSMs) varied widely among research groups. The highest PSM count was >7000 and the lowest was <2500. A total of 13 different search engines were used, and some groups used multiple search engines.
Many participants used Mascot as their search engine, but their performance varied widely. This is likely due to how each group handled variable modifications. There are several approaches to determining post-translational modifications (PTMs). If you allow all possible modifications, the search space becomes too big and you will likely end up with fewer identifications.
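A small back-of-the-envelope sketch shows how quickly the search space grows with variable modifications. It assumes every eligible residue can independently be modified or left alone, with no cap on modifications per peptide (real search engines usually impose one); the modification lists are illustrative, not the settings used by any participant.

```python
from math import prod


def count_modified_forms(peptide, variable_mods):
    """Count candidate forms of a peptide under variable modifications.

    `variable_mods` maps a residue letter to the list of modifications
    allowed on it. Each residue contributes (1 + number of allowed mods)
    choices: unmodified, or one of its modifications.
    """
    return prod(1 + len(variable_mods.get(residue, [])) for residue in peptide)


few_mods = {"M": ["Oxidation"]}
many_mods = {
    "M": ["Oxidation"],
    "S": ["Phospho"],
    "T": ["Phospho"],
    "K": ["Acetyl", "Methyl"],
}

print(count_modified_forms("MSTK", few_mods))   # 2
print(count_modified_forms("MSTK", many_mods))  # 24
```

Even for a four-residue peptide, the candidate count rises from 2 to 24, and every extra candidate is another chance for a random high-scoring match, which pushes up the score needed to stay under the FDR limit.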
A multiple-path search is another way to determine PTMs: first search with a few common modifications, and then search the unmatched spectra with less common modifications. However, the authors noted that the multiple-path strategy made it hard for participants to determine a proper FDR and required more manual examination. Some search engines also handle certain modifications more efficiently than others.
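The two-step strategy described above can be sketched as follows. The `search` argument is a hypothetical callable standing in for whatever search engine is used; it is assumed to take a list of spectrum IDs plus a list of modifications and return a dict of matched spectrum ID to identification. Nothing here reflects a real search engine's API.

```python
def two_pass_search(spectra, search, common_mods, rare_mods):
    """Search with common modifications, then re-search the leftovers.

    `search(spectra, mods)` is a hypothetical function returning a dict
    of spectrum_id -> identification for the spectra it could match.
    Results from the two passes are returned separately so that each
    pass can be filtered and inspected on its own.
    """
    first_pass = search(spectra, common_mods)
    # Only spectra that found no match in pass one go to pass two.
    unmatched = [s for s in spectra if s not in first_pass]
    second_pass = search(unmatched, rare_mods)
    return first_pass, second_pass
```

Keeping the passes separate is a deliberate choice in this sketch: the second pass searches a different set of spectra against a different search space, so pooling its scores with the first pass would muddy any FDR estimate.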
Unfortunately, there is no known correct answer for the identifications from the yeast lysate tryptic digests. The authors therefore generated “consensus” PSMs: all PSMs that were identified by at least three groups. Blue bars indicate the number of consensus PSMs identified by each participant. It is difficult to assess who is the best performer without truly “correct” answers, and the consensus PSMs likely contain more PSMs from search engines with similar algorithms. If you look at the gray and yellow bars, those participants used less common search engines. However, it is interesting to see that the highest number of PSMs can be obtained from a single search engine. This means that many factors affect the final results, so optimization is the most critical part of using any of these programs.
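The consensus idea can be sketched in a few lines. This assumes two groups "agree" when they report exactly the same peptide string for the same spectrum; how the study actually judged agreement (for example, how it treated modification localization) is not covered here, and the data layout and function names are assumptions.

```python
from collections import Counter


def consensus_psms(submissions, min_groups=3):
    """Return the set of (spectrum_id, peptide) pairs reported by at
    least `min_groups` participants.

    `submissions` maps participant_id -> {spectrum_id: peptide}.
    """
    votes = Counter()
    for psms in submissions.values():
        for spectrum_id, peptide in psms.items():
            votes[(spectrum_id, peptide)] += 1
    return {pair for pair, count in votes.items() if count >= min_groups}


def consensus_count(psms, consensus):
    """Count how many of one participant's PSMs are in the consensus set."""
    return sum(1 for pair in psms.items() if pair in consensus)
```

Note the built-in bias this makes visible: a correct identification found by only one or two unusual search engines never enters the consensus set, so it cannot count in that participant's favor.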
I wish they had also added spike controls without modifications to see how each participant optimized searches. Nevertheless, it is quite interesting to see how each participant performed in identifying these control peptides. Participants 11211 and 58409 were top performers in total PSMs, but they didn’t do well in identifying spike controls. It seems that localization of modifications is still a difficult task.
In any case, even the best performer couldn’t identify ~1500 consensus PSMs (roughly 20% of the original spectra), and the authors note that there is quite a bit of room to improve each group’s approach.
The original data including spectrum and database files can be downloaded here (use login and password provided in the text). Why don’t you try your own search and see if you can beat these expert participants?