
Maximizing IDs by Combining Multiple Search Engines

I have posted a few times about using multiple search engines to increase PSMs (peptide-spectrum matches). There are quite a few search engines out there, but using all of them seems unreasonable. In this post I am going to discuss how to maximize IDs using multiple search programs. First, the search programs for mass spec can be categorized into three kinds.

1) Search against a protein sequence database
2) Search against spectral libraries
3) De novo sequencing

Well-known sequence search engines such as SEQUEST, X!tandem, MSGF+ and Myrimatch search against the theoretical fragmentation of peptides generated from a sequence database, so they belong to 1).
Sample prep for mass spec is time-consuming and costly, while computation is getting faster and faster. If you have access to a cloud computer or a cluster, you will get your searches done quickly.

HOW MANY SEARCH PROGRAMS SHOULD WE USE?

The short answer is: it depends on your computational power. If you have unlimited computer resources, you can use as many search engines as you want. According to David Shteynberg’s paper (MCP 12.9, 2013, 2383-2393), the more engines you add, the more IDs you get, even at a strict false discovery rate. They tested combinations of multiple engines including SEQUEST, Inspect, X!tandem, MASCOT, Myrimatch and OMSSA. Many people are familiar with these search programs. Their results show that you get the maximum number of IDs when you search with all of these programs. Since you may not have the computational resources to perform 6 searches per mass spec sample, you may want to know the individual performance. If you use only one program, the ranking from best to worst performer is

1) SEQUEST
2) Myrimatch
3) X!tandem
4) OMSSA
5) Inspect

I believe the results vary with different samples and parameters (e.g. modifications), so one should be cautious about which ones to choose. For example, you can specify the precursor ion tolerance asymmetrically for X!tandem (e.g. -0.5 and +2.0 m/z), and it will give better results than a symmetric error tolerance. Some programs don’t allow such an option (e.g. Myrimatch). Nevertheless, the ranking above is somewhat similar to what I have experienced too. I routinely use MSGF+, and it usually performs better than most programs at a similar FDR. That’s why my current default search is with MSGF+, Myrimatch and X!tandem. Anyway, if you want to use two programs from the list, 1) + 2) works the best, as expected. For three programs, 1) + 2) + 3), 1) + 2) + 4), 2) + 3) + 4) and 1) + 2) + 5) perform similarly.
[Figure: comparison of multiple search engine combinations. Source: Shteynberg et al. (MCP, 2013)]

It is interesting to note that although SEQUEST and X!tandem each perform well on their own, combining them didn’t do so well. In fact, InSpect is the worst performer by itself, but if you combine InSpect with SEQUEST, they perform pretty decently. The authors mentioned that two programs with similar algorithms, such as SEQUEST and X!tandem, don’t necessarily perform better together than two programs with more different algorithms.

SPECTRAL LIBRARY SEARCH PROGRAM SHOULD BE INCLUDED IF YOU CAN

Spectral library search is very different from database search in terms of algorithm, and it is very sensitive because it compares your spectrum to real spectra obtained by mass spectrometry. Database search programs create peptide sequences based on enzyme specificity (normally trypsin) and generate artificial spectra (b- and y-ions). If the precursor ion m/z is within the error tolerance and the artificially generated spectrum matches your MS/MS spectrum, you get an ID. The fragmentation pattern may look quite different in real life, and if that is the case, you don’t get an ID. Unfortunately, fragmentation depends on the type of instrument (ion trap/collision cell) and the fragmentation method (CID, PQD, ETD, HCD), so a spectral library works best when it was acquired under comparable conditions. If you have phosphopeptide-enriched samples, a library search may not work well unless the library contains such spectra. If you go to the National Institute of Standards and Technology (NIST) website, there are MS/MS spectral libraries for certain instruments and species.
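To make the database search idea above a bit more concrete, here is a minimal Python sketch of how an artificial spectrum can be generated for one unmodified peptide and compared with an observed one. This is not how any particular search engine scores spectra; the function names, the default tolerances and the simple "fraction of matched fragments" measure are my own assumptions for illustration. The sign convention for the asymmetric precursor window (observed minus theoretical) is also an assumption.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mz(peptide, charge=1):
    """m/z of the intact, unmodified peptide at the given charge."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def by_ions(peptide):
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b, y


def precursor_ok(observed_mz, theoretical_mz, minus=0.5, plus=0.5):
    """True if (observed - theoretical) lies within [-minus, +plus] m/z.

    Passing different values for minus and plus gives an asymmetric window.
    """
    delta = observed_mz - theoretical_mz
    return -minus <= delta <= plus


def matched_fraction(observed_peaks, theoretical, tol=0.5):
    """Fraction of theoretical fragments with an observed peak within tol."""
    hits = sum(any(abs(p - t) <= tol for p in observed_peaks)
               for t in theoretical)
    return hits / len(theoretical)
```

For example, `by_ions("PEPTIDE")` returns six b-ions and six y-ions, and `precursor_ok(obs, theo, minus=0.5, plus=2.0)` checks an asymmetric window like the -0.5/+2.0 m/z example. A spectral library search skips the `by_ions` step and compares against a previously observed spectrum instead.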

[Image: list of NIST MS/MS spectral libraries]

The list is pretty short at the moment, but I believe it will grow in the future. There are other websites that host spectral libraries, such as PeptideAtlas and X!Hunter.

In Shteynberg’s paper, they compared SpectraST, a spectral library search program, with the 6 search engines combined. Surprisingly, the SpectraST search (with a human spectral library) gives quite a few more IDs than the 6 programs combined (15% more). In the end, when they combined SpectraST with the 6 search programs, they got even more IDs (25% more than the 6 search programs combined).

Bottom line

One can increase the number of high-confidence IDs by combining multiple search engines. The number of programs to use depends on the computational resources available. If the instrument and species match those of an available spectral library, one should also consider a spectral library search, as it will likely increase the number of IDs.

Power of Parallel Computation-cont’d

The paper published by Ham et al. in the Journal of Proteome Research in 2008 showed how many replicates you have to run to find all proteins in a sample. They ran 10 technical replicates of moderately complex peptide samples from a Shewanella extract and analyzed them with an LTQ-Orbitrap and an 11T FTICR. When I read it the first time, I was surprised at the results. It took them at least 6 technical replicates to identify most of the detectable proteins (>95%). They identified a total of >8000 unique peptides, but each dataset had ~4000 peptides. That is, a single mass spec run identifies only about 50% of the identifiable peptides. Even if you run duplicates, that is still only 60-70%. Even though new MS instruments are faster and more sensitive, people also want to analyze far more complex samples. What this means is that we are merely scratching the surface of the entire proteome of higher organisms.
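If you want to check this kind of saturation on your own replicates, a small Python sketch like the one below would do. The peptide IDs are made-up toy data, not values from the paper, and `cumulative_coverage` is just a name I picked. Note that "100%" here only means the union of the runs you actually made, not the true content of the sample.

```python
def cumulative_coverage(runs):
    """Fraction of all identified peptides seen after 1, 2, ... n runs.

    `runs` is a list of sets of peptide identifiers, one set per replicate.
    """
    total = set().union(*runs)
    seen = set()
    fractions = []
    for run in runs:
        seen |= run
        fractions.append(len(seen) / len(total))
    return fractions


# Toy data: hypothetical peptide IDs from three technical replicates.
runs = [
    {"p1", "p2", "p3", "p4"},
    {"p3", "p4", "p5", "p6"},
    {"p1", "p5", "p7", "p8"},
]
print(cumulative_coverage(runs))
```

With this toy data, each run finds half of the eight peptides, and it takes all three runs to see them all.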


If I want to identify as many peptides as possible, running replicates is definitely not the best way. The first run can be a normal run as in this paper, but the second and third runs should focus on the ions that were not selected in the previous runs. I thought about this idea long ago, but my instrument, which runs on Xcalibur, cannot take a parent ion exclusion list of more than 100 entries. It was also a little cumbersome to exclude even just 100 ions. However, it wouldn’t be so difficult to create an exclusion list from the first run. Let’s say you exclude the 90% most intense ions. To successfully exclude the ions found in the first run, though, the software needs to be able to handle the slight change in retention time in the following run. This could be tough because even if you have spiked-in controls to adjust retention times, these ions have to be found first. I wonder how it would work if the second run focused only on low-abundance ions and ignored signals above a certain level. My guess is that the problem is that too many ions come off the column at the same time in some parts of the gradient and the machine simply cannot pick them all up.
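Just to sketch the exclusion-list idea in Python: take the precursors selected in the first run, keep the most intense 90%, and in the next run skip any candidate that matches one of them within an m/z tolerance and a retention time window. Everything here is hypothetical (the `Precursor` class, the tolerances of 0.01 m/z and 1 minute, the toy ions), and it says nothing about what the instrument software can actually accept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Precursor:
    mz: float         # precursor m/z
    rt: float         # retention time in minutes
    intensity: float  # precursor intensity (arbitrary units)


def build_exclusion_list(first_run, fraction=0.9):
    """Return the `fraction` most intense precursors from the first run."""
    ranked = sorted(first_run, key=lambda p: p.intensity, reverse=True)
    keep = round(len(ranked) * fraction)
    return ranked[:keep]


def is_excluded(candidate, exclusion, mz_tol=0.01, rt_tol=1.0):
    """True if the candidate matches an excluded ion in both m/z and RT.

    rt_tol is what absorbs the slight retention time shift between runs.
    """
    return any(
        abs(candidate.mz - e.mz) <= mz_tol and abs(candidate.rt - e.rt) <= rt_tol
        for e in exclusion
    )


# Toy first run: ten hypothetical ions with increasing intensity.
first_run = [Precursor(mz=400.0 + i, rt=10.0 + i, intensity=float(i + 1))
             for i in range(10)]
exclusion = build_exclusion_list(first_run)
```

A wider `rt_tol` tolerates more retention time drift but also throws away more co-eluting ions with a similar m/z, so there is a trade-off.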

In any case, 6 technical replicates for every complex sample is too costly. Even though duplicates are a minimum requirement, ideally triplicates or more are necessary. Maybe multiplexing of multiple samples and triplicate runs would be the way to go in the future.

The number of protein identifications does not depend only on technical replicates. It also depends on which search engine you use. If you use search engines with different algorithms, the IDs differ quite dramatically. Well-known programs such as X!tandem, SEQUEST and MASCOT give quite different peptide/protein lists. For example, David Tabb’s group published a paper in 2007 describing the new search engine MyriMatch. It identifies more proteins than X!tandem and SEQUEST, but each search engine pair has only a 60-70% overlap. I have seen similar numbers with MSGF+. Because computational analysis is much cheaper than running mass spec instruments, it is wasteful not to search with at least several search algorithms before you move on.
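Here is a small Python sketch for checking the overlap between search engines on your own results. I define overlap as shared IDs divided by the union of both lists, which is my assumption and may not be how the paper calculated it. The peptide lists are toy data. Also keep in mind that a plain union of IDs does not control the false discovery rate of the combined list.

```python
from itertools import combinations


def pairwise_overlap(ids_by_engine):
    """Shared IDs as a fraction of the union, for each pair of engines."""
    result = {}
    for a, b in combinations(sorted(ids_by_engine), 2):
        union = ids_by_engine[a] | ids_by_engine[b]
        shared = ids_by_engine[a] & ids_by_engine[b]
        result[(a, b)] = len(shared) / len(union) if union else 0.0
    return result


def combined_ids(ids_by_engine):
    """All IDs found by at least one engine (no FDR control applied)."""
    return set().union(*ids_by_engine.values())


# Toy data: hypothetical peptide IDs per engine.
ids = {
    "Myrimatch": {"A", "B", "C", "D"},
    "X!tandem": {"B", "C", "D", "E"},
    "MSGF+": {"A", "C", "D", "F"},
}
for pair, frac in pairwise_overlap(ids).items():
    print(pair, round(frac, 2))
print(len(combined_ids(ids)), "IDs in total")
```

In this toy example each engine reports four peptides, but together they cover six.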


In the previous post, I described the power of parallel computing. If you have unlimited computing resources, most searches are done pretty quickly. Here I am going to implement three batch searches for each mass spec sample using X!tandem, MSGF+ and Myrimatch. All three search programs are freely available for Linux (and Windows). I have access to a computer cluster running Linux, so all scripts described here will be shell scripts.

I downloaded MSGF+ from here. I modified the batch search script from X!tandem.

[Image: screenshot of the batch search script]

For Myrimatch, the program I downloaded from Tabb’s website gave a segmentation fault when I executed it, so I used a different version downloaded from here. The batch search for Myrimatch looks like this:

[Image: screenshot of the Myrimatch batch search script]

To run all three search engines together, you can simply put all three scripts in one file but you probably have to change the file path for each search program.

[Image: screenshot of the combined Myrimatch, X!tandem and MSGF+ script]

In this script, you need to have the X!tandem configuration files (default_input.xml, input.xml and taxonomy.xml), the Myrimatch configuration file (myrimatch1.cfg) and the MSGF+ configuration file (Mods.txt) in the current directory. In the future, I will modify the program so that you don’t have to change these configuration files before you start searches.
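Since the scripts themselves only survive as images, here is a rough Python sketch (not the original shell script) of how the three searches could be launched for one sample using the configuration files listed above. The executable names, the jar name and the command-line flags are written from memory and should be treated as assumptions: check them against the help output of the versions you installed. X!tandem reads the spectrum path from input.xml, so that file still has to point at the right sample.

```python
from pathlib import Path
import subprocess


def build_commands(sample, fasta, msgf_jar="MSGFPlus.jar"):
    """Return one command line (as an argument list) per search engine.

    Flags are assumptions to verify against each tool's own documentation.
    """
    out_mzid = Path(sample).stem + ".mzid"
    return {
        # X!tandem takes its settings, including the spectrum file, from input.xml.
        "xtandem": ["tandem.exe", "input.xml"],
        "myrimatch": ["myrimatch", "-cfg", "myrimatch1.cfg",
                      "-ProteinDatabase", fasta, sample],
        "msgfplus": ["java", "-jar", msgf_jar, "-s", sample, "-d", fasta,
                     "-mod", "Mods.txt", "-o", out_mzid],
    }


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for engine, cmd in commands.items():
        print(engine + ":", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical file names; the dry run only prints the commands.
run_all(build_commands("sample1.mzML", "database.fasta"))
```

Keeping the dry run as the default lets you inspect the three command lines before anything is submitted to the cluster.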
