I have posted a few times about using multiple search engines to increase PSMs (peptide spectrum matches). There are quite a few search engines out there, but using all of them is unreasonable. In this post I am going to discuss how to maximize IDs using multiple search programs. First, search programs for mass spec can be categorized into three kinds.
1) Search against protein sequence database
2) Search against spectra libraries
3) De novo sequencing
Well-known sequence search engines such as SEQUEST, X!tandem, MSGF+ and Myrimatch are based on searching against theoretical fragmentations of peptides generated from a sequence database, so they belong to 1).
Sample prep for mass spec is time-consuming and costly, while computation is getting faster and faster. If you have access to a cloud computer or cluster, you can get your searches done quickly.
HOW MANY SEARCH PROGRAMS SHOULD WE USE?
The short answer is… it depends on your computational power. If you have unlimited computing resources, you can use as many search engines as you want. According to David Shteynberg’s paper (MCP 12.9, 2013, 2383-2393), the more engines you add, the more IDs you get, even at a strict false discovery rate. They tested combinations of multiple engines, including SEQUEST, InsPecT, X!tandem, MASCOT, Myrimatch and OMSSA, programs many people are familiar with. Their results show that you get the maximum number of IDs when you search with all of these programs. Since you may not have the computational resources to perform six searches per mass spec sample, you may want to know the individual performance. If you use only one program, the ranking from best performer to worst is
I believe the results vary with different samples and parameters (e.g. modifications), so one should be cautious about which engines to choose. For example, you can specify the precursor ion tolerance asymmetrically for X!tandem (e.g. -0.5 and +2.0 m/z), which gives better results than a symmetric error tolerance. Some programs don’t allow such an option (e.g. Myrimatch). Nevertheless, the performance above is similar to what I have experienced. I routinely use MSGF+, and it usually performs better than most programs at a similar FDR. That’s why my current default is to search with MSGF+, Myrimatch and X!tandem. Anyway, if you want to use two programs from the list, 1) + 2) works best, as expected. For three programs, 1) + 2) + 3), 1) + 2) + 4), 2) + 3) + 4) and 1) + 2) + 5) perform similarly.
Shteynberg et al., (MCP, 2013)
It is interesting that although SEQUEST and X!tandem each perform well by themselves, combining them didn’t do so well. In fact, InsPecT is the worst performer by itself, but combined with SEQUEST they perform quite decently. The authors note that two programs with similar algorithms, such as SEQUEST and X!tandem, don’t necessarily perform better together than two programs with more different algorithms.
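As a concrete example of the asymmetric tolerance mentioned above, in X!tandem it is set in default_input.xml. The parameter labels below are X!tandem's own; the values correspond to the -0.5/+2.0 example (the minus value is given unsigned):

```xml
<!-- asymmetric precursor tolerance: -0.5 to +2.0 Da around the precursor -->
<note type="input" label="spectrum, parent monoisotopic mass error minus">0.5</note>
<note type="input" label="spectrum, parent monoisotopic mass error plus">2.0</note>
<note type="input" label="spectrum, parent monoisotopic mass error units">Daltons</note>
```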
SPECTRAL LIBRARY SEARCH PROGRAM SHOULD BE INCLUDED IF YOU CAN
Spectral library search is very different from database search in terms of algorithm, and it is very sensitive because it compares against real spectra obtained by mass spectrometry. Database search programs create peptide sequences based on enzyme specificity (normally trypsin) and generate artificial spectra (b- and y-ions). If the precursor ion m/z is within the error tolerance and the artificially generated spectrum matches your MS/MS spectrum, you get an ID. In real life the fragmentation pattern may look quite different, and if so, you don’t get the ID. Unfortunately, fragmentation depends on the type of instrument (ion trap/collision cell) and the fragmentation method (CID, PQD, ETD, HCD). Likewise, a spectral library search may not work well for phosphopeptide-enriched samples unless the library contains such spectra. The National Institute of Standards and Technology (NIST) website provides MS/MS spectral libraries for certain instruments and species.
In Shteynberg’s paper, they compared SpectraST, a spectral library search program, with the 6 search engines combined. Surprisingly, a SpectraST search alone (with the human spectral library) gave quite a few more IDs than the 6 programs combined (15% more). And when they combined SpectraST with the 6 search programs, they got even more IDs (25% more than the 6 search programs combined).
Bottom line
One can increase the number of confident IDs by combining multiple search engines; how many programs to use depends on the computational resources available. If your instrument and species match one of the available spectral libraries, you should also consider a spectral library search, as it will likely increase the number of IDs.
This post is going to be a short one. When you run an X!tandem search on Linux, you get output files with a .t.xml extension. This file can be opened on the GPM website to see the models, and it can also be opened in PeptideShaker. But if you want to look at the results in IDPicker, you need to convert the file to an xml file compatible with that software. The conversion utility is called Tandem2XML.exe, one of the tools in the TPP. If you have already installed the TPP on your PC, you can find the program in the following directory.
You can also download it from here (may not be fast). To use it on the command line, type
>Tandem2XML.exe [FILE_PATH\INPUT_FILE_NAME.t.xml] [FILE_PATH\OUTPUT_FILE_NAME.xml]
Once the file is converted to an xml file (pepXML), you can import it into IDPicker. This utility is simple to use, but typing the file path and name for multiple files is pretty cumbersome. So I wrote a script that finds all .t.xml files in a specified directory and automatically converts them to .xml.
>@echo off
>set msfd=C:\PATH\TO\TXML_FILES
>:: read file names from the directory and create a new file with all file names
>dir %msfd%\*.t.xml /a:-d /b>file_name.txt
>for /f "tokens=*" %%l in (file_name.txt) do Tandem2XML.exe %msfd%\%%l %msfd%\%%l.xml
You can save this script as AutoTandem2XML.bat and place Tandem2XML.exe in the same directory as the script.
To use it, just change the second line to the directory containing your .t.xml files. When you execute it, either by double-clicking or running it in the command line, it first creates a file, file_name.txt, containing the file names ending with .t.xml.
Then it converts each .t.xml to an IDPicker-compatible xml file in the same directory. You may see error messages saying the output files will not contain retention times, but the output files still work in IDPicker.
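If you run the TPP on Linux instead, the same batch conversion can be sketched in bash. This assumes a Tandem2XML binary is on your PATH; the .pep.xml output naming is my choice, not from the post:

```shell
#!/bin/bash
# convert_tandem: convert every .t.xml file in a directory to pepXML for IDPicker.
convert_tandem() {
    local dir="$1" f out
    for f in "$dir"/*.t.xml; do
        [ -e "$f" ] || continue           # no matches: nothing to do
        out="${f%.t.xml}.pep.xml"         # sample.t.xml -> sample.pep.xml
        Tandem2XML "$f" "$out"
    done
}
```

Calling `convert_tandem /path/to/tandem_results` then converts every .t.xml in that directory in one go.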
The paper published by Ham et al. in the Journal of Proteome Research in 2008 examined how many replicates you have to run to find all proteins in a sample. They ran 10 technical replicates of a moderately complex peptide sample from a Shewanella extract and analyzed them with an LTQ-Orbitrap and an 11T-FTICR. When I first read it, I was surprised at the results. It took at least 6 technical replicates to identify most detectable proteins (>95%). They identified a total of >8000 unique peptides, but each dataset had ~4000 peptides. That is, a single mass spec run identifies only 50% of identifiable peptides; even with duplicates, that is still only 60-70%. Even though new MS instruments are faster and more sensitive, people want to analyze far more complex samples. What this means is that we are merely scratching the surface of the entire proteome of higher organisms.
If I want to identify as many peptides as possible, running plain replicates is definitely not the best way. The first run can be a normal run, as in this paper, but the second and third runs should focus on the ions that were not selected in the previous runs. I thought about this idea long ago, but my instrument, which runs on Xcalibur, cannot take a parent ion exclusion list of more than 100 ions, and it was a little cumbersome even to exclude just 100. However, it wouldn’t be difficult to create an exclusion list from the first run; let’s say you exclude the 90% most intense ions. To successfully exclude the ions found in the first run, though, the software needs to handle the slight change in retention time in the following run. This could be tough, because even if you have spiked controls to adjust retention times, those ions have to be found first. I wonder how it would be if the second run only focused on low-abundance ions and ignored signals above a certain level. My guess is that the problem is that too many ions come off the column at the same time in some parts of the gradient, and the machine simply cannot pick them all up.
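To sketch the exclusion-list idea, suppose the first run's precursors are exported to a hypothetical two-column file (m/z, then intensity, one ion per line). The file format and the 90% fraction are illustrative assumptions, not an Xcalibur feature:

```shell
#!/bin/bash
# build_exclusion_list: print the m/z values of the most intense ions from a
# first run, to exclude them in the next run.
build_exclusion_list() {
    # $1: file of "m/z intensity" pairs; $2: fraction to exclude (e.g. 0.9)
    local tsv="$1" frac="$2"
    sort -k2,2 -rn "$tsv" |                     # most intense ions first
      awk -v frac="$frac" \
        'NR==FNR { n++; next }                  # first read: count the ions
         FNR <= int(n*frac) { print $1 }        # second read: top fraction, m/z only
        ' "$tsv" -
}
```

For example, `build_exclusion_list run1.tsv 0.9 > exclusion_list.txt` would write the m/z values of the top 90% most intense ions.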
In any case, 6 technical replicates for every complex sample is too costly. Duplicates are a minimum requirement, but ideally triplicates or more are necessary. Maybe multiplexing multiple samples and running triplicates will be the way to go in the future.
The number of protein identifications does not depend only on technical replicates. It also depends on which search engine you use: search engines with different algorithms give quite dramatically different IDs, and well-known programs such as X! tandem, SEQUEST and MASCOT produce quite different peptide/protein lists. For example, David Tabb’s group published a paper in 2007 describing the new search engine MyriMatch. It identifies more proteins than X!tandem and SEQUEST, but each search engine pair has only a 60-70% overlap. I have seen similar numbers with MSGFDB+. Because computational analysis is much cheaper than running mass spec instruments, it is wasteful not to search with at least several search algorithms before you move on.
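The pairwise overlap is easy to check yourself if you export one peptide list per engine (one sequence per line; the file names in the usage example are hypothetical):

```shell
#!/bin/bash
# peptide_overlap: report how many unique peptides two engines' lists share.
peptide_overlap() {
    # $1, $2: text files with one peptide sequence per line
    local ta tb shared total
    ta=$(mktemp); tb=$(mktemp)
    sort -u "$1" > "$ta"                            # comm needs sorted input
    sort -u "$2" > "$tb"
    shared=$(( $(comm -12 "$ta" "$tb" | wc -l) ))   # peptides found by both
    total=$(( $(sort -u "$ta" "$tb" | wc -l) ))     # union of both lists
    rm -f "$ta" "$tb"
    echo "$shared of $total unique peptides shared"
}
```

Running `peptide_overlap msgf_peptides.txt myrimatch_peptides.txt` then prints the shared count against the union.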
In the previous post, I described the power of parallel computing. If you have unlimited computing resources, most searches finish pretty quickly. Here I am going to implement three batch searches for each mass spec sample using X! tandem, MSGFDB+ and Myrimatch. All of these search programs are freely available for Linux (and Windows), and I have access to a computer cluster running Linux, so all scripts described here will be shell scripts.
I downloaded MSGF+ from here. I modified the batch search program from the X!tandem post.
For Myrimatch, the program I downloaded from Tabb’s website gave a segmentation fault when I executed it, so I used a different version downloaded from here. The batch search for Myrimatch looks like this:
To run all three search engines together, you can simply put all three scripts in one file, but you will probably have to change the file path for each search program.
For this script, you need the X! tandem configuration files (default_input.xml, input.xml and taxonomy.xml), the Myrimatch configuration file (myrimatch1.cfg) and the MSGF+ configuration file (Mods.txt) in the current directory. In the future, I will modify the program so that you don’t have to change these configuration files before starting searches.
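A minimal per-file wrapper might look like the following. The MSGF+ flags (-s, -d, -mod, -o) are that program's documented ones; the tandem.exe and myrimatch invocations use the configuration files named above, and the paths and memory setting are assumptions to adapt:

```shell
#!/bin/bash
# run_three_searches: run X!tandem, MSGF+ and Myrimatch on one spectrum file.
run_three_searches() {
    # $1: spectrum file (.mgf); $2: FASTA database (both placeholders here)
    local mgf="$1" db="$2"
    ./tandem.exe input.xml                            # input.xml must point at $mgf
    java -Xmx3500M -jar MSGFPlus.jar -s "$mgf" -d "$db" \
         -mod Mods.txt -o "${mgf%.mgf}.mzid"          # MSGF+ search
    myrimatch -cfg myrimatch1.cfg "$mgf"              # Myrimatch with its .cfg
}
```

Looping `run_three_searches` over every .mgf in a directory gives the combined three-engine batch.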
The average person may think computers get faster every year. That is true, but in reality the computer itself is not getting much faster. The speed of computation depends on the clock speed (among other things), but clock speed cannot go too high (something like 3-4 GHz); above that, the chip simply uses too much electricity and generates too much heat. If you look around, new computers don’t have much higher clock speeds than those of 10 years ago. The major difference is the use of multi-core CPU chips: your computer may be equipped with a dual, quad or octa core CPU.
Supercomputers and large computer clusters have thousands or tens of thousands of CPU cores. An individual CPU may not be much faster than your computer’s, but if you have a large task, you can divide the work among many CPUs to finish quickly. Memory usage is also a key part of fast computation: a big chunk of computation time can be spent just transferring data back and forth between levels of memory.
There are four types of memory in a computer.
L1-L2 cache is a small memory on the chip with the CPU. It is extremely fast, but it can hold only 32-256 KB. The next one is DRAM, which is what people usually mean by the computer’s memory. It is fast, but still expensive compared with other storage. Solid state drives (SSD) and disks have much larger capacity and are cheap, but their access speed is slower.
To maximize computational capacity, one can control memory usage when writing a program. Mass spec search engines are capable of using multi-core CPUs, and if you have access to a cloud computer (e.g. Amazon) or a large computer cluster, you can write a script to submit a bunch of batch searches.
In the previous posts, I described a way to automate X!tandem searches. However, that method runs the tandem searches sequentially. If you have a limited number of cores, this is fine, but if you have many cores available, you can submit each search as one batch job. Then, as soon as CPUs are available, the next search starts. All you need to do is change one line of code in the program.
To submit a job on a Linux cluster (with a scheduler such as Sun Grid Engine), the command you need is qsub.
In the previous X!tandem search on Linux post, I explained how to run an automated X!tandem search. In line 6,
We need to change it to
qsub -l i,h_rt=4:00:00 -cwd -b y ./tandem.exe input2.xml
Because all searches are submitted individually, if there are enough nodes available, the total search time for all samples is essentially the time of the single search that takes longest. Imagine you have 20 mass spec samples and search the database with three search engines: that is 20×3=60 searches. Done sequentially, it would take 60× the time of one search, assuming each search engine spends equal time (in reality, some search engines are much faster than others). The power of parallel computing shows when you can divide your job into many small jobs and have enough CPUs available.
Program bug-fix on 08/21/13
There is a bug in the batch searches using the qsub command. If there are enough nodes to run all batches immediately, the current program works just fine. However, if available nodes are limited and some batch searches are queued, it will search the same mgf file over and over. To fix this, the program needs to generate an input.xml file for each search. This is very simple to do, so I have added the fix here.
Basically, I added a numeric variable $n. After each batch search is submitted, n is increased by 1, and a new input.xml file is created with a numeric postfix.
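Sketched as a function, the fixed loop could look like this. The input_template.xml file with a SPECTRUM_PATH placeholder is my device for generating per-job input files, and the qsub options mirror the line shown earlier; adapt both to your setup:

```shell
#!/bin/bash
# submit_tandem_jobs: submit one X!tandem batch job per spectrum file,
# each with its own numbered input file.
submit_tandem_jobs() {
    # $1: text file listing spectrum files, one per line (e.g. file_name.txt)
    local list="$1" n=0 mgf
    while read -r mgf; do
        # separate input file per job, so queued jobs don't share input.xml
        sed "s|SPECTRUM_PATH|$mgf|" input_template.xml > "input$n.xml"
        qsub -l i,h_rt=4:00:00 -cwd -b y ./tandem.exe "input$n.xml"
        n=$((n+1))
    done < "$list"
}
```

Each queued job now reads its own input$n.xml, so late-starting jobs no longer pick up a file that has since been overwritten.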
In the previous post, I explained method #2 for batch searching with X! tandem. In that method, you still have to type the file path and file name for every ms/ms data file. It takes only a single command to upgrade to method #3, in which the program grabs all spectrum files (such as .mgf) in a specified directory and creates a file containing their paths and names.
On Linux, the find command is used to look for files matching a specific name in a specified directory. In this case, you want to find only the spectrum files in the directory. The syntax of the command is
find where-to-look condition what-to-do
Let’s specify where-to-look with the variable $spec_folder, set to the directory holding your spectrum files.
The condition is to find files (not directories) with the .mgf extension. Quoting the pattern keeps the shell from expanding it before find sees it.
-type f -name "*.mgf"
You may not want to look in subdirectories. If so, add -maxdepth 1.
Finally, you write the file names to file_name.txt using ">". So the final script looks like
find "$spec_folder" -maxdepth 1 -type f -name "*.mgf" > file_name.txt
In summary to run batch X!tandem search on linux,
1) Put all spectrum files in one directory
2) Edit the first line of the program to specify the directory
3) Edit default_input.xml (if you want to change modifications, error tolerance etc)
4) Edit input.xml (if you want to change species)
5) Edit taxonomy.xml (if you need to add databases)
6) Execute the program (make sure it is executable and place in the same directory as tandem.exe)
Did it work? I hope so!
Final code can be downloaded here.