Power of Parallel Computation-cont’d

The paper published by Ham et al. in 2008 Journal of Proteome Research showed how many replicates you have to run to find all proteins in the sample.  They actually ran 10 technical replicates of moderately complex peptide samples from Shewanella extract and analyzed them with LTQ-orbi and 11T-FCR. When I read it the first time, I was surprised at the results. It took them at least 6 technical replicates to identify most detectable proteins (>95%).  They identified a total of >8000 unique peptides but each dataset had ~4000 peptides. That is, a single mass spec run only identifies 50% of identifiable peptides. Even if you run duplicates, that is still 60-70%. Even though new ms instruments are faster and more sensitive, there are far more complex samples people want to analyze. What this means is that we are merely scratching the surface of the entire proteome of higher organisms.

Screen Shot 2013-08-22 at 10.09.59 PM

If I want to identify as many peptides as possible, running replicates is definitely not the best way. The first run can be the normal run as this paper, but second, and third run should focus on the ions that were not selected in the previous runs. I thought about this idea long ago, but my instrument, which runs on XCalibur, cannot have a parent ion exclusion list of more than 100. It was also a little cumbersome to implement to even exclude just 100 ions. However, it won’t be so difficult to create an exclusion list from the first run. Let’s say you exclude 90% of most intense ions. However, to successfully exclude the ions found in the first run, the software needs  to be able to handle the slight change in retention time in the following run. This could be tough because even if you have spike controls to adjust retention times, these ions have to be found first.  I wonder how it will be if the second run only focuses on low-abundance ions and ignores signals above certain levels.  I can guess the problem is that too many ions are coming off the column at the same time in some part of the gradient and the machine simply cannot pick them all up.

In any case, 6 technical replicates for every complex sample is too costly. Even though duplicates are a minimum requirement, ideally triplicates or more are necessary. Maybe multiplexing of multiple samples and triplicate runs would be the way to go in the future.

The number of protein identifications does not only depend on technical replicates.  It also depends on which search engine you use.  If you use search engines with different algorithms, the IDs differ quite dramatically. Well known programs such as X! tandem, SEQUEST and MASCOT give quite different peptide/protein list. For example, David Tabb’s group published a paper in 2007 describing the new search engine MyriMatch. It identifies more proteins than X!tandem and Sequest. But each search engine pair has only a 60~70% overlap. I have seen similar number with MSGFDB+.  Because computational analysis is much cheaper than running mass spec instruments, it is wasteful not to search with at lease several search algorithms before you move on.

Screen Shot 2013-08-22 at 10.52.23 PM

In the previous post, I described the power of parallel computing. If you have unlimited computing resources, most of searches are done pretty quickly.  Here I am going to implement three batch searches for each mass spec sample using X! tandem, MSGFDB+ and Myrimatch. All search programs are freely available in linux environment (windows also), but I have an access to computer cluster running on linux. All scripts described here will be shell scripts.

I downloaded MSGDF+ from here. I modified the batch search program from X!tandem.

Screen Shot 2013-08-22 at 11.15.52 PM

For Myrimatch, the program I downloaded from Tabb’s website gave a segmentation error when I executed it. So I used a different version downloaded from here. Batch search for myrimatch looks like this:

Screen Shot 2013-08-22 at 11.20.05 PM

To run all three search engines together, you can simply put all three scripts in one file but you probably have to change the file path for each search program.


In this script, you need to have X! tandem configuration files (default_input.xml, input.xml and taxonomy.xml), myrimatch configuration file (myrimatch1.cfg) and MSGF+ configuration file (Mods.txt) in current directory.  In the future, I will modify the program so that you don’t have to change these configuration files before you start searches.

About bioinfomagician

Bioinformatic Scientist @ UCLA

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: