The Power of Parallel Computation
The average person may think computers are getting faster every year. That is true, but the processors themselves are not getting much faster. The speed of computation depends on the clock speed (among other things), but the clock speed cannot go much higher than roughly 3-4 GHz: above that, a chip simply draws too much electricity and generates too much heat. If you look around, new computers do not have much higher clock speeds than those from 10 years ago. The major difference is the use of multi-core CPU chips: your computer may be equipped with a dual-core, quad-core, or octa-core processor.
Supercomputers and large computer clusters have thousands or tens of thousands of CPU cores. An individual CPU may not be much faster than the one in your computer, but if you have a large task, you can divide the work among many CPUs to finish quickly. Memory usage is also a key part of fast computation: a big chunk of computation time can be spent just transferring data back and forth between levels of memory.
There are roughly four levels of memory in a computer.
L1 and L2 caches are small memories on the CPU chip itself. They are extremely fast, but they hold only tens of kilobytes to a few megabytes. The next level is DRAM, which is what people usually mean by a computer's "memory." It is fast, but still expensive compared to other storage. Solid state drives (SSD) and hard disks have much larger capacity and are cheap, but their access speed is slower.
To maximize computational capacity, you can control memory usage when you write a program. Mass spec search engines are capable of using multiple CPU cores. If you have access to a cloud computer (e.g., Amazon) or a large computer cluster, you can write a script to submit a bunch of batch searches.
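As a rough sketch, such a submission script can be a simple shell loop. The qsub options, spectrum file names, and per-search input file names below are assumptions for illustration, not taken from a real cluster setup; the loop only prints the commands instead of running qsub.

```shell
#!/bin/sh
# Hypothetical sketch: queue one batch search per spectrum file.
# Create two empty stand-in spectrum files so the loop has input.
: > sampleA.mgf
: > sampleB.mgf

n=0
for mgf in sampleA.mgf sampleB.mgf; do
  n=$((n + 1))
  # On a real cluster this line would run qsub; here we only print it.
  echo qsub -l h_rt=4:00:00 -cwd -b y ./tandem.exe "input$n.xml"
done
```

On a real system you would drop the `echo` so each iteration actually submits a job to the scheduler.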
In previous posts, I described how to automate an X!tandem search. However, that method runs the searches sequentially. If you have a limited number of cores, this is fine; but if you have many cores available, you can submit each search as its own batch job. That way, the next search starts as soon as CPUs become available. All you need to do is change one line of code in the program.
To submit a job on a Linux cluster, the command you need is qsub.
In the previous "X!tandem search on Linux" post, I explained how to run an automated X!tandem search. Line 6 needs to be changed to:
qsub -l i,h_rt=4:00:00 -cwd -b y ./tandem.exe input2.xml
Because all searches are submitted individually, if there are enough nodes available, the total search time is essentially the time of the single longest search. Imagine you have 20 mass spec samples and search the database with three search engines: that is 20 × 3 = 60 searches. If you run them sequentially, it takes 60 times as long, assuming each search takes equal time (in reality, some search engines are much faster than others). The power of parallel computing shows when you can divide your job into many small jobs and have enough CPUs available.
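To make the arithmetic concrete, here is a toy shell calculation (the per-search times are made up) comparing sequential wall time, which is the sum of all search times, with parallel wall time, which with enough nodes is only the longest single search:

```shell
#!/bin/sh
# Toy illustration: sequential wall time is the sum of all search times;
# parallel wall time (with enough nodes) is only the longest one.
times="30 45 60 25"   # made-up search times in minutes

sum=0
max=0
for t in $times; do
  sum=$((sum + t))
  if [ "$t" -gt "$max" ]; then max=$t; fi
done
echo "sequential: $sum min, parallel: $max min"
# prints: sequential: 160 min, parallel: 60 min
```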
Program bug fix on 08/21/13
There is a bug in the batch searches using the qsub command. If there are enough nodes to run all batches immediately, the current program works just fine. However, if available nodes are limited and some batch searches are queued, the program searches the same mgf file over and over. To fix this, the program needs to generate a separate input.xml file for each search. This is very simple to do, so I add the fixes here.
Basically, I add a numeric variable $n. After each batch search is submitted, n is increased by 1, and a new input.xml file is created with a numeric postfix (input1.xml, input2.xml, and so on).
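A minimal shell sketch of that fix is below. The placeholder name SPECTRUM_FILE, the stand-in template, and the mgf file names are assumptions for illustration; a real input.xml holds the full X!tandem parameter set.

```shell
#!/bin/sh
# Sketch of the numbered-input fix: write one input file per .mgf so
# queued batches do not all read the same input.xml.

# Stand-in template with a placeholder for the spectrum path.
printf '<note label="spectrum, path">SPECTRUM_FILE</note>\n' > input.xml
: > run1.mgf
: > run2.mgf

n=0
for mgf in run1.mgf run2.mgf; do
  n=$((n + 1))
  # Create input1.xml, input2.xml, ... each pointing at its own mgf file.
  sed "s|SPECTRUM_FILE|$mgf|" input.xml > "input$n.xml"
  # Each search would then be submitted with its own input file:
  # qsub -l h_rt=4:00:00 -cwd -b y ./tandem.exe "input$n.xml"
done
```

Because every queued batch now reads its own numbered input file, the scheduler can start jobs whenever nodes free up without two searches ever sharing an input.xml.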