Archive | August 2013

A Solution to Transferring Extremely Large Files

Bandwidth is the communication speed (bit rate) at which you can move data to and from computer resources. It is usually expressed in bit/s, kbit/s, Mbit/s or Gbit/s (note that these are bits, not bytes: 8 bits = 1 byte). If you have home internet access, you usually pay for the speed. In the case of Time Warner Cable, you pay $19 per month for 1 Mbps, $29 for 3 Mbps, $34 for 15 Mbps, and so on. One Mbps is one megabit per second, which is 125 kbyte/sec, and this usually defines the maximum transfer speed of your data.

Since I work at a university, I have a pretty fast connection (100 Mbps = 12.5 Mbyte per sec). But I hardly ever get such speed for any transfer. For example, if I use SCP, the transfer rate is usually 80-200 kbyte per sec, which is about 1-2% of the maximum. Even with the Globus Online service, I get about 1-1.5 Mbyte per second. At that rate, transferring a 1 GB file takes about 10 min. Not so bad, but transferring 100 GB would take 16+ hrs. Whether that is tolerable depends on your situation. Certainly, if you want to move 1 TB, it would take 160 hrs = 6.7 days, and that assumes the transfer completes flawlessly. In that case you would obviously rather send a 1 TB drive by FedEx overnight than use Globus Online.
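If you want to redo this arithmetic for your own connection, a quick back-of-envelope calculation in the shell looks like this (the rates are illustrative, not measurements):

# hours to move 100 GB at a sustained 1.5 Mbyte/s (roughly the Globus Online rate above)
echo "scale=1; 100*10^9 / (1.5*10^6) / 3600" | bc
# ~18.5 hours; at the full 12.5 Mbyte/s of a 100 Mbps line it would be about 2.2 hours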

Why can we only use a fraction of the bandwidth for data transfer?

The problem with using FTP or HTTP for data transfer is that you are relying on TCP (Transmission Control Protocol). TCP (or TCP/IP) provides reliable, ordered, error-checked delivery over the network, but it is a very slow protocol for bulk transfer. There is an inherent problem with moving data over long distances using TCP: packets are lost more frequently, and the transfer slows down as a result.

The sender of a TCP packet has to receive an acknowledgement from the receiver before it sends more data. When the acknowledgement is not received, the sender slows down the transfer to avoid congestion (even when there is no congestion).

Solution to the slow transfer

Aspera is a company that provides a solution to the bottleneck of using TCP for transfer. Their protocol, called fasp, is a new large-data transfer protocol whose throughput is largely independent of network delay and packet loss, even over very long distances (e.g. between continents). Even at 10% packet loss, it is claimed to achieve 90% utilization of the bandwidth with minimal redundant data transfer.

It is really fast

They claim the transfer speed is up to 1000 times that of standard FTP. The benchmark from their website is shown below.

[Image: Aspera benchmark chart]

You can see that, with their program, you can transfer large files at blazing speed. With the fastest bandwidth shown, a 100 GB file transfer takes only 1.4 min. Wow!!

Is it really true?

I tried downloading a few files using the Aspera program. Using their program to send your own files is not free, but downloading from a site that uses Aspera can be free (if the site doesn't charge for downloads). The Clinical Proteomic Tumor Analysis Consortium (CPTAC) hosts a collection of proteomic data that can be freely downloaded from their website, and when you download files there, you can use the Aspera browser plug-in for free.

[Image: CPTAC download site]

It is pretty fast

In an attempt to download 700 MB of files (total), it took about 2 min. You can see that, with my 100 Mbps bandwidth, it used 43% of total capacity for the download. 43 Mbps is 5.4 Mbyte per second, which is 50-100 times faster than FTP and 5 times faster than Globus Online. At this speed, I could download a 100 GB file in under 5 hours. It doesn't seem able to use 90% of my bandwidth, but it is still significantly faster. With a gigabit-per-second connection, the same transfer should finish in less than 1 hr. I can certainly see the advantage of using their program for very large file transfers. In fact, large companies such as Netflix use Aspera because they need to transfer large amounts of data every day.

[Image: Aspera transfer window]

How much does it cost?

It is not free, unfortunately; good service doesn't come for free. I searched the web for pricing and found that it charges by the hour. If you are a frequent user, you will need to justify the cost. But if you have a very fast connection and lots of data to transfer, this could be the solution.

[Image: Aspera pricing]

Power of Parallel Computation-cont’d

The paper published by Ham et al. in the Journal of Proteome Research in 2008 showed how many replicates you have to run to find all the proteins in a sample. They ran 10 technical replicates of a moderately complex peptide sample from a Shewanella extract and analyzed them with an LTQ-Orbitrap and an 11 T FTICR. When I read it the first time, I was surprised by the results. It took at least 6 technical replicates to identify most of the detectable proteins (>95%). They identified a total of >8,000 unique peptides, but each dataset contained only ~4,000 peptides. That is, a single mass spec run identifies only about 50% of the identifiable peptides; even with duplicates, that is still only 60-70%. Even though new MS instruments are faster and more sensitive, people want to analyze far more complex samples. What this means is that we are merely scratching the surface of the entire proteome of higher organisms.

[Image: figure from Ham et al. 2008]

If I want to identify as many peptides as possible, running plain replicates is definitely not the best way. The first run can be a normal run, as in this paper, but the second and third runs should focus on the ions that were not selected in the previous runs. I thought about this idea long ago, but my instrument, which runs on Xcalibur, cannot take a parent-ion exclusion list of more than 100 entries, and even excluding just 100 ions was cumbersome to set up. Creating an exclusion list from the first run would not be difficult; say you exclude the top 90% most intense ions. However, to successfully exclude the ions found in the first run, the software needs to handle the slight shifts in retention time in the following run. This could be tough, because even if you spike in controls to adjust retention times, those ions still have to be found first. I wonder what would happen if the second run focused only on low-abundance ions and ignored signals above a certain level. My guess is that the real problem is that too many ions elute from the column at the same time in some parts of the gradient, and the machine simply cannot pick them all up.

In any case, 6 technical replicates for every complex sample is too costly. Duplicates are a minimum requirement, but ideally triplicates or more are necessary. Maybe multiplexing multiple samples and running triplicates will be the way to go in the future.

The number of protein identifications does not depend only on technical replicates; it also depends on which search engine you use. Search engines with different algorithms give quite different IDs. Well-known programs such as X! Tandem, SEQUEST and MASCOT produce noticeably different peptide/protein lists. For example, David Tabb's group published a paper in 2007 describing their new search engine, MyriMatch. It identifies more proteins than X! Tandem and SEQUEST, but each pair of search engines overlaps by only 60-70%. I have seen similar numbers with MSGF+. Because computational analysis is much cheaper than running mass spec instruments, it is wasteful not to search with at least several search algorithms before you move on.

[Image: search engine overlap comparison]

In the previous post, I described the power of parallel computing. If you have unlimited computing resources, most searches finish pretty quickly. Here I am going to implement three batch searches for each mass spec sample, using X! Tandem, MSGF+ and MyriMatch. All three search programs are freely available for Linux (and Windows), and I have access to a computer cluster running Linux. All scripts described here are shell scripts.

I downloaded MSGF+ from here, and modified the batch search program from the earlier X! Tandem post.

[Image: screenshot of the MSGF+ batch search script]
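The script itself is only visible in the screenshot above, so here is a minimal sketch of the kind of loop it contains, submitting one MSGF+ search per mgf file listed in file_name.txt (the file list generated in the batch X! Tandem post further down). The database path, jar location and memory setting are placeholders of my own, not the exact values from the screenshot; the -s, -d, -mod and -o options come from the MSGF+ command-line documentation.

#!/bin/sh
# Sketch of an MSGF+ batch loop: one qsub job per spectrum file
db="../../DB/target_decoy.fasta"   # hypothetical database path - edit for your setup
while read mgf; do
  out="${mgf%.mgf}.mzid"
  qsub -l h_rt=4:00:00 -cwd -b y \
    java -Xmx3500M -jar MSGFPlus.jar -s "$mgf" -d "$db" -mod Mods.txt -o "$out"
done < file_name.txt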

For MyriMatch, the program I downloaded from Tabb's website gave a segmentation fault when I executed it, so I used a different version downloaded from here. The batch search for MyriMatch looks like this:

[Image: screenshot of the MyriMatch batch search script]
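As above, the screenshot is the authoritative version; a rough sketch of the MyriMatch loop would look like the following. The binary location and database path are placeholders, and while -cfg and -ProteinDatabase are documented MyriMatch options, check the version you downloaded.

#!/bin/sh
# Sketch of a MyriMatch batch loop using the same file list
db="../../DB/target_decoy.fasta"   # hypothetical database path - edit for your setup
while read mgf; do
  qsub -l h_rt=4:00:00 -cwd -b y \
    ./myrimatch -cfg myrimatch1.cfg -ProteinDatabase "$db" "$mgf"
done < file_name.txt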

To run all three search engines together, you can simply put all three scripts in one file, but you will probably have to change the file path for each search program.

[Image: combined batch script for MyriMatch, X! Tandem and MSGF+]

In this script, you need to have the X! Tandem configuration files (default_input.xml, input.xml and taxonomy.xml), the MyriMatch configuration file (myrimatch1.cfg) and the MSGF+ configuration file (Mods.txt) in the current directory. In the future, I will modify the program so that you don't have to change these configuration files before you start the searches.

Power of Parallel Computation

The average person may think computers get faster and faster every year. That is true to a degree, but in reality a single processor is not getting much faster. The speed of computation depends on clock speed (among other things), but clock speed cannot go much higher than something like 3.2 GHz; above that the chip simply uses too much electricity and generates too much heat. If you look around, new computers don't have much higher clock speeds than those of 10 years ago. The major difference is the use of multi-core CPUs: your computer may be equipped with a dual-core, quad-core or octa-core chip.

Supercomputers and large computer clusters have thousands or tens of thousands of CPU cores. An individual core may not be much faster than the ones in your computer, but if you have a large task, you can divide the work among many cores to finish it quickly. Memory usage is also a key part of fast computation: a big chunk of computation time can be spent just moving data back and forth between levels of memory.

There are four levels of memory (storage) in a computer:

L1-L2 cache
DRAM
SSD
DISK

The L1-L2 cache is a small memory that sits on the chip with the CPU. It is extremely fast but holds only a small amount of data (typically tens of kilobytes for L1 and a few hundred kilobytes for L2). Next is DRAM, which is what people usually mean by a computer's "memory". It is fast, but still expensive compared to other storage. Solid state drives (SSD) and spinning disks have much larger capacity and are cheap, but their access speed is slower.

To maximize computational capacity, you can control memory usage when you write a program. Mass spec search engines are capable of using multi-core CPUs, and if you have access to a cloud computer (e.g. Amazon) or a large computer cluster, you can write a script to submit a bunch of batch searches.

In previous posts, I described how to automate the X! Tandem search. However, that method runs the searches sequentially. If you have a limited number of cores, this is fine, but if you have many cores available, you can submit each search as its own batch job. Then, as soon as CPUs become available, the next search starts. All you need to do is change one line of code in the program.

To submit a job on the Linux cluster, the command you need is qsub.

In the previous X!tandem search on Linux post, I explained how to run automated X!tandem search. In line 6,

./tandem.exe input2.xml

We need to change this to

qsub -l i,h_rt=4:00:00 -cwd -b y ./tandem.exe input2.xml

This will allow us to run the X! Tandem search with a maximum run time of 4 hrs. To check the status of your batch jobs, type myjob (qstat gives similar information on most clusters that use qsub).
[Image: batch job status output]

Because all searches are submitted individually, if there are enough nodes available, the total search time for all samples is essentially the time of the single longest search. Imagine you have 20 mass spec samples and search the database with three search engines; that is 20 × 3 = 60 searches. If you run them sequentially, it takes 60 times as long, assuming each search engine spends equal time per search (in reality, some search engines are much faster than others). The power of parallel computing shows when you can divide your job into many small jobs and have enough CPUs available.

Program bug-fix on 08/21/13

There is a bug in the batch searches using the qsub command. If there are enough nodes to run all batches immediately, the current program works just fine. However, if available nodes are limited and some batch searches are queued, it will search the same mgf file over and over, because each queued job only reads input.xml when it starts, by which time the file has been overwritten. To fix this, the program needs to generate a separate input.xml file for each search. This is very simple to do, so I added the fix here.

[Image: screenshot of the revised autotandem script]

Basically, I added a numeric variable $n. After every batch search is submitted, n is incremented by 1, and a new input.xml file is created with a numeric suffix.
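The revised script itself is only shown as an image above, so here is a minimal sketch of the idea. It assumes input.xml names the spectrum file in a single <note type="input" label="spectrum, path"> element, which is the standard X! Tandem layout; the sed pattern may need adjusting for your own input.xml.

#!/bin/sh
# Give every submitted search its own numbered input file, so jobs that sit
# in the queue don't all read the same, overwritten input.xml
n=0
while read mgf; do
  n=$((n+1))
  sed "s|label=\"spectrum, path\">.*<|label=\"spectrum, path\">$mgf<|" \
      input.xml > input_$n.xml
  qsub -l h_rt=4:00:00 -cwd -b y ./tandem.exe input_$n.xml
done < file_name.txt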

Installing Galaxy-P on Linux Server

In the genomics world, Galaxy is well known. It was originally created by Penn State researchers as an interface to access, manage and manipulate genomic data (e.g. DNA sequences). Recently, Galaxy-P was developed at the University of Minnesota as a multi-omics analysis platform. There are public Galaxy-P sites where you and I can send mass spec data, search databases and analyze the results. You can also install the platform locally, develop new tools and manage the server for future expansion and development. Here I will install Galaxy-P on my Linux server (running Ubuntu) and test out its functionality.

First you need to install Mercurial with the following command:

>sudo apt-get install mercurial

This will take a few MB of disk space. Then you need to get the Galaxy-P source code distribution.

>hg clone https://bitbucket.org/galaxyp/galaxyp-central

If everything downloads correctly, you will see messages like this.
[Image: output of the hg clone command]

To start a Galaxy-P server, go into the galaxyp-central directory you just cloned and type

>sh run.sh --reload

To test that the installation succeeded and the server is running, open a web browser (Firefox or similar) and type localhost:8080 in the address bar.

[Image: Galaxy-P start-up page at localhost:8080]

You should see the GUI of Galaxy-P in the browser.

Now, let's make this accessible from anywhere on the internet. Right now you can access Galaxy-P only from the local computer.

First we need to modify configuration file called universe_wsgi.ini

Change the current directory to galaxyp-central and open the universe_wsgi.ini file in a text editor. For example,

>vim universe_wsgi.ini

Then change the lines that define the port and the host, then save the file.

For the port we will use 8080, and the host is changed to 0.0.0.0 so the server listens on all network interfaces.
[Image: universe_wsgi.ini port and host settings]
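For reference, the relevant part of universe_wsgi.ini ends up looking roughly like this (the host and port settings live under the [server:main] section; uncomment them if they are commented out):

[server:main]
# listen on all network interfaces instead of localhost only
host = 0.0.0.0
port = 8080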

Then you need to open port 8080. If you are behind a router, set up port forwarding so that external traffic on port 8080 reaches the server.

[Image: router port-forwarding settings]

Now that the port is open, we are ready to run the server.

>sh run.sh --reload

If all went well, try accessing it from outside the network: open a browser and type the server's IP address followed by :8080 in the URL bar.
Either XXX.XXX.XXX.XXX:8080 or domain_name:8080 is fine.

[Image: Galaxy-P loaded from outside the network]

The browser should start loading Galaxy-P, and it will look identical to the localhost screen above.

A more detailed installation guide can be found here and here.
This post just shows how to install Galaxy-P on your local machine or server; once it is installed on a server, you can use it remotely. I will likely discuss its functionality in a future post.

Batch X!tandem Search on Linux -cont’d

In the previous post, I explained method #2 for batch searching with X! Tandem. In that method, you still have to type the file path and file name for every MS/MS data file. It takes only a single additional line of script to upgrade to method #3, in which the program grabs all the spectrum files (such as .mgf) in a specified directory and creates a file containing their paths and names.

In Linux, the find command is used to look for files in a specified directory that match a given pattern. In this case, you want to find only the spectrum files in the directory. The syntax of the command is

find where-to-look condition what-to-do

Let's specify where-to-look with
spec_folder="../../MS_data/073113exp"

The condition is to find regular files (not directories) with the .mgf extension. Quoting the pattern keeps the shell from expanding it before find sees it:

-type f -name "*.mgf"

You may not want to look in subdirectories. If so, add

-maxdepth 1

Finally, you write the file names to file_name.txt using ">". So the final script line looks like

find "$spec_folder" -maxdepth 1 -type f -name "*.mgf" > file_name.txt
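Putting the pieces together, a minimal version of method #3 might look like the sketch below. The sed line that plugs each spectrum file into input.xml is my assumption about how the earlier batch script works; the downloadable script linked at the end of this post is the authoritative version.

#!/bin/sh
# Sketch of method #3: collect the spectrum files automatically, then loop over them
spec_folder="../../MS_data/073113exp"   # edit this to point at your data directory
find "$spec_folder" -maxdepth 1 -type f -name "*.mgf" > file_name.txt

while read mgf; do
  # write this file's path into the X! Tandem input file, then run the search
  sed "s|label=\"spectrum, path\">.*<|label=\"spectrum, path\">$mgf<|" input.xml > input2.xml
  ./tandem.exe input2.xml
done < file_name.txt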

In summary, to run a batch X! Tandem search on Linux:

1) Put all spectrum files in one directory
2) Edit the first line of the program to specify the directory
3) Edit default_input.xml (if you want to change modifications, error tolerance etc)
4) Edit input.xml (if you want to change species)
5) Edit taxonomy.xml (if you need to add databases)
6) Execute the program (make sure it is executable and placed in the same directory as tandem.exe)

Did it work? I hope so!

Final code can be downloaded here.
