Archive | July 2013

Batch X!tandem Search on Linux


Every day I generate 6-10 MS/MS spectrum files, and I got tired of the repeated clicking and typing of file names, output names and species for X! Tandem. I want to automate the search so I can spend the saved time on something else. I know there is a program for batch X! Tandem, but you may not have a graphical interface on your Linux system (at least my Linux box doesn’t), so it is useful, and more flexible, to do this from the command line. OK, in order to run X! Tandem, you need 4 files in the same directory:
1) input.xml
2) default.xml
3) taxonomy.xml
4) tandem.exe
The MS/MS spectrum file name and the output file name are the ones you change most often, and these are stored in the input.xml file. In order to automate the search, there are several ways to do it:

1) Create as many input.xml files as MS/MS spectrum files, then write a script to run tandem.exe on them sequentially
2) Create a file that contains the MS/MS spectrum file names, then write a script to read it line by line and modify the input.xml file, executing tandem.exe until all files are searched
3) Place all MS/MS spectrum files in one directory, and write a script to run tandem.exe for every file in the directory

For 1), if you have only a few files, it is easy to implement, but if you have more files (>10) it becomes cumbersome and 2) works better. If you have many files (>50), typing (or copying and pasting) the file names takes time, so 3) works best.

Here, I am going to show you how to implement method 2). First, create a file, let’s say “file_name.txt”. This file contains all the MS/MS spectrum file names with directory information. For example,

../msdata/073013_exp1.mgf
../msdata/073013_exp2.mgf
../msdata/073013_exp3.mgf
../msdata/073013_exp4.mgf
………
………
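Incidentally, you don’t have to type this list by hand. Assuming the spectra live in ../msdata (the directory name from the listing above), a one-liner can generate it:

```shell
# Generate file_name.txt from the contents of ../msdata (one path per line).
ls ../msdata/*.mgf > file_name.txt
```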

Place this file in the same directory as the other necessary files listed above. Then write a shell script to automate the search.

while read line
do
  echo -e "Writing $line in input.xml\n"
  sed 's=spectrum, path">.*<=spectrum, path">'$line'<=' <input.xml >input1.xml
  sed 's=output, path">.*<=output, path">'${line%.*}'_output.xml<=' <input1.xml >input2.xml
  ./tandem.exe input2.xml
done < file_name.txt

I used a while loop to read each line in file_name.txt until it reaches the end. Each line is stored in a variable ($line), which is then inserted at the right places in input.xml. If you look at the input.xml file in X! Tandem, the input and output file names are defined on two lines (2nd and 3rd from the end).

[Screenshot: input.xml, showing the spectrum path and output path lines]

The sed command is very useful for finding and replacing character strings. The basic format is

sed 's/abc/def/' <file

Here the string abc is replaced with def wherever it is found in file. The slash (/) is used as the delimiter. However, you have to be careful about which delimiter you use: since $line contains slashes, you cannot use a slash as the delimiter. I used the equals sign (=), which does not appear anywhere in this expression. For more detailed usage of the sed command, click here.
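To make this concrete, here is a toy example (the text and replacement path are made up, just for illustration) showing why a non-slash delimiter is needed when the replacement itself contains slashes:

```shell
# Replace a path using '=' as the sed delimiter; with '/' as delimiter,
# the slashes inside '../msdata/new.mgf' would end the expression early.
echo 'path>old.mgf<' | sed 's=path>.*<=path>../msdata/new.mgf<='
# prints: path>../msdata/new.mgf<
```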

Finally, results will be written as input_file_name_output_XXXX_XX_XX_XX_XX_XX.t.xml in the same directory as the input MS/MS files.
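For completeness, method 3) from the list above can be sketched in the same style. This is an untested sketch that assumes the same input.xml structure and the same ../msdata directory as before:

```shell
#!/bin/sh
# Method 3: run tandem.exe for every .mgf file found in ../msdata,
# without maintaining a file_name.txt list.
for f in ../msdata/*.mgf
do
  echo "Searching $f"
  sed 's=spectrum, path">.*<=spectrum, path">'$f'<=' <input.xml >input1.xml
  sed 's=output, path">.*<=output, path">'${f%.*}'_output.xml<=' <input1.xml >input2.xml
  ./tandem.exe input2.xml
done
```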


Which Trypsin Do You Use?

Although this topic is not directly related to bioinformatics, I think it is generally useful information for many proteomics researchers. Trypsin is the universal proteinase for mass spec experiments. It generates nice peptides that ionize well, and it is not so expensive for routine in-gel or in-solution digestion. The protocols are very well established, and many people assume that what they are doing is the optimal condition. Is that true?

Today I came across a paper recently published online in the Journal of Proteome Research.

[Image: the article benchmarking 6 trypsins]

In this article, the authors compared 6 commonly used trypsins in terms of specificity, efficiency, contaminants and so on. They used 8 standard proteins bought from Sigma. The proteins were solubilized in 6 M urea/TEAB, then reduced and alkylated. The protein solution was diluted with TEAB and digested with trypsin overnight. The trypsins compared were:

Promega (#5113): Sequencing grade modified trypsin, $41/100ug
Sigma (T6567-5X20UG): Trypsin (porcine), $49.7/100ug
Sigma (T1426-50MG): Trypsin (bovine), $36.1/50mg
Sigma (T3568): TrypZean (bovine expressed in maize), $165/10mg
Worthington-Biochem (LS02120): Trypsin (bovine), $56/100ug
Roche (03708985001): Trypsin (porcine expressed in yeast), $80/100ug

As the authors mention, the price varies dramatically: Roche is the most expensive of all (1.25ug/$) and bovine trypsin from Sigma is the cheapest (1385ug/$).

The mass spec analysis they used was both label-free and TMT labeling. Peptides were analyzed by nLC-MS/MS on an Orbitrap Velos or Q-Exactive.

Here are the conclusions.

Contaminants:
Sigma (bovine): Chymotrypsin
Roche: 5 yeast proteins
Sigma (TrypZean): 5 maize proteins

Specificity & Efficiency:
Fully tryptic: more is better
Promega=TrypZean>Sigma Porcine=Roche>Sigma Bovine=Worth.Bio

Missed cleavage: fewer is better
Promega=Sigma Bovine=TrypZean<Sigma Porcine=Roche<Worth.Bio

Non-tryptic, tryptic only N- or C-term: fewer is better
Promega=Sigma Porcine=TrypZean=Worth.Bio=Roche<<Sigma Bovine

Based on these results, Promega trypsin is a good choice to start with, as this product has been around for a long time; I personally use it for most applications. However, if you need a large amount of trypsin, TrypZean may be worth considering. In particular, if you want to study post-translational modifications, you probably need to digest a large amount of protein first and then purify the specifically modified peptides. Bovine trypsin from Sigma is cheap, but it contains chymotrypsin, which increases non-specific cleavages.

They also tested different protein:trypsin ratios from 1:20 to 1:100. They found that the number of identified peptides did not change much, but at the higher ratio (1:20) they observed more non-tryptic peptides due to the increased activity of contaminating chymotrypsin or other proteases. So there is no need to use more trypsin than necessary.

Create A Quick Relational Database For Data Analysis Using R

A mass spec analysis pipeline often requires combining all the data from a single experiment and creating a relational database for final visualization. Analysis programs such as PeptideShaker or IDPicker produce ID files for individual samples, but they don’t have a feature to combine different conditions and compare them at the same time. So I am going to show how you can combine multiple data files using R.

To summarize the process:

1) Save result files (.csv or .txt) in one directory

2) Load all files into R

3) Create a table that contains ID and Description without redundancy

4) Join each data file to the table from 3)

To create a relational database in R, several packages are required. The most important one is called “sqldf”. However, sqldf itself depends on some other packages; on my Mac, the following are required to use sqldf. Note that to use sqldf, you also need to install XQuartz 2.7.4 (or some such) on your Mac.

>install.packages("DBI")
>install.packages("gsubfn")
>install.packages("proto")
>install.packages("chron")
>install.packages("RSQLite")
>install.packages("RSQLite.extfuns")
>install.packages("sqldf")

Then load these packages

>library("DBI")
>library("gsubfn")
>library("proto")
>library("chron")
>library("RSQLite")
>library("RSQLite.extfuns")
>library("sqldf")

Set up the working directory (I am going to use R-test on the desktop):
>setwd("~/Desktop/R-test")

Read the data files (.csv) into R. For .txt files, use an appropriate separator, usually tab delimited (sep="\t").

>files<-list.files(pattern="\\.csv$") # regular expression matching files ending in .csv
>files # show filenames in the working directory
>rdf<-lapply(files, read.csv) # files are loaded into rdf

rdf is a list composed of multiple tables loaded from the working directory.
For simplicity, we have three files in the directory. You can do this step for many more files using a for-loop or an apply function. In this example, the data files contain three columns: Accession, Description and Score. For example,

table1
Accession Description Score
1                   Gene A          100
2                   Gene B           50
3                   Gene C          75

table2
Accession Description Score
2                   Gene B          42
5                   Gene E          88
3                   Gene C          37

table3
Accession Description Score
1                   Gene A          100
3                   Gene C           50
6                   Gene F          22

Note: There shouldn’t be any duplicate accessions within a table. If there are, the final list will be longer, with many duplicated rows.

Convert the three tables in the list to data frames and store them as table1, table2 and table3.

>table1<-as.data.frame(rdf[[1]])
>table2<-as.data.frame(rdf[[2]])
>table3<-as.data.frame(rdf[[3]])

Create a table that contains only Accession and Description using UNION:

>unionall<-sqldf("select Accession, Description from table1 union select Accession, Description from table2 union select Accession, Description from table3")

Finally, merge all three files with the accession+description table.

>mergeall<-sqldf("select unionall.*, table1.Score, table2.Score, table3.Score FROM ((unionall LEFT JOIN table1 ON unionall.Accession = table1.Accession) LEFT JOIN table2 ON unionall.Accession = table2.Accession) LEFT JOIN table3 ON unionall.Accession = table3.Accession")

The mergeall table should look like this

Accession    Description   Score Score Score
1                   Gene A         100    NA      100
2                   Gene B          50      42       NA
3                   Gene C          75      37       50
5                   Gene E          NA     88       NA
6                   Gene F          NA     NA      22

You can see that each accession number appears only once. If a gene was not identified in a particular sample, its score is left as NA.
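Incidentally, sqldf runs SQLite under the hood, so the same union + left-join merge can be done entirely from the shell with the sqlite3 command-line tool. This is a sketch under the assumption that the result files are named table1.csv, table2.csv and table3.csv:

```shell
# Sketch: merge table1.csv/table2.csv/table3.csv with the sqlite3 CLI,
# mirroring the sqldf union + left-join above. File names are assumed.
sqlite3 <<'EOF'
.mode csv
.import table1.csv table1
.import table2.csv table2
.import table3.csv table3
.headers on
SELECT u.Accession, u.Description, t1.Score, t2.Score, t3.Score
FROM (SELECT Accession, Description FROM table1
      UNION SELECT Accession, Description FROM table2
      UNION SELECT Accession, Description FROM table3) AS u
LEFT JOIN table1 t1 ON u.Accession = t1.Accession
LEFT JOIN table2 t2 ON u.Accession = t2.Accession
LEFT JOIN table3 t3 ON u.Accession = t3.Accession
ORDER BY u.Accession;
EOF
```

Unmatched accessions come back as empty fields here rather than NA, but the table is otherwise the same.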

Big Data Transfer Using Globus Online

One of the major problems in dealing with large data is transferring files over the network. Mass spec files are fairly large, often exceeding a gigabyte. Computation on mass spec results may take some time, but in some cases file transfer takes longer than computation. If each mass spec file is transferred as soon as the mass spec run is over, the total transfer time is similar to the mass spec run time (I am assuming your network connection is faster than the rate at which raw data is generated). For a few file transfers, SCP/FTP is OK, but large transfers are better with Globus Online.

[Screenshot: the Globus Online website]

What this site does is transfer files (especially large ones) across the internet quickly and reliably. You can also share files with multiple users. Operation is as simple as Dropbox.

First, create an account for yourself at Globus Online by clicking “sign up now”. After verifying your account by email, you can immediately start using the service. There are things you can do for free, such as:

1) File transfer and synchronization to/from servers

2) Create private and public endpoints

3) Access to shared endpoints created by others

Things that are NOT free ($7/mo or $70/yr):

1) Peer-to-peer transfer and share files

2) Create and manage shared endpoints

As far as I understand, if your university server has already signed up for Globus, you can create an endpoint on your computer and start transferring files between the registered server and your PC without any charge.

These are the steps to use Globus online for file transfer.

1) Sign up for a new account
2) Log on to Globus online, and go to  “Manage Data” on the top of the page, and select “manage endpoints”.
3) Click “add Globus Connect”.
4) Type endpoint name (e.g. myPC) and click “generate setup key”.
5) Select the operating system you are using (Mac, Linux or Windows).
6) Install Globus Connect on your computer (automatic).
7) Open Globus Connect application.
8) Copy the setup key and paste into the box in Globus Connect app. Hit OK.
9) Click “start transfer” under “Manage Data” tab on the top.
[Screenshot: the “start transfer” page]

10) Type your endpoint name (account_name#computer_name) in the Endpoint box (either left or right). Then click “Go”.
11) Type the server endpoint name in the other Endpoint box. Then click “Go”.

Once you see the file structures for both sides, you can start transferring files. Select multiple files by holding the CTRL key, then hit the arrowhead to send the files.

Once a transfer has started, you can see the job on the transfer activity page. The real advantage of transferring files with Globus Online is that not only is the transfer faster over the internet, you also don’t have to be logged in to the computer you want to transfer files from. This means that with three computers (A, B & C), you can direct a transfer from A to B, or vice versa, from computer C.

You can also quickly synchronize a directory by opening the options and checking “only transfer new or changed files”. You can further select an option for how you want to define new or changed files.

[Screenshot: transfer options]

I also put a link to Globus online manual here.

Note: I initially had trouble sending files from my computer, and found that my computer was part of the University Hospital network, which restricts the ports used by Globus Online. Since it was not possible to ask the network admin to open a port for Globus Online, I had to use a VPN to work around the issue.

Overall, I am pretty satisfied with the speed of transfer. Globus online is ~5x faster than SCP.

Dunnett’s Multiple Test of The Difference Using ‘R’

I recently encountered a statistical question: simultaneously comparing multiple groups on the difference in a certain characteristic. Normally I use a statistical program like Minitab to run multiple-comparison tests (e.g. Dunnett’s, Duncan’s, etc.), but I couldn’t find a way to directly compare the differences among groups with these programs.

Here is an example: there are 4 classrooms, and each classroom is divided into a male group and a female group. We then measure the heights of the students. I want to test whether the differences in height between male and female students are statistically significant when all classes are compared to each other.

Class     Sex            Height [cm]
A         Male           175
A         Male           169
A         Male           177
A         Female         162
A         Female         167
A         Female         155
B         Male           183
B         Male           171
B         Male           179
B         Female         171
B         Female         166
B         Female         164
C         Male           162
C         Male           175
C         Male           176
C         Female         161
C         Female         165
C         Female         152
D         Male           175
D         Male           178
D         Male           172
D         Female         161
D         Female         167
D         Female         153

Calculate the difference of average height for each group.

      Avg Male   - Avg Female
A     173.6      -  161.3             = 12.3
B     177.6      -  167.0             = 10.6
C     171.0      -  159.3             = 11.7
D     175.0      -  160.3             = 14.7

Looking at the numbers, D has the largest difference in height between male and female, and B has the smallest. Are these differences statistically significant? How do we test them?

In R, there is a package for this: multcomp. It provides several important features for this type of analysis. Please read here for more details.

>library(multcomp) ## load the multcomp package (install it first with install.packages if needed)

Before starting in R, let’s save the data above as “test.csv”. You need to add the column names “Group”, “Sex” and “Measure” at the top of each column. Remember the directory where you saved the file (e.g. ./data).
Let’s set the working directory and read the test.csv file.

>setwd("./data")
>dat<-read.csv("./test.csv")

Then run the ANOVA. Measure is height, Group is A~D and Sex is male or female.
Note that Group and Sex may interact with each other, so use * in the formula. The “-1” removes the intercept term.

>mod <-aov(Measure~Group*Sex-1,data=dat)

Print coefficients

>coef(mod)

     GroupA      GroupB      GroupC      GroupD
161.3333333 167.0000000 159.3333333 160.3333333

    SexMale  GroupB:SexMale  GroupC:SexMale  GroupD:SexMale
 12.3333333      -1.6666667      -0.6666667       2.3333333

You may wonder what these numbers are, but it is not hard.
The first row is simply the average female height for each Group.
SexMale (12.3333) is the male-female height difference in Group A.
GroupB:SexMale (-1.6666) is the difference between B’s male-female height difference and A’s.
GroupC:SexMale (-0.6666) is the difference between C’s male-female height difference and A’s.
GroupD:SexMale (2.3333) is the difference between D’s male-female height difference and A’s.

As you can see, A is always used as reference to generate these numbers.
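To see where these numbers come from, it helps to write out the model that aov fits (notation mine, not from the original analysis). With $M = 1$ for male and $0$ for female, the no-intercept formula Measure ~ Group*Sex - 1 corresponds to

```latex
% mu_g: female mean of class g; beta: male-female gap in class A;
% gamma_g: how class g's male-female gap differs from class A's.
\mathrm{Height}
  = \sum_{g \in \{A,B,C,D\}} \mu_g \,[\mathrm{Group}=g]
  + \beta M
  + \sum_{g \in \{B,C,D\}} \gamma_g \,[\mathrm{Group}=g]\, M
```

so, for instance, comparing B against C amounts to testing $\gamma_B - \gamma_C = 0$.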
Now, to test the difference between A and B, we test GroupB:SexMale = 0.
To test the difference between A and C, we test GroupC:SexMale = 0.
To test the difference between A and D, we test GroupD:SexMale = 0.

How about the rest of the comparisons?

To test the difference between B and C, we test GroupB:SexMale - GroupC:SexMale = 0.
To test the difference between B and D, we test GroupB:SexMale - GroupD:SexMale = 0.
To test the difference between C and D, we test GroupC:SexMale - GroupD:SexMale = 0.

The code for this is

>summary(glht(mod,linfct =
c("GroupB:SexMale = 0",
"GroupC:SexMale= 0",
"GroupD:SexMale= 0",
"GroupB:SexMale - GroupC:SexMale= 0",
"GroupB:SexMale - GroupD:SexMale= 0",
"GroupC:SexMale - GroupD:SexMale= 0")))

Simultaneous Tests for General Linear Hypotheses
Fit: aov(formula = Measure ~ Group * Sex - 1, data = dat)
Linear Hypotheses:

                                     Estimate Std. Error t value Pr(>|t|)
GroupB:SexMale == 0                   -1.6667     6.6792  -0.250    0.994
GroupC:SexMale == 0                   -0.6667     6.6792  -0.100    1.000
GroupD:SexMale == 0                    2.3333     6.6792   0.349    0.985
GroupB:SexMale - GroupC:SexMale == 0  -1.0000     6.6792  -0.150    0.999
GroupB:SexMale - GroupD:SexMale == 0  -4.0000     6.6792  -0.599    0.931
GroupC:SexMale - GroupD:SexMale == 0  -3.0000     6.6792  -0.449    0.969
(Adjusted p values reported -- single-step method)

Well, in this example none of the comparisons ended up significant (all p > 0.05; the numbers on the right).
I think the standard deviation of height was probably too large.
Please test with your own data to see if it works.
