Archive | October 2013

Regular Expression Tutorial 1: special characters

Regular expression is a computer code, useful to find certain strings in a text file. It can also do ambiguous matching with complex conditions. I use it quite often but forget some details, so it would be useful for people to refresh the knowledge of regular expression.

There are certain characters used in regular expression which have special meanings.  In other words, these characters are not read as the way you see in word or notepad. If you have these characters in the regular expression, the program translates differently.

     Name              Function
\    back slash        escape character
[ ]  square brackets   single character match
{ }  curly braces      repeats
( )  parenthesis       reference or subexpression
^    hat               beginning of a line (not string)
$    dollar            end of a line (not string)
|    pipe              alternation [OR]
*    asterisk          zero or more times of repeat
+    plus sign         1 or more times of repeat
?    question mark     occur 0 times or once 
.    dot               any single character
!    exclamation       negation [NOT]

Back slash “\” is the first one in the list and this character is used as “escape”, meaning if the program sees this character, it will do different things depending on what character comes next. The list below is to specify non-printable characters.

\t   tab
\n   new line
\r   carriage return
\f   form feed character (end of page character)

Back slash can also be used to specify certain character class

\s     a white space
\S     non white space
\d     a digit [0-9]
\D     non digit [^0-9]
\w     word character [a-zA-Z_0-9]
\W     non word character
\b     similar to \W but creates boundary (read below)
\A     beginning of a string
\z     end of a string

\W and \b are similar in the sense that both of them recognize the string starting with non-word character. The difference is \b includes only the part after non-word character, while \W select non-word character and the word. For example in a sentence “It is very hot today.”, \bho recognize only “ho”, while \Who includes the space character and “ho”. \b defines the word boundary, so if you want to search a word “all”, you can use /ball/b. It will not match “tallest”, ‘ball” or “alleged”.

Another way of using \ is to use special character as a normal character. For example, \^ will work to search ^ character. Double back slashes \\ is used to search \ character.

Brackets [ ], brances { } & parenthesis ( ) in regular expression treat inside characters differently. Combinations of these allows more complicated methods for string match.

[ ] Square brackets:  Match one character if the character is inside of the square brackets. [A-Z] matches  a character A to Z. [0-9] matches a digit (same as \d). [a-zA-Z_0-9] matches to any word character. K[ab2] matches to Ka, Kb and K2.
^ inside of  square brackets [^ ] will match the characters NOT in the square brackets. For example [^Ab] will match any characters except “A” and “b”.

{ } Curly braces: Usually, you have one or two number(s) in curly braces, such as {3}, {4,5} and {2,}. It takes the preceding character and look for the repeats specified in the braces. For example, b{3} matches bbb. [1-3]{2} matches to 11,12,13,21,22,23,31,32,33. If two numbers are specified, repeating numbers have to be in that range. a{2,4} matches aa, aaa and aaaa. If there is no second number with comma, it translates as minimum number of repeats.

( ) Parenthesis: Put characters inside of parenthesis, if you want to refer the matched string again. To refer, use back slash and a number which indicates the position of the parenthesis. For example, ([ac]b)\1 mathes either abab or cbcb because \1 is referring to the string in the parenthesis, which is either ab or cb. I explained a little more about usage of parenthesis in another post. Parenthesis is also used together with pipe |. Pipe is alternation, in math it is equivalent to OR. So if you want to match a string abc OR acc, you can do like (abc|acc).

( ) Parenthesis with ?: When ? is inside of parenthesis, it works quite uniquely dependent on what comes after that (e.g. =,!,<=,<!,>).  It is used to capture substrings of an input strings. Why do you need this kind of function? Let’s say you want to capture “gold” in the sentence “I like the gold coin.”, but only when it is followed by “coin”.  Cases like this, when you want capture something with conditions but you don’t want to capture the conditions), parethesis with ? inside will be used.

For example, if it is together with equal sign (=), it will look for a substring preceding the parenthesis with conditions, but it only capture the substring outside of parenthesis.

gold(?=\scoin)              I really like the gold coin.

Red part is what is captured, and green is the condition met in the parenthesis. To do negative assertion, you use exclamation ! instead of equal =. For example, if you want to capture “gold” not followed by ” coin”.

gold(?!\scoin\b)           I really like gold medals but not gold coins.

It captures first “gold”, because it is not followed by “coin”, second “gold” is NOT captured because \b defines the end of the word. If you remove the \b, it will capture both “gold”s. If your condition is before the string you want to match, you put < between ? and =.

There are more ways to capture a string with conditions using different special characters and they are also quite useful. I am not going to discuss with these, please refer here.

If you want to quickly test how your regular expression behaves, you can use regular expression test web site (1 & 2) or free text editor such as EditPad Pro.

Also you can use a lot of random words to check what actually match with your regular expression.

Other useful links

Common regular expressions (e.g. URL, email and etc)

Regular expression library

More regular expression examples

Advertisements

Connecting Any Number of Tables using SQL in R

In the previous post, I showed how to create a relational database using sqldf package in R. The example was with three tables but I would like to expand this program for any number of tables. Users can place as many (csv) files as they want in one directory, then the program reads all files, and connect them to create a database. I think this makes the program more practical as the number of files to connect varies dependent on the situation.

The input csv file in this example is derived from IDPicker. Note that gene accession on the left has only unique entries (no duplicates).
Screen Shot 2013-10-17 at 11.00.53 PM

Place multiple csv files in one directory.
Screen Shot 2013-10-17 at 11.03.43 PM

Then run a program to make a file looking like below. You see Gene ID (accession) and description followed by data from each file. If data are missing, you see NA.
Screen Shot 2013-10-17 at 11.07.31 PM

The structure & process of the program is

1) Install and read sqldf library (before running the program)
2) Specify directory with setwd() function
3) Read all tables (.csv) into R
4) Extract key (accession) from each table and create a table (key_table) containing all keys from the files using union
5) Connect a first table to key_table
6) Connect the second table to 5)
7) Loop connecting files until all files are connected
8) Save and return a connected table

I think it is better to use this program as function, and this function takes a directory as parameter. So you will use this function like

data<-connectAll("/home/data/exp1")

It will be also useful if the created table is saved automatically in the same directory. I will put this parameter as well (default will be FALSE).

data<-connectAll("/home/data/exp1",save=TRUE)

1) Install and read sqldf library (please see previous post for dependent packages for sqldf).

install.packages(sqldf)
library(sqldf)

2) Specify directory with setwd() function
The program takes an argument from the function call and change the directory. So it  starts like below.

connectAll<-function(dir, save=FALSE){
setwd(dir)
.......
}

3) Get file names in the directory, then read all tables (.csv) into R

files<-list.files(pattern=".csv")
rdf<-lapply(files, read.csv)

4) Extract key (gene IDs) & description from each table and create a table (key_table) containing all keys from the files using “union” in SQL. Note that you need to store each table in a new dataframe vector called x. If you put this statement inside of sqldf, you get an error. Then for loop will add unique entry of gene IDs until it reaches to the end. “length()” command takes care of the number of looping.

ftable<-key_table
for(i in 2:length(files)){
  x<-as.data.frame(rdf[i])
  ftable<-sqldf("select Accession, Description
  from key_table union select Accession
  , Description from x")
  }

5) Connect a first table to key_table
6) Connect the second table to 5)
7) Loop connecting files until all files are connected

ftable<-key_table

for(i in 1:length(files)){
  x<-as.data.frame(rdf[i])
  colnames(x)[colnames(x)=="Filtered.Spectra"]<-
    "Filtered_Spectra"
  ftable<-sqldf("select ftable.*,x.Filtered_Spectra
     FROM (ftable left join x ON ftable.Accession 
     = x.Accession)")
  colnames(ftable)[colnames(ftable)==
    "Filtered_Spectra"]<-paste(gsub("([A-Za-z0-9]+).csv"
    ,"\\1",files[i]),"_Filtered_Spectra",sep="")
  }

Here, I picked “filtered.spectra” column to show in the output. The sqldf doesn’t like having  a dot “.” in the name because it is confusing…. SQL uses “.” to specify column from tables, so having “.” in the name of the column doesn’t work. The last line adds file name for each column because it would be difficult to look at the table if all columns have the same name “filtered_spectra”. I selected “filtered_spectra”, but user can change or add to it depending on what they want to see.

8) Save file and return connected table. The name of the file will be “final.csv”.

if(save==TRUE){
#Replace all commas to spaces in description column
ftable$Description<-gsub(","," ",ftable$Description)     
write.table(ftable, file="final.csv",quote=FALSE,sep
  =",",row.names=FALSE) 
}
return(ftable)

This function is easy to expand more functionality. For example, you can add another argument to pick columns you want to show in the final table. This program was written for protein ID, but it can be done at ion levels with a charge state.

If you have to analyze data with database software (e.g. microsoft access) on daily basis, and your work can be done using the program like this, it will cut a significant amount of your time.

Final code is here.
Test csv files are here.

P- vs Q- vs PEP-values in Mass Spec Database Search

We often see both P and Q-values in literature and search results by mass spec software, but how many of us actually know how these numbers are generated?  OK, you may say you know that P-value is probability. But what does this really mean? Can you explain?

For peptide spectra matching (PSM), let’s say you have an experimental MS/MS spectra X with 60 peaks and a theoretical MS/MS spectra Y (derived from protein database sequences) with 32 peaks. If 23 peaks match between these two spectrum, how likely does this kind of situation occur by chance?

Actually, this is not enough to calculate the probability of such an event. Because you need to know how many possible positions in the spectra. To be simple, let’ think about this scenario.  Spectra X’s peaks range from 350 to 700, whereas spectra Y ranges from 400 to 850. You have an overlap between 400 to 700 in m/z. If you have mass tolerance of 0.5m/z, you divide this range by 0.5 which is (700-400)/0.5=600.  This number is the number of bins or positions you have in the overlapping range.

spectra_match

Calculate Possible Occasions for Experimental MS/MS

The likelihood to have 23 peaks match from spectra X (60 peaks) and spectra Y (32 peaks) within 600 possible peak positions is calculated by

p-formula

In this formula, denominator calculates all possible combination to pick l(60) peaks out of N(600) positions. The numerator is a product of two numbers, that are the number of possible combination of picking n(23) positions out of m(32) and the number of possible combinations of picking 37 out of 568. The first combination is essentially how many possibilities to have 23 peaks in the correct database mass spectra which has total 32 peaks. The second combination is how many possibilities the unmatched peaks fall in the rest of bins in the range.  The P-value calculated from this formula is 7.72xE-35. This is a very very small number and it means that the scenario described here is a highly unlikely event by random matching. But is this true?

Peak matching is not a completely random event

In fact, spectra matching does not behave like random matching. Let’s look at one example. In this example, theoretical spectra of peptide KGHHEAELKPLAQSHATK (+2) had 18 matches with an experimental spectra. This experimental spectra is searched against a database and examined for the number of peaks matched to the spectrum in the database.

distribution of matched peaks
Fridman et al., J Bioinfo Comp Biol, 2005: 455-476

You can see the distribution of matched peaks (total area under the line should be 1) has shifted towards more right compared to theoretical peak matching distribution (use the same number of peaks with the same range as experimental spectra).

This means that real spectrum more likely match to the theoretical spectrum derived from the databases than pure random matching.

Why?

Because peptides are composed of amino acids with distinct molecular weight. Therefore, the bins in the MS/MS have unequal likelihood to match to experimental spectrum. In fact, the smaller the m/z, the more distinct m/z due to smaller number of amino acid combinations. To correct the experimental distribution of p-value, you can shift the curve by 2 matches in this case.

P-value distribution of random matching follows poisson distribution

Poisson distribution is the probability distribution of a given number of events occurring in some fixed amount of time. Smaller average events, the sharper the peak distribution. Also poisson distribution is only in a positive range as there is no negative number for events. The number of peaks (events) matching to database spectrum for many peptides appear to follow this type of distribution (but not all peptides, of course!). OMSSA uses poisson distribution for P-value modeling.

distribution estimation

However, calculating P-value strictly based on the number of matched peaks is obviously not the best method. There are many properties in MS/MS spectrum which provide useful information for identification. Peak height intensity, intensity distribution,  presence of both b & y ion pair are examples. Good matching often comes with matching intense mass peaks. Mass spec search programs (e.g. X! tandem) take these into account to generate scores. The probability distribution based on scores is used to calculate P-values. There is an excellent PowerPoint slide how to calculate hyperscore for X!tandem.

I want to point out that each search program calculates P-value distribution using specific formula. What is critical is how each formula handles right-tail of P-value distribution which will be important for real-matching spectrum.

Q- & PEP-value

How about Q-value? Q-value is defined as minimum false discovery rate (FDR). If you search MS/MS spectrum against a decoy database, you know that all PSMs are incorrect. If you search against true databases, you get the mixture of correct and wrong hits.

q-value

As you can see the graph above, you can guess the number of falsely identified hits in blue line by deducting the hits from the decoy database search in red. For example, if you draw a line and you see only one wrong hit from decoy search, and 99 hits from true search, you get 1% FDR. Q-value is associated with each PSM, meaning every PSM has Q-value. The Q-value 0.05 means that there are 1 in 20 of higher ranked PSMs that are likely wrong.

Posterior error probability (PEP) is defined as a local probability for the PSM with a given score, which is calculated by the ratio of hits from decoy and true database search with a given score. In the figure above, it is the ratio of  height for a given score in red and red+blue line.  For example, if the score that gives an equal number of matches in both decoy and true databases, PEP will be 0.5, meaning at this score the chance of being incorrect is 50%. This concept is important especially when you find a PSM that was found with 5% FDR, but its PEP is 20%.

There is an excellent review which is easy to read and understand these concepts published by Käll and Noble et al., in 2007.

%d bloggers like this: