
Regular Expression Tutorial 2: Commands in R

The second part of the regular expression tutorial covers the R commands commonly used with regular expressions. Once you know how to write a regular expression to match a string, you may want to manipulate the matched strings, for example by deleting or replacing them. Here is a list of string matching and manipulation commands commonly used with regular expressions in R. Similar commands exist in many other languages.

Command          Function
grep( )          Return the indexes of the elements
                 where the reg exp found a match
grepl( )         Return logical values (TRUE/FALSE) for
                 whether the reg exp matched each element
regexpr( )       Return the position of the first match
                 found by the reg exp
gregexpr( )      Return the positions of all matches
                 found by the reg exp
sub( )           Substitute a pattern with a given string
                 (first occurrence only)
gsub( )          Globally substitute a pattern with a
                 given string (all occurrences)
substr( )        Return the substring between the given
                 character positions (start and stop)
                 in a given string
strsplit( )      Split the input string into parts
                 based on another string (character)
regexec( )       Return the position of the first match
                 and of its parenthesized sub-expressions
regmatches( )    Extract or replace matched substrings
                 from match data obtained by regexpr,
                 gregexpr or regexec

Find & Display Matching string: grep

> x <- c("abc","bcd","cde","def")
> grep("bc",x)
[1] 1 2

The first one is the grep() command, which was originally created on Unix systems. Its name comes from “globally search a regular expression and print”. You see that “bc” appears in the first two entries of x. The grep() function returns the indexes of the matching entries. If you want to show the matched entries themselves (not their indexes), use the value option or use square brackets.

> grep("bc",x,value=TRUE)
[1] "abc" "bcd"
> x[grep("bc",x)]
[1] "abc" "bcd"

Show Matched Pattern Using Find & Replace

If you want to get only the matched pattern, it is a little awkward, but you can take the output above and remove the unmatched part (on Linux, you would just use grep -o).

First, the syntax of the sub function is

sub("matching_string","replacing_string", input_vector)

This function works like “find and replace”. Use it to remove the unmatched part.

> sub(".*(bc).*","\\1",grep("bc",x,value=TRUE))
[1] "bc" "bc"

Remember that .* means any character repeated any number of times, and \\1 means the string matched by the first pair of parentheses. In this case you see only “bc”, but if you use a regular expression for the pattern, you will see the different matches found in each string.
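As a small sketch of that last point, here is the same sub() and grep() combination with a character-class pattern instead of a fixed string. The vector x and the pattern "[bd][ce]" are assumptions chosen only for illustration.

```r
# Assumed example vector (same shape as the x used above)
x <- c("abc", "bcd", "cde", "def")

# A pattern that can match more than one literal string: b or d followed by c or e
pattern <- "[bd][ce]"

# Keep the entries that contain a match
hits <- grep(pattern, x, value = TRUE)

# Wrap the pattern in parentheses and keep only the captured part
matched <- sub(paste0(".*(", pattern, ").*"), "\\1", hits)
print(matched)
```

With this input, every entry matches, and the extracted pieces differ from entry to entry ("bc" for the first two, "de" for the last two).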

Remove Matched String

If you want to return indexes of unmatched string, add invert option.

> grep("bc",x,invert=TRUE)
[1] 3 4

Combined with the value option, this removes the matched strings from the vector.

> grep("bc",x,invert=TRUE, value=TRUE)
[1] "cde" "def"

If the search should not be case sensitive, add the ignore.case option.

> grep("BC",x,ignore.case=TRUE)
[1] 1 2

If you want to get logical values for the matches,

> grepl("bc",x)
[1]  TRUE  TRUE FALSE FALSE
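The logical vector from grepl() is handy for subsetting. The sketch below assumes the same four-element vector x as in the grep() examples and shows keeping, dropping and case-insensitive matching.

```r
# Assumed example vector
x <- c("abc", "bcd", "cde", "def")

# TRUE where the pattern is found, FALSE otherwise
is_match <- grepl("bc", x)

# Use the logical vector to keep or drop entries
kept    <- x[is_match]
dropped <- x[!is_match]

# Case-insensitive matching gives the same answer for "BC"
is_match_ci <- grepl("BC", x, ignore.case = TRUE)

print(kept)
print(dropped)
```

Compared with grep(..., invert=TRUE), negating the logical vector with ! keeps the result the same length as x, which is convenient when the result goes into a data frame column.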

Manipulating String with Matched String Position

To get the first position of the matched pattern in the string, regexpr() is used.

> regexpr("ki",y)
[1] 4
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE

Since the first match starts at the 4th character of y, the first value returned is 4. If there is no match, it returns -1.

If you want to get this value only,

> regexpr("ki",y)[1]
[1] 4

You see that regexpr() returns two attributes, “match.length” and “useBytes”. These values can be accessed by

> attr(regexpr("ki",y),"match.length")
[1] 2
> attr(regexpr("ki",y),"useBytes")
[1] TRUE

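If you want to get positions for all matches, use gregexpr().

> gregexpr("ki",y)
[[1]]
[1] 4 6
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE

To show only the position values, store the result in a variable and index into its first list element with the length function. It is a bit awkward, but it can be done.

> z <- gregexpr("ki",y)
> z[[1]][1:length(z[[1]])]
[1] 4 6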
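Here is a self-contained sketch of the position functions. The string y is not shown in this post, so "Waikiki" below is only an assumption: it is a string that happens to reproduce the positions 4 and 6.

```r
# Assumed input: a string with "ki" starting at characters 4 and 6
y <- "Waikiki"

# First match: position plus the match.length attribute
first     <- regexpr("ki", y)
first_pos <- as.integer(first)            # as.integer() drops the attributes
first_len <- attr(first, "match.length")

# All matches: gregexpr() returns a list with one element per input string
all_pos <- as.integer(gregexpr("ki", y)[[1]])

# No match is reported as -1
no_match <- as.integer(regexpr("zz", y))

print(first_pos)
print(all_pos)
print(no_match)
```

Using as.integer() is a slightly shorter way to strip the attributes than indexing with 1:length().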

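The regexec() command works very similarly to regexpr(). However, if the pattern contains parenthesized sub-expressions, it returns the position of the whole match followed by the position of each parenthesized match. In the output below, the first line shows the positions and the second line shows the match lengths.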

> regexec("kik",y)
[1] 4
[1] 3
> regexec("k(ik)",y)
[1] 4 5
[1] 3 2
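The positions and lengths returned by regexec() can be fed to substr() to pull out the parenthesized part. This is a sketch only; the input "Waikiki" is an assumed string that gives the same positions as above.

```r
# Assumed input string
y <- "Waikiki"

# regexec() returns a list; element 1 describes the match in y
m      <- regexec("k(ik)", y)[[1]]
starts <- as.integer(m)                        # whole match, then group 1
lens   <- as.integer(attr(m, "match.length"))  # lengths in the same order

# Cut out the whole match and the parenthesized part by position
whole <- substr(y, starts[1], starts[1] + lens[1] - 1)
group <- substr(y, starts[2], starts[2] + lens[2] - 1)

print(starts)
print(lens)
print(c(whole, group))
```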

To extract a substring from an input string, use substr().

substr(x,start, stop)

> s <- "abcdef"
> substr(s,3,5)
[1] "cde"

This function can also replace a substring in a string.

> substr(s,3,4) <- "XX"
> s
[1] "abXXef"

Another Way to Show Matched Strings Using regmatches()

I showed one way to list the matched strings using sub() and grep(). You can do the same thing with regmatches() together with regexpr(), gregexpr() or regexec().
First, regexpr() gives you the position of the found string and the length of the matched string in the input, and you pass this information on to regmatches(). With gregexpr(), it will show all the matched strings from the input string. With regexec(), it will show both the matched substring and the substrings matched inside the parentheses.

> a<-"Mississippi contains a palindrome ississi."
> b<-gregexpr(".(ss)",a)
> c<-regexec(".(ss)",a)

> regmatches(a,b)
[1] "iss" "iss" "iss" "iss"

> regmatches(a,c)
[1] "iss" "ss"

The syntax of regmatches() is

regmatches(input, position&length)

Therefore, the position and length information obtained from regexpr(), gregexpr() or regexec() is used to extract the matched strings from the input. Note that regexec() takes only the first match, so you see only “iss” and “ss”.
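To compare the three kinds of match data side by side, here is a short sketch using the same Mississippi sentence. The variable names are mine, not part of any API.

```r
a <- "Mississippi contains a palindrome ississi."

# regexpr(): first match only, returned as a plain character vector
first_match <- regmatches(a, regexpr(".(ss)", a))

# gregexpr(): all matches, returned as a list (one element per input string)
all_matches <- regmatches(a, gregexpr(".(ss)", a))[[1]]

# regexec(): first match followed by its parenthesized part
with_group <- regmatches(a, regexec(".(ss)", a))[[1]]

print(first_match)
print(all_matches)
print(with_group)
```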

Split Strings with Common Separator Using strsplit Function

Suppose you have a date string “11/03/2013” and want to extract the numbers “11”, “03” and “2013”. Since the numbers are separated by the common character “/”, you can use the strsplit function to do the job.

> strsplit("11/03/2013","/")
[1] "11"   "03"   "2013"

If you use “” as the separator, you can extract each character.

> strsplit("11/03/2013","")
 [1] "1" "1" "/" "0" "3" "/" "2" "0" "1" "3"

One thing to remember is that when the string starts with a separator, strsplit puts an empty string first in the vector.

> strsplit(".a.b.c","\\.")
[1] ""  "a" "b" "c"

If a dot (.) is the separator, you need two backslashes in front of it, because an unescaped dot is a regular expression that matches any character.
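A few variations may help here. The sketch below converts the date parts to numbers, splits on more than one separator with a character class, and uses fixed=TRUE as an alternative to escaping the dot. The mixed-separator string is an assumed example.

```r
# strsplit() returns a list, so take the first element
parts <- strsplit("11/03/2013", "/")[[1]]
nums  <- as.integer(parts)

# The separator is a regular expression: split on either "/" or "-"
mixed <- strsplit("11-03/2013", "[/-]")[[1]]

# fixed = TRUE treats the separator literally, so the dot needs no backslashes
dotted <- strsplit(".a.b.c", ".", fixed = TRUE)[[1]]

print(nums)
print(mixed)
print(dotted)
```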


Regular Expression Tutorial 1: special characters

A regular expression is a pattern, written in a compact code, that is useful for finding certain strings in a text file. It can also do ambiguous matching with complex conditions. I use it quite often but forget some details, so this should be useful for people who want to refresh their knowledge of regular expressions.

Certain characters used in regular expressions have special meanings. In other words, these characters are not read the way you see them in Word or Notepad. If you have these characters in a regular expression, the program interprets them differently.

     Name              Function
\    back slash        escape character
[ ]  square brackets   single character match
{ }  curly braces      repeats
( )  parenthesis       reference or subexpression
^    hat               beginning of a line (or of the string)
$    dollar            end of a line (or of the string)
|    pipe              alternation [OR]
*    asterisk          zero or more repeats
+    plus sign         one or more repeats
?    question mark     zero or one occurrence
.    dot               any single character
!    exclamation       negation [NOT], only inside (?! ) and (?<! )

The back slash “\” is the first one in the list. This character is used as an “escape”, meaning that when the program sees it, it does different things depending on which character comes next. The list below shows how to specify non-printable characters.

\t   tab
\n   new line
\r   carriage return
\f   form feed character (end of page character)

The back slash can also be used to specify certain character classes and positions.

\s     a white space
\S     non white space
\d     a digit [0-9]
\D     non digit [^0-9]
\w     word character [a-zA-Z_0-9]
\W     non word character
\b     word boundary, matches a position and no character (read below)
\A     beginning of a string
\z     end of a string

\W and \b are similar in the sense that both can be used to find a string that comes right after a non-word character. The difference is that \b is only a boundary and matches just the part after it, while \W matches the non-word character itself as well. For example, in the sentence “It is very hot today.”, \bho matches only “ho”, while \Who matches the space character plus “ho”. \b defines a word boundary, so if you want to search for the word “all”, you can use \ball\b. It will not match “tallest”, “ball” or “alleged”.

Another use of \ is to treat a special character as a normal character. For example, \^ searches for the ^ character. A double back slash \\ searches for the \ character.
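These escapes can be tried out in R. The sketch below uses perl=TRUE for the class and boundary escapes and doubles every back slash, because R string literals need their own escaping. The word list and the string "2^3" are assumed test inputs.

```r
words <- c("all", "tallest", "ball", "alleged")

# \b on both sides: only the whole word "all" matches
whole_word <- grepl("\\ball\\b", words, perl = TRUE)

s <- "It is very hot today."

# \b matches a position, so the space is not part of the match
with_b <- regmatches(s, regexpr("\\bho", s, perl = TRUE))

# \W matches the non-word character itself (here the space)
with_W <- regmatches(s, regexpr("\\Who", s, perl = TRUE))

# An escaped hat is a literal ^ character
has_caret <- grepl("\\^", "2^3")

print(whole_word)
print(c(with_b, with_W))
```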

Brackets [ ], braces { } and parentheses ( ) in a regular expression each treat the characters inside them differently. Combinations of these allow more complicated string matching.

[ ] Square brackets: Match one character if the character is inside the square brackets. [A-Z] matches one character from A to Z. [0-9] matches a digit (same as \d). [a-zA-Z_0-9] matches any word character. K[ab2] matches Ka, Kb and K2.
^ at the start of square brackets [^ ] matches the characters NOT in the square brackets. For example, [^Ab] matches any character except “A” and “b”.

{ } Curly braces: Usually, you have one or two number(s) in curly braces, such as {3}, {4,5} and {2,}. It takes the preceding character (or group) and looks for the number of repeats specified in the braces. For example, b{3} matches bbb. [1-3]{2} matches 11, 12, 13, 21, 22, 23, 31, 32 and 33. If two numbers are specified, the number of repeats has to be in that range. a{2,4} matches aa, aaa and aaaa. If there is a comma but no second number, the first number is the minimum number of repeats.

( ) Parenthesis: Put characters inside parentheses if you want to refer to the matched string again. To refer to it, use a back slash and a number that indicates the position of the parenthesis. For example, ([ac]b)\1 matches either abab or cbcb, because \1 refers to the string matched in the parenthesis, which is either ab or cb. I explained a little more about the usage of parentheses in another post. Parentheses are also used together with the pipe |. The pipe is alternation; in logic it is equivalent to OR. So if you want to match the string abc OR acc, you can write (abc|acc).
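The bracket and brace examples can be checked with grepl(). In this sketch the patterns are wrapped in ^ and $ so that the whole string has to match; the test strings are assumptions.

```r
# K followed by exactly one of a, b or 2
k_match <- grepl("^K[ab2]$", c("Ka", "Kb", "K2", "Kc"))

# Anything except "A" and "b"
not_Ab <- grepl("^[^Ab]$", c("A", "b", "c"))

# Exactly three b's
three_b <- grepl("^b{3}$", c("bb", "bbb", "bbbb"))

# Two digits, each between 1 and 3
two_digits <- grepl("^[1-3]{2}$", c("13", "34"))

# Between two and four a's
a_range <- grepl("^a{2,4}$", c("a", "aa", "aaaa", "aaaaa"))

print(k_match)
print(a_range)
```

Without the ^ and $ anchors, grepl() would also report TRUE for longer strings that merely contain a match, for example "bbbb" for b{3}.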
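A quick check of the back reference and the alternation in R might look like this. The test strings are assumed, and perl=TRUE is used so that the syntax behaves as described above.

```r
# \1 must repeat exactly what the parenthesis matched
backref <- grepl("^([ac]b)\\1$", c("abab", "cbcb", "abcb"), perl = TRUE)

# Alternation inside a parenthesis: abc OR acc
either <- grepl("^(abc|acc)$", c("abc", "acc", "adc"), perl = TRUE)

print(backref)
print(either)
```

Note that "abcb" fails: the parenthesis matches "ab", so \1 requires another "ab", not just any string that fits [ac]b.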

( ) Parenthesis with ?: When ? comes right after the opening parenthesis, the group works quite differently depending on what comes after it (e.g. =, !, <=, <!, >). It is used to match a substring of an input string only under a condition. Why do you need this kind of function? Let’s say you want to match “gold” in the sentence “I like the gold coin.”, but only when it is followed by “coin”. In cases like this, when you want to match something under a condition but do not want the condition itself to be part of the match, a parenthesis with ? inside is used.

For example, with an equal sign (=), it looks for the substring preceding the parenthesis only where the condition inside the parenthesis follows, but only the substring outside the parenthesis is matched.

gold(?=\scoin)              I really like the gold coin.

Here “gold” is what is matched, and “ coin” is the condition met inside the parenthesis. For a negative assertion, use an exclamation mark ! instead of the equal sign =. For example, if you want to match “gold” not followed by the word “coin”:

gold(?!\scoin\b)           I really like gold medals but not gold coins.

It matches the first “gold”, because it is not followed by “ coin”. The second “gold” is also matched: it is followed by “ coins”, and \b requires the word to end right after “coin”, so the condition \scoin\b is not met. If you remove the \b, the second “gold” is no longer matched, because “ coin” alone then satisfies the condition. If your condition comes before the string you want to match, put < between ? and =, as in (?<= ).
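The effect of \b inside the negative assertion is easy to get wrong, so here is a sketch that checks it in R. Lookahead and lookbehind need perl=TRUE in R; the lookbehind pattern at the end is an extra example of mine, not taken from the text above.

```r
s1 <- "I really like the gold coin."
s2 <- "I really like gold medals but not gold coins."

# Positive lookahead: "gold" only when " coin" follows
ahead <- regmatches(s1, regexpr("gold(?=\\scoin)", s1, perl = TRUE))

# Negative lookahead with \b: only the whole word "coin" blocks a match,
# so "gold coins" still matches
with_b <- as.integer(gregexpr("gold(?!\\scoin\\b)", s2, perl = TRUE)[[1]])

# Negative lookahead without \b: "gold coins" is blocked as well
without_b <- as.integer(gregexpr("gold(?!\\scoin)", s2, perl = TRUE)[[1]])

# Positive lookbehind: "gold" only when "the " comes before it
behind <- as.integer(regexpr("(?<=the\\s)gold", s1, perl = TRUE))

print(with_b)
print(without_b)
```

The two position vectors show the difference: with \b both occurrences of “gold” in the second sentence are found, and without \b only the first one is.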

There are more ways to match a string with conditions using different special characters, and they are also quite useful. I am not going to discuss them here.

If you want to quickly test how your regular expression behaves, you can use a regular expression test web site or a text editor such as EditPad Pro.

You can also use a lot of random words to check what actually matches your regular expression.

Other useful links

Common regular expressions (e.g. URL, email and etc)

Regular expression library

More regular expression examples

Batch X!tandem Search on Linux


Every day I generate 6-10 MS/MS spectrum files, and I got tired of repeatedly clicking and typing file names, output names and species for X! tandem. I want to automate the search so that I can save time and do something else. I know there is a program for batch X! tandem, but you may not necessarily have a graphical interface on your Linux system (at least mine doesn’t). So it is useful and more flexible if you can do this on the command line. OK, in order to run X! tandem, you need 4 files in the same directory:
1) input.xml
2) default.xml
3) taxonomy.xml
4) tandem.exe
The MS/MS spectrum file name and the output file name are the ones you change often, and these are stored in the input.xml file. There are several ways to automate the search.

1) Create as many input.xml files as there are MS/MS spectrum files, and then write a script to run tandem.exe on them sequentially
2) Create a file that contains the MS/MS spectrum file names, then write a script that reads it line by line, modifies the input.xml file and executes tandem.exe until all files are searched
3) Place all MS/MS spectrum files in one directory, and write a script to run tandem.exe for all files in the directory

Method 1) is easy to implement if you have only a few files. But if you have more files (>10), it is cumbersome, and 2) will work better. If you have many files (>50), typing (or copying and pasting) file names takes time, so 3) will work best.

Here, I am going to show you how to implement method 2). First you create a file, let’s say one called “file_name.txt”. This file contains all MS/MS spectrum file names with their directory information, one per line.


Place this file in the same directory as all the other necessary files listed above. Then write shell scripts to automate the search.

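#!/bin/bash
# Run X! tandem once for each spectrum file listed in file_name.txt
while read -r line
do
  echo -e "Writing $line in input.xml\n"
  # Put the spectrum file path into a copy of input.xml
  sed 's=spectrum, path">[^<]*<=spectrum, path">'"$line"'<=' <input.xml >input1.xml
  # Output name: spectrum path without its extension, plus _output.xml
  sed 's=output, path">[^<]*<=output, path">'"${line%.*}"'_output.xml<=' <input1.xml >input2.xml
  ./tandem.exe input2.xml
done < file_name.txt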

I used a while loop to read each line of file_name.txt until it reaches the end. Each line is stored in a variable ($line), and I want to insert this variable at certain places in the xml file. If you look at the input.xml file in X! tandem, the input file and output file names are defined in two lines (the 2nd and 3rd lines from the end).


The sed command is very useful for finding and replacing a character string. The basic format is

sed 's/abc/def/' <file

Here the string abc is replaced with def where it is found in file (the first occurrence on each line). The slash (/) is used as the delimiter. However, you have to be careful about which delimiter you use. As $line contains slashes, you cannot use a slash as the delimiter. I used the equal sign (=), which does not appear in this regular expression.
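For comparison with the R commands from the tutorial posts, the same find and replace can be sketched with sub(). Because the pattern and the replacement are separate arguments in R, slashes in the path need no special delimiter. The XML line and the path below are assumptions made up for illustration, not copied from a real input.xml.

```r
# Assumed line from input.xml and assumed spectrum path
xml_line <- '<note type="input" label="spectrum, path">old.mgf</note>'
spectrum <- "/data/run1.mgf"

# Keep the label (group 1) and replace everything up to the next "<"
new_line <- sub('(spectrum, path">)[^<]*<', paste0("\\1", spectrum, "<"), xml_line)

# Output name: drop the extension (like ${line%.*}) and add _output.xml
out_name <- paste0(sub("\\.[^.]*$", "", spectrum), "_output.xml")

print(new_line)
print(out_name)
```

This only rewrites one line in memory. It does not read or write input.xml and does not run tandem.exe.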

Finally, results will be written as input_file_name_output_XXXX_XX_XX_XX_XX_XX.t.xml in the same directory as the input MS/MS files.

Modify Protein Database Using Regular Expression

I posted earlier about the protein database requirements for PeptideShaker: the database must contain decoy sequences, and each decoy sequence must carry the tag “_REVERSED”. How can I add the tag to all (tens of thousands of) protein entries?
Regular expressions, which are available in various systems, are very useful for converting text in a systematic manner. I am going to show you one way to do this. First, I am going to use a text editor called EditPad Pro. This is a great text editor and you can use many of its tools for free, but you have to pay for the full version.

Once you have installed the program, open your fasta file, which contains both target and decoy sequences, with this application. There are many applications that create target+decoy sequences, so I am not going to discuss them here. In this example, I used COMPASS to generate the file. Then go to Find or press CTRL+F to search and replace text. You should see a window like the one below. This example uses bovine sequences from Uniprot, which is recommended by PeptideShaker.


You see two white boxes at the bottom, one for the search text and the other for the replacement text.

Before you do any operation, you need to check how many entries there are in this fasta database. To do that, just type “^>”, select the regular expression option and start from beginning, then click the button for count matches. The hat “^” represents the beginning of each line.


Remember this number (48466 entries), or just write it down somewhere. Then look at the first and second entries,

>Tr|A0JB29|A0JB29_BOVIN Bucentaur-2 OS=Bos taurus..

>DECOY_tr|A0JBZ9|A0JBZ9_BOVIN Bucentaur-2 OS=Bos taurus..

You want to remove “DECOY_” and then add the “_REVERSED” tag right before the second “|” for every decoy entry in the database.

>Tr|A0JB29_REVERSED|A0JB29_BOVIN Bucentaur-2 OS=Bos taurus..

It is pretty easy to do for a few entries, but how about the other 24 thousand?

Let’s look at the entry more carefully. Each decoy entry starts with “>DECOY_” and two letters (Tr), then the protein ID sits between two “|”s.


You don’t necessarily have to break it up into two parts, but this is just an exercise and it will make the expression more flexible in the future. Here is the regular expression you need to capture the beginning of each entry.


The first ( ) captures the “Tr” part and the second ( ) captures the protein ID part “A0JB29”. The square brackets [ ] represent one letter that matches any character inside the brackets. The pipe “|” is a special character, so you need to put a backslash in front of it. “.” is a wild card that can be any character and “*” repeats it, so “.*” covers any characters until it finds the next “|”.
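The search expression itself was shown as a screenshot, so the pattern below is only a guess that fits the description: two letters in the first parenthesis, the protein ID in the second, and escaped pipes. It is written for R rather than EditPad Pro, and the two header lines are assumed sample entries.

```r
# Assumed sample headers: one target entry and one decoy entry
headers <- c(">tr|A0JB29|A0JB29_BOVIN Bucentaur-2 OS=Bos taurus",
             ">DECOY_tr|A0JB29|A0JB29_BOVIN Bucentaur-2 OS=Bos taurus")

# Hypothetical pattern: group 1 = two letters, group 2 = protein ID
decoy_pattern <- "^>DECOY_([A-Za-z]{2})\\|([^|]*)\\|"

# Count all entries and the decoy entries, like "Count matches"
n_entries <- sum(grepl("^>", headers))
n_decoys  <- sum(grepl(decoy_pattern, headers))

# Show the whole match and the two captured parts for the decoy entry
parts <- regmatches(headers[2], regexec(decoy_pattern, headers[2]))[[1]]

print(c(n_entries, n_decoys))
print(parts)
```

Here the ID is captured with [^|]* instead of a dot and asterisk, so the match cannot run past the next “|”.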

OK, let’s see if it works. Select the regular expression option and start from beginning. Then hit search.


Can you see the text in the first entry that is highlighted in blue? Try hitting search a few times to see if it finds the right piece in each entry.

Then try again by selecting regular expression and start from beginning. This time, click “Count matches” instead of “Search”.


You see that it found the search text 24233 times. Since this is exactly half of 48466 (all entries), the regular expression successfully captures all decoy entries.

Then how can you replace the text and add the “_REVERSED” tag? This is the regular expression for the replacement.


\1 is the first part “tr” and \2 is the second part “A0JB29”. Then the tag is added, followed by a pipe. Try testing a few more entries to see if they change correctly. If they do, click “Replace All”. Finally, check the entries by eye to see that everything went OK. That’s it!
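The same replacement can be sketched in R with sub(). The replacement expression was also a screenshot, so both the pattern and the replacement string here are assumptions that follow the description (first part, pipe, second part, tag, pipe). The header lines are made-up samples.

```r
# Assumed sample headers: one target entry and one decoy entry
headers <- c(">tr|A0JB29|A0JB29_BOVIN Bucentaur-2 OS=Bos taurus",
             ">DECOY_tr|A0JB29|A0JB29_BOVIN Bucentaur-2 OS=Bos taurus")

# Hypothetical search pattern: group 1 = two letters, group 2 = protein ID
decoy_pattern <- "^>DECOY_([A-Za-z]{2})\\|([^|]*)\\|"

# Rebuild the start of the header: drop DECOY_, add the tag before the second pipe
tagged <- sub(decoy_pattern, ">\\1|\\2_REVERSED|", headers)

print(tagged)
```

Target entries do not match the pattern, so sub() returns them unchanged. Only the decoy header gets the tag.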
