Modify a Protein Database Using Regular Expressions

I posted earlier about the protein database requirements for PeptideShaker: the database must contain decoy sequences, and each decoy entry must carry the tag “_REVERSED”. How can you add the tag to all the decoy entries when there are tens of thousands of them?
Regular expressions, which are supported by many tools, are very useful for transforming text in a systematic way. I am going to show you one way to do this, using a text editor called EditPad Pro. It is a great text editor, and many of its tools can be used for free, but you have to pay for the full version.

Once you have installed the program, open your FASTA file, which should contain both target and decoy sequences. There are many applications that create target+decoy sequences, so I am not going to discuss them here. In this example, I used COMPASS to generate the file. Then go to Find, or press Ctrl+F, to open search and replace. You should see a window like the one below. This example uses bovine sequences from UniProt, which is recommended by PeptideShaker.


You see two white boxes at the bottom: one for the search text, and the other for the replacement text.

Before you change anything, check how many entries the FASTA database contains. To do this, type “^>” in the search box, select the regular expression option, choose to start from the beginning, and click the button to count matches. The caret “^” represents the beginning of each line.
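If you prefer to double-check the count outside the editor, the same “^>” search can be sketched in Python. This is only an illustration: the function name and the two-entry sample text are made up for the example, with headers borrowed from this post and placeholder sequences.

```python
import re

def count_fasta_entries(fasta_text: str) -> int:
    """Count FASTA entries by counting lines that start with '>'."""
    # re.MULTILINE makes '^' match at the start of every line,
    # not only at the start of the whole text.
    return len(re.findall(r"^>", fasta_text, flags=re.MULTILINE))

# Hypothetical two-entry sample; the sequences are placeholders.
example = (
    ">tr|A0JB29|A0JB29_BOVIN Bucentaur-2\n"
    "MKV\n"
    ">DECOY_tr|A0JBZ9|A0JBZ9_BOVIN Bucentaur-2\n"
    "VKM\n"
)
print(count_fasta_entries(example))
```

For a real database you would read the file contents into a string first and pass that to the function.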


Remember this number (48466 entries); just write it down somewhere. Then look at the first and second entries:

>tr|A0JB29|A0JB29_BOVIN Bucentaur-2 OS=Bos taurus..

>DECOY_tr|A0JBZ9|A0JBZ9_BOVIN Bucentaur-2 OS=Bos taurus..

You want to remove “DECOY_” and add the “_REVERSED” tag right before the second “|” of every decoy entry in the database, like this:

>tr|A0JB29_REVERSED|A0JB29_BOVIN Bucentaur-2 OS=Bos taurus..

That is pretty easy to do by hand for a few entries, but what about the other 24 thousand?

Let’s look at the entry more carefully. Each decoy entry starts with “>DECOY_” followed by two letters (tr), and then the protein ID sits between two “|” characters.


You don’t necessarily have to break the match up into two parts, but this is just an exercise, and doing so will make the expression more flexible in the future. Here is the regular expression you need to capture the beginning of each entry.


The first ( ) captures the “tr” part, and the second ( ) captures the protein ID part, “A0JB29”. Square brackets [ ] match a single character out of the set listed inside the brackets. The pipe “|” is a special character, so you need to escape it with a backslash: “\|”. The dot “.” matches any single character, and “*” means zero or more repetitions of whatever precedes it, so “.*” matches any run of characters, here the protein ID up to the “|” that follows it.
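To make the two capture groups concrete, here is a rough Python sketch. The pattern is my assumption of an expression that does the job described above, not necessarily the exact one typed into EditPad Pro; in particular it uses “[^|]*” (any run of characters that are not a pipe) in place of “.*”, so the second group cannot run past the next pipe.

```python
import re

# Assumed pattern for illustration only:
#   group 1 = two lowercase letters after ">DECOY_" (e.g. "tr")
#   group 2 = the protein ID between the two pipes
DECOY_HEADER = re.compile(r"^>DECOY_([a-z][a-z])\|([^|]*)\|", flags=re.MULTILINE)

decoy = ">DECOY_tr|A0JBZ9|A0JBZ9_BOVIN Bucentaur-2"
target = ">tr|A0JB29|A0JB29_BOVIN Bucentaur-2"

match = DECOY_HEADER.search(decoy)
db_prefix, protein_id = match.groups()
print(db_prefix, protein_id)

# A target header has no ">DECOY_" prefix, so the pattern does not match it.
print(DECOY_HEADER.search(target))
```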

OK, let’s see if it works. Select the regular expression option and start from the beginning, then hit Search.


Can you see the text highlighted in blue in the first entry? Hit Search a few more times to check that it finds the right piece of each entry.

Then try again with the regular expression option selected, starting from the beginning. This time, click “Count matches” instead of “Search”.


You can see that it found the search text 24233 times. Since this is exactly half of 48466 (all entries), the regular expression successfully captures all the decoy entries.
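The same sanity check can be scripted. The sketch below counts all headers and all decoy headers and confirms that the decoys make up exactly half; the function name is hypothetical and the decoy pattern is again an assumed one.

```python
import re

ALL_HEADERS = re.compile(r"^>", flags=re.MULTILINE)
# Assumed decoy pattern, for illustration only.
DECOY_HEADERS = re.compile(r"^>DECOY_([a-z][a-z])\|([^|]*)\|", flags=re.MULTILINE)

def decoys_are_half(fasta_text: str) -> bool:
    """Return True when exactly half of the entries match the decoy pattern."""
    total = sum(1 for _ in ALL_HEADERS.finditer(fasta_text))
    decoys = sum(1 for _ in DECOY_HEADERS.finditer(fasta_text))
    return total == 2 * decoys

# The numbers reported in this post pass the same arithmetic check.
counts_from_post_agree = (2 * 24233 == 48466)
print(counts_from_post_agree)
```

If the function returns False on your file, either some decoy headers have a different shape than the pattern expects, or the file does not have one decoy per target.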

So how do you replace the text to add the “_REVERSED” tag? This is the regular expression for the replacement.


\1 is the first part, “tr”, and \2 is the second part, “A0JB29”. The tag is added after the ID, followed by a pipe. Try replacing a few entries one at a time to see if they change correctly. If they do, click “Replace All”. Finally, check the entries by eye to make sure everything went OK. That’s it!
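For anyone who would rather script the whole replacement, here is a minimal Python sketch of the same idea. Both the search pattern and the replacement string are assumptions written to match the description in this post, and `tag_decoys` is a made-up helper name.

```python
import re

# Assumed search pattern: ">DECOY_", two letters, "|", protein ID, "|".
DECOY_HEADER = re.compile(r"^>DECOY_([a-z][a-z])\|([^|]*)\|", flags=re.MULTILINE)
# Assumed replacement: drop "DECOY_", keep \1 and \2, append the tag before the pipe.
REPLACEMENT = r">\1|\2_REVERSED|"

def tag_decoys(fasta_text: str):
    """Return (new_text, number_of_replacements)."""
    return DECOY_HEADER.subn(REPLACEMENT, fasta_text)

# Hypothetical two-entry sample; the sequences are placeholders.
sample = (
    ">tr|A0JB29|A0JB29_BOVIN Bucentaur-2\n"
    "MKV\n"
    ">DECOY_tr|A0JBZ9|A0JBZ9_BOVIN Bucentaur-2\n"
    "VKM\n"
)
new_text, n_replaced = tag_decoys(sample)
print(n_replaced)
print(new_text)
```

The replacement count returned by `subn` plays the same role as “Count matches” in the editor: it should equal the number of decoy entries you counted earlier.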

About bioinfomagician

Bioinformatic Scientist @ UCLA
