My favorite commands Part1: Split and copy using AWK

I haven’t updated my blog for while…… I have been pretty busy since the beginning of 2014 but that was my excuse, shame on me.  Some people told me my blog is useful and that gave me motivation to update. Let’s do it!!

Before I started extensively using R or linux command line  as a main tool for manipulating data, my main data analysis software has been microsoft excel or access program. They do work for many things, but certain things are more difficult to implement.  For example,  you have data like this

Screen Shot 2014-07-18 at 1.12.14 PM

These are real data I got for 3’UTR sequences from Rat. Notice that you have 4 columns and the last column has Entrez ID where two cells contain multiple IDs separated by semicolon. The task is to split the Entrez ID column for individual IDs and create new row(s) with the first three columns. So the final data looks like this.

Screen Shot 2014-07-18 at 1.08.32 PM

Do you see what I want to accomplish? Colored parts are identical, notice that Entrez ID column has only single entry now. Before, IDs in a single cell were separated by “;”. After split, each entrez ID has its own row. This can be done in two command line scripts in linux/mac terminal.

1) Add “;” between the third and fourth column.  More precisely, after the third delimiter.

>awk -F “\t” ‘{print $1″\t”$2″\t”$3″\t;”$4}’ rat_3utr.xls > temp

2) Split Entrez IDs and copy the first three columns

>awk -F “;” ‘{print $1″ “$2}{for (i=3;i<=NF;i++)print $1” “$i}’ temp > rat_3utr_single_ID.xls

3) remove temp file

>rm temp

 Syntax of AWK using -F option

AWK is extremely useful command to manipulate text. AWK is available in both linux and mac command line without installation.  What it does is you specify the pattern of string, and action(s) provided on matched strings. With -F option, you specify delimiter {e.g. \t for tab}, then the awk command will split the each line from the input file  into the parts separated by the delimiter. The parts are stored in $1, $2, $3 … in the order. The total number of parts are stored in NF. Here is the basic syntax of AWK with -F option

awk -F delimiter” ’{action 1}{action 2}{…}{action n}’ input_file

So 1) is essentially splitting the line  into 4 parts separated by tab, stored the parts in $1 ~ $4. Then, print the first 3 parts followed by “;” and the last part ($4). The second line is similar, but this time, “;” is the delimiter so $1 contains three columns and $2 is the first Entrez ID. You repeat this printing process until it reaches to the last ID.

In order for these scripts to work, you need

1) the column you have multiple IDs are the far right.

2) The data are tab delimited (you can have any delimiter you like, just specify the right one).

3) Use for loop or something in the first command if you have lots of columns

The script should work for many IDs in the last column and complete the task very fast.


About bioinfomagician

Bioinformatic Scientist @ UCLA

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: