Archive | July 2014

My favorite commands Part 2: Data reduction using ddply

In the previous post, I showed how to split a cell by a delimiter and copy the other cells to expand a spreadsheet. Sometimes you want the opposite: to reduce the number of rows that share a common ID. For example, I have a spreadsheet that contains the gene ID, length, and GC content of rat 3’UTRs. As you know, many genes have multiple 3’UTRs, so there are multiple rows per gene. I may want to summarize 3’UTR length and GC content by averaging, or create columns with the ID, minimum length, average length, maximum length, and so on. In this post, I am going to show you a simple example of how to reduce data by applying a function to multiple rows that share a common ID.

[Screenshot: input data with entrez ID, 3’UTR length, and GC content columns]

You can see that the first column, entrez ID, is an identifier that appears in multiple rows. The data happen to be sorted, but they don’t have to be. The goal of this simple task is to create a new data frame with the ID, the average length (2nd column), and the average GC content (3rd column).

[Screenshot: desired output, one row per entrez ID with averaged length and GC content]

How do you accomplish this task? There is a very convenient R command called ddply. Let’s install and load the plyr package.

> install.packages("plyr")
> library(plyr)

The help page of the ddply command shows:

ddply(.data, .variables, .fun = NULL, ..., .progress = "none",
      .inform = FALSE, .drop = TRUE, .parallel = FALSE,
      .paropts = NULL)

The key is the first three arguments. The first is the data frame you want to work with. The second is the name of the column used to group rows with the same information; in the example above, that would be “entrezID”. The third argument, the function, is the most critical one. The command first looks at the top cell of the column specified by the second argument, then finds all rows with the same value there. For example, assume the following ddply call:

> avgdata <- ddply(data1, .(entrezID), calavg)

If data1 is the data frame from the example at the top of the page, ddply will grab the first entry of “entrezID”, which is “24244”. Then it will look for other rows that also contain “24244” in entrezID. In this example there are two such rows, so it will send a data frame with these two rows to the function given as the third argument.

[Screenshot: the two rows sharing entrezID 24244]

So you need to create a function that takes a data frame like this and returns the averaged data below.

[Screenshot: a single row with entrezID 24244 and the averaged length and GC content]

Let’s create a function called calavg:

calavg <- function(x) {
  # column-wise means of the 2nd and 3rd columns (length and GC content)
  avg <- apply(x[, 2:3], 2, mean)
  return(avg)
}

This simple function takes a data frame and applies the mean function column-wise. The results are stored in a named vector, avg, which is returned upon completion; ddply then assembles these per-ID results into a single data frame. So once you call ddply, you should get:

> ddply(data1, .(entrezID), calavg)
   entrezID length_3utr   GC_3utr
1     24244   1789.5000 0.4593695
2     24346    157.0000 0.4076433
3     24356   3394.5000 0.4298199
4 100912619    114.3333 0.4249794
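As a side note, the same per-ID averaging can also be sketched outside R with awk. This is my own sketch, not part of the original post, and the sample values piped in below are made up:

```shell
# Per-ID averaging in awk: a sketch, not from the original post.
# The printf line feeds made-up sample rows (entrezID, length, GC content).
printf '24244\t1000\t0.4\n24244\t2579\t0.5\n24346\t157\t0.41\n' |
awk -F "\t" '{
  n[$1]++          # rows seen per entrez ID
  len[$1] += $2    # running sum of lengths
  gc[$1]  += $3    # running sum of GC content
} END {
  for (id in n) print id, len[id] / n[id], gc[id] / n[id]
}' | sort -n
# prints: 24244 1789.5 0.45
#         24346 157 0.41
```

Because the order of awk’s `for (id in n)` loop is unspecified, the final `sort -n` makes the output deterministic.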

Writing a proper function is the most critical part, and it is not hard to implement more complicated data-reduction methods. For those who want to explore ddply further, see the tutorial below.

http://seananderson.ca/courses/12-plyr/plyr_2012.pdf


My favorite commands Part 1: Split and copy using AWK

I haven’t updated my blog for a while… I have been pretty busy since the beginning of 2014, but that is just an excuse, shame on me. Some people told me my blog is useful, and that gave me motivation to update. Let’s do it!!

Before I started extensively using R and the Linux command line as my main tools for manipulating data, my main data analysis software was Microsoft Excel or Access. They do work for many things, but certain tasks are harder to implement. For example, suppose you have data like this:

[Screenshot: input data, four columns, with the last (Entrez ID) column containing semicolon-separated IDs]

These are real data I got for 3’UTR sequences from rat. Notice that there are four columns, and the last column holds the Entrez ID, where two cells contain multiple IDs separated by semicolons. The task is to split the Entrez ID column into individual IDs and create new row(s) carrying over the first three columns, so that the final data look like this:

[Screenshot: output data with one Entrez ID per row, the first three columns duplicated as needed]

Do you see what I want to accomplish? The colored parts are identical; notice that the Entrez ID column now has only a single entry per row. Before, IDs within a single cell were separated by “;”; after the split, each Entrez ID has its own row. This can be done with two command-line scripts in a Linux/Mac terminal.

1) Add “;” between the third and fourth columns, i.e., right after the third delimiter.

> awk -F "\t" '{print $1"\t"$2"\t"$3"\t;"$4}' rat_3utr.xls > temp

2) Split the Entrez IDs and copy the first three columns.

> awk -F ";" '{print $1" "$2}{for (i=3;i<=NF;i++) print $1" "$i}' temp > rat_3utr_single_ID.xls

3) Remove the temp file.

> rm temp
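To see the two-step pipeline in action, here is a self-contained demo on a small made-up sample (the file names and values below are mine, not from the post):

```shell
# Make a tiny tab-delimited sample mimicking the post's layout:
# three data columns plus an Entrez ID column with semicolon-separated IDs.
printf 'chr1\t100\t0.50\t24244;24346\nchr2\t200\t0.40\t24356\n' > sample.tsv

# Step 1: insert ";" after the third delimiter so every ID is preceded by ";"
awk -F "\t" '{print $1"\t"$2"\t"$3"\t;"$4}' sample.tsv > temp

# Step 2: split on ";"; $1 holds the first three columns, $2..$NF the IDs
awk -F ";" '{print $1" "$2}{for (i=3;i<=NF;i++) print $1" "$i}' temp
# Each ID now gets its own row (three rows from the two input lines).

# Step 3: clean up
rm temp sample.tsv
```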

Syntax of AWK with the -F option

AWK is an extremely useful command for manipulating text, and it is available on both Linux and Mac command lines without installation. You specify a pattern of strings and one or more actions to perform on matching lines. With the -F option you specify a delimiter (e.g., \t for tab), and awk splits each line of the input file into the parts separated by that delimiter. The parts are stored in $1, $2, $3, … in order, and the total number of parts is stored in NF. Here is the basic syntax of AWK with the -F option:

awk -F "delimiter" '{action 1}{action 2}{…}{action n}' input_file
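A minimal illustration (my own toy input, not from the post) of how -F populates the fields and NF:

```shell
# Split a toy line on ";" and show how the parts land in $1..$NF.
echo 'a;b;c' | awk -F ";" '{print "NF=" NF, "first=" $1, "last=" $NF}'
# prints: NF=3 first=a last=c
```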

So command 1) essentially splits each line into 4 parts separated by tabs, stores the parts in $1–$4, and prints the first three parts followed by “;” and the last part ($4). The second command is similar, but this time “;” is the delimiter, so $1 contains the first three columns and $2 is the first Entrez ID. The printing is repeated until it reaches the last ID.

In order for these scripts to work, you need:

1) The column with multiple IDs must be the rightmost one.

2) The data must be tab delimited (you can use any delimiter you like, just specify the right one).

3) If you have lots of columns, use a for loop or similar in the first command instead of listing each field.

The script works even with many IDs in the last column and completes the task very quickly.
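On point 3), one way to avoid listing every field by hand (a sketch of mine, not from the original post) is to let awk rebuild the record after modifying the last field, which works for any number of columns:

```shell
# Prepend ";" to the last field no matter how many columns there are.
# Reassigning $NF makes awk rejoin all fields using OFS (set to tab here),
# so no explicit $1"\t"$2"\t"... printing is needed.
printf 'a\tb\tc\td\te1;e2\n' |
awk -F "\t" 'BEGIN {OFS = "\t"} {$NF = ";" $NF; print}'
# Output is the same line, still tab-separated, with ";" before "e1;e2".
```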
