My favorite commands Part2: Data reduction using ddply

In the previous post, I showed how to split the cell by a delimiter and copy other cells to expand the spread sheet. Sometimes you want to do the opposite, which is to reduce the number of rows which contain common IDs. For example, I have a spreadsheet that contains gene ID and length and GC content of 3’UTR for rat. As you know there are many genes with various 3’UTRs, therefore you have multiple rows for each gene. Now I may want to make a summary of 3’UTR length and GC content by averaging. Maybe I want to create a column with ID, min length, avg length, maximum length and so on. In this post, I am going to show you a simple example of how to reduce data by applying function to multiple rows with common IDs.

Screen Shot 2014-07-27 at 8.45.40 AM

You see that first column entrez ID is an identifier which appears multiple times in the data. The data were sorted, but it doesn’t have to be. The goal of simple task is to create a new data frame with ID, average of length (2nd column) and GC content (3rd column).

Screen Shot 2014-07-27 at 8.45.45 AM

How do you accomplish this task? There is a very convenient command in R called ddply. Let’s install and load package called plyr.

>install.packages("plyr")
>library(plyr)

The help page of ddply command shows

ddply(.data, .variables, .fun = NULL, ..., .progress =
 "none",.inform = FALSE, .drop = TRUE, .parallel = FALSE,
 .paropts = NULL)

They key is the first three arguments. The first one is the dataframe you want to  work with. The second one is the column name to get rows with the same information. In the example above, that would be “entrez ID”. The third argument, function, is the most critical one. The command will first look at the top cell of the column specified by the second parameter. Then it will find all rows with the same information. For example, assumethe following  ddply function is called,

>avgdata<-ddply(data1,.(entrezID),calavg)

If data1 is the dataframe of the example on the top of page, ddply will grab the first entry in the second argument “entreZID”, which is “24244”. Then it will look for other rows that also contain “24244” in entrezID. In this example, there are two rows with the same “entrezID”, it will send a dataframe with theses two rows to the function of the third argument. Screen Shot 2014-07-27 at 4.02.09 PM

So you need to create a function which takes a dataframe like this  and returns a dataframe with average data below. Screen Shot 2014-07-27 at 4.04.24 PM

Let’s create a function called calavg

calavg<-function(x){
   avg<-apply(x[,2:3],2,mean)
   return(avg)
}

This simple function takes a dataframe and applies column-wide  mean function. The results are stored in a dataframe “avg”, which will be returned upon completion. So once you call ddply script, you should get:

> ddply(data1,.(entrezID),calavg)
entrezID length_3utr GC_3utr
1 24244 1789.5000 0.4593695
2 24346 157.0000 0.4076433
3 24356 3394.5000 0.4298199
4 100912619 114.3333 0.4249794

Writing a proper function is the most critical and it is not hard to implement more complicated methods for data reduction. For those who want to explore more about ddply, please refer below.

http://seananderson.ca/courses/12-plyr/plyr_2012.pdf

Advertisements

About bioinfomagician

Bioinformatic Scientist @ UCLA

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: