Archive | August 2014

My favorite commands Part3: ‘sweep’ function in R

In the last two posts, I described useful commands either to expand or reduce your data. In R, “apply” function is used to apply a function you specify to the data frame. For example, you can use “mean” function to calculate mean values for each column and row.

data<-matrix(seq(1,12),ncol=4,nrow=3,byrow=TRUE)

> data
 [,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12

> colMean<-apply(data,2,mean)
> colMean
[1] 5 6 7 8

> rowMean<-apply(data,1,mean)
> rowMean
[1] 2.5 6.5 10.5

You see that to apply mean function by column-wise manner, you use “2” in the second argument For row-wise mean, you use “1”.

The third argument in “apply” can be a user-defined one, as long as it takes the right data.  For example,

myfunc<-function(x){
t=1
for (i in x){t=t*i}
return(t)
}

This function multiplies all the numbers in x and return the value.

> multCol<-apply(data,2,myfunc)
> multCol
[1] 45 120 231 384

“apply” function is useful to this kind of simple column-wise or row-wise calculation, however it would be cumbersome if you want to add different values for each column/row.

sweep function in R

R has a convenient function to apply different values to data in different columns/rows. For example, you want to subtract “3”, “4”,”5″ ,”6″ from each value in the first, 2nd, 3rd and the last column. You can do this by simply applying sweep function.

> sweep(data,2,c(3,4,5,6),”-“)
[,1] [,2] [,3] [,4]
[1,] -2 0 2 4
[2,] -1 1 3 5
[3,] 0 2 4 6

Basic syntax of sweep function is like this

Screen Shot 2014-08-12 at 10.19.59 AM

sweep takes four arguments to work. The first one is the data you want to modify. The second one is the margin, if it is 1 then function is applied in row-wise manner. If it is 2, it is applied in column-wise manner. This is the same as “apply” function. The third argument is a vector with the same length as column or row. If the second argument is “1′, then the vector has to have the same length as row. If it is “2”, the length of vector needs to be the same length as column. The last argument is a binary operator, such as  – ,  + , * , / , < and >.

Standardizing data matrix using sweep

Gene expression data usually have large ranges and it is sometimes useful to standardize the data to see which genes go up or down more dramatically than others. The mathematics is pretty simple:

Screen Shot 2014-08-12 at 11.12.13 AM

You have normalized expression data first,  calculate median expression for each gene and then subtract the value and divide the results by median absolute deviation. Since we are subtracting median expression which differs for each gene, sweep can come into play here. By standardizing, all genes are centered to their median expression values and expression change is also normalized by dividing with median absolute deviation. In this way, gene expression changes are not biased by expression levels.

Here is the code:

standardize <- function(z) {
rowmed <- apply(z, 1, median)
rowmad <- apply(z, 1, mad)  # median absolute deviation
rv <- sweep(z, 1, rowmed,"-")  #subtracting median expression
rv <- sweep(rv, 1, rowmad, "/")  # dividing by median absolute deviation
return(rv)
}

Test this function

>data<-data.frame(data)
>colnames(data)<-c("sample1","sample2","sample3","sample4")
>rownames(data)<-c("gene1","gene2","gene3")
>standardize(data)
      sample1   sample2    sample3   sample4
gene1 -1.011736 -0.3372454 0.3372454 1.011736
gene2 -1.011736 -0.3372454 0.3372454 1.011736
gene3 -1.011736 -0.3372454 0.3372454 1.01173

There is a pretty similar function called “scale” in R, but it is slightly different from this function. If “scale” function is used, you will get:

> scale(t(data),scale=T)
             gene1      gene2      gene3
sample1 -1.1618950 -1.1618950 -1.1618950
sample2 -0.3872983 -0.3872983 -0.3872983
sample3  0.3872983  0.3872983  0.3872983
sample4  1.1618950  1.1618950  1.1618950

Note that you need to transpose the data because “scale” function works column wise-manner. You see that values are not the same because “scale” function uses standard deviation instead of median absolute deviation. To get identical results, you use a vector with median absolute deviation in the second argument for “scale” function.

> scale(t(data),scale=apply(data,1,mad))
gene1 gene2 gene3
sample1 -1.0117361 -1.0117361 -1.0117361
sample2 -0.3372454 -0.3372454 -0.3372454
sample3 0.3372454 0.3372454 0.3372454
sample4 1.0117361 1.0117361 1.0117361
Advertisements
%d bloggers like this: