Archive | R RSS for this section

My favorite commands Part2: Data reduction using ddply

In the previous post, I showed how to split the cell by a delimiter and copy other cells to expand the spread sheet. Sometimes you want to do the opposite, which is to reduce the number of rows which contain common IDs. For example, I have a spreadsheet that contains gene ID and length and GC content of 3’UTR for rat. As you know there are many genes with various 3’UTRs, therefore you have multiple rows for each gene. Now I may want to make a summary of 3’UTR length and GC content by averaging. Maybe I want to create a column with ID, min length, avg length, maximum length and so on. In this post, I am going to show you a simple example of how to reduce data by applying function to multiple rows with common IDs.

Screen Shot 2014-07-27 at 8.45.40 AM

You see that first column entrez ID is an identifier which appears multiple times in the data. The data were sorted, but it doesn’t have to be. The goal of simple task is to create a new data frame with ID, average of length (2nd column) and GC content (3rd column).

Screen Shot 2014-07-27 at 8.45.45 AM

How do you accomplish this task? There is a very convenient command in R called ddply. Let’s install and load package called plyr.

>install.packages("plyr")
>library(plyr)

The help page of ddply command shows

ddply(.data, .variables, .fun = NULL, ..., .progress =
 "none",.inform = FALSE, .drop = TRUE, .parallel = FALSE,
 .paropts = NULL)

They key is the first three arguments. The first one is the dataframe you want to  work with. The second one is the column name to get rows with the same information. In the example above, that would be “entrez ID”. The third argument, function, is the most critical one. The command will first look at the top cell of the column specified by the second parameter. Then it will find all rows with the same information. For example, assumethe following  ddply function is called,

>avgdata<-ddply(data1,.(entrezID),calavg)

If data1 is the dataframe of the example on the top of page, ddply will grab the first entry in the second argument “entreZID”, which is “24244”. Then it will look for other rows that also contain “24244” in entrezID. In this example, there are two rows with the same “entrezID”, it will send a dataframe with theses two rows to the function of the third argument. Screen Shot 2014-07-27 at 4.02.09 PM

So you need to create a function which takes a dataframe like this  and returns a dataframe with average data below. Screen Shot 2014-07-27 at 4.04.24 PM

Let’s create a function called calavg

calavg<-function(x){
   avg<-apply(x[,2:3],2,mean)
   return(avg)
}

This simple function takes a dataframe and applies column-wide  mean function. The results are stored in a dataframe “avg”, which will be returned upon completion. So once you call ddply script, you should get:

> ddply(data1,.(entrezID),calavg)
entrezID length_3utr GC_3utr
1 24244 1789.5000 0.4593695
2 24346 157.0000 0.4076433
3 24356 3394.5000 0.4298199
4 100912619 114.3333 0.4249794

Writing a proper function is the most critical and it is not hard to implement more complicated methods for data reduction. For those who want to explore more about ddply, please refer below.

http://seananderson.ca/courses/12-plyr/plyr_2012.pdf

Rmpi Tutorial 4: Getting Data Back From Slaves

This is going to be the last tutorial for Rmpi. In this post I am going to cover how to receive data from slaves in Rmpi. Let’s think of a situation in the picture below. You want to gather data from slaves and combine them with the data in the master.

mpi_gather_Robj

mpi.gather.Robj() will retrieve data from each slave and put them together like the diagram above.  Let’s try some examples

Example 1: Getting slave number information from each slave

library('Rmpi')
mpi.spawn.Rslaves(nslaves=3)
mpi.bcast.cmd(id<-mpi.comm.rank())
mpi.bcast.cmd(x<-paste("I am slave no.",id))
mpi.bcast.cmd(mpi.gather.Robj(x))
x<-"I am a master"
mpi.gather.Robj(x)
mpi.remote.exec(x)
mpi.close.Rslaves()

Here is the output (showing only the last part)

> mpi.gather.Robj(x)
[1] "I am a master"    "I am slave no. 1" 
"I am slave no. 2" "I am slave no. 3"
> mpi.remote.exec(x)
$slave1
[1] "I am slave no. 1"
$slave2
[1] "I am slave no. 2"
$slave3
[1] "I am slave no. 3"

If you want to retrieve data from each slave and give the whole data to all slaves, you use mpi.allgather.Robj().
mpi_allgather_Robj

Example 2: Send a string “fruit” to master” and “apple”, “banana” and “orange” to slave 1 to 3. Then retrieve data from each slave and send the all data to master and all slaves.

library('Rmpi')
mpi.spawn.Rslaves(nslaves=3)
x<-c("fruits","apple","banana","orange")
mpi.bcast.cmd(x<-mpi.scatter.Robj())
x<-mpi.scatter.Robj(x)
mpi.remote.exec(x)
mpi.bcast.cmd(x<-mpi.allgather.Robj(x))
mpi.allgather.Robj(x)
mpi.remote.exec(x)
mpi.close.Rslave()

Here is the output

> mpi.remote.exec(x) 
$slave1
[1] "apple" 
$slave2 
[1] "banana" 
$slave3 
[1] "orange" 
....
>mpi.allgather.Robj(x)
[1] "fruits" "apple" "banana" "orange" 
>mpi.remote.exec(x) 
$slave1 
[1] "fruits" "apple" "banana" "orange" 
$slave2 
[1] "fruits" "apple" "banana" "orange" 
$slave3 
[1] "fruits" "apple" "banana" "orange"

mpi.reduce and mpi.allreduce Will Reduce Data By Simple Operation

mpi_reduce_maxloc_minloc

mpi.reduce command examines a variable in the slaves & the master, do simple operation such as finding minimum or maximum value then return the value. The variable needs to exist in every slave including master, the returned value is a single value. In order for it to work, you need to call this command from all slaves and master, otherwise it will go to infinite loop.

Example 3: Set a value of x to 1 in the master, 2,3, & 4 in slave 1, 2, & 3.  Then using mpi.reduce to return a sum of all x.

library('Rmpi')
mpi.spawn.Rslaves(nslave=3)
# Define function for reduction
red<-function(option="sum"){
    mpi.reduce(x,type=2,op=option)
}
# Send a function to all slaves
mpi.bcast.Robj2slave(red)
# Set object x and send to slaves
x<-c(1,2,3,4)
mpi.bcast.cmd(x<-mpi.scatter.Robj())
x<-mpi.scatter.Robj(x)
mpi.remote.exec(x)
# call the function in slaves
mpi.remote.exec(red("sum"))
# call the same function in master
mpi.reduce(x,2,"sum")
mpi.close.Rslaves()

Here is the output

> mpi.remote.exec(red("sum"))
  X1 X2 X3
1  2  3  4
> mpi.reduce(x,2,"sum")
[1] 10

If you use mpi.allreduce, it will send the final value to all slaves. There are two more options in mpi.reduce and they are maxloc and minloc. If you use these options, the command will return two values, the value resulting from the operation (either minimum or maximum) and the location of the value. This can be useful to find the slave which provides the value.

> mpi.reduce(x,2,"maxloc")
[1] 4 3
> mpi.reduce(x,2,"minloc")
[1] 1 0

Note: the rank for the master is 0

Rmpi Tutorial 2: Sending Data

In MPI, there are a number of commands for sending data. It is important to meet the requirement when you send data to slaves, otherwise R will crash. I am going to cover a few important ones in this post.

Sending Identical Data to Slave CPUs

For large scale data analysis, you may want to divide the work to slave CPUs. These CPUs may need to receive some constant values that are necessary for downstream computation. To send an object to each slave, you can use mpi.bcast.Robj() command.

Example: Send a character string to 3 slaves

1. Spawn 3 slaves and create an variable x in each slave.

>mpi.spawn.Rslaves(nslaves=3)
>mpi.bcast.cmd(x<-mpi.bcast.Robj())
>x<-c("This is a test.")

2. Send x to each slave

>mpi.bcast.Robj(x)

3. Print x in each slave

>mpi.remote.exec(x)
$slave1
[1] "This is a test."
$slave2
[1] "This is a test."
$slave3
[1] "This is a test."

4. Close mpi

>mpi.close.Rslaves()

Sending Non-identical Data to Slave CPUs

When you have a large set of data, first you divide the data and put them in a list, then send the list to each slave. There are two commands mpi.scatter.Robj() and mpi.scatter.Robj2slave() to send list to slave CPUs.

When you use these commands, you need to have exactly the same number of object potions as the number of slave CPUs. If not, you will get an error message. For example, if you spawn 3 slave CPUs, and your object to send is a list of 4, mpi.scatter.Robj2slave() will not work because you don’t have the equal numbers. However, mpi.scatter.Rbj() will work because master plus slaves equal to the number of list in the object.

Example 1: Split and send an object of list of 4 to master and slaves

 mpiscatterrobj

This code is very similar to the one above except that x is a list of character.

>mpi.spawn.Rslaves(nslaves=3)
>x<-c("This","is","an","example")
>mpi.bcast.cmd(x<-mpi.scatter.Robj())
>mpi.scatter.Robj(x)
[1] "This"
>mpi.remote.exec(x)
>$slave1
[1] "is"
$slave2
[1] "an"
$slave3
[1] "example"
>mpi.close.Rslaves()

Note that master receive object x but it does not overwrite existing x.

Example 2: Divide 8×4 matrix into 4 blocks and send to slaves

mpiscatterrobj2
1. Spawn 4 slave CPUs, and create a 8×4 matrix with random numbers

>mpi.spawn.Rslaves(nslaves=4)
>mat<-matrix(rnorm(32),8)
>mat
           [,1]       [,2]       [,3]       [,4]
[1,] -0.3718508  0.8075626 -0.1145767  1.2152244
[2,]  1.2414776 -1.7983161  0.8113792 -0.0577753
[3,]  0.2291987 -1.8194346  0.5902288 -0.5519079
[4,] -0.6056088  0.7028118 -1.0299552  1.4069104
[5,]  0.5006542  1.2469203 -1.5266182 -0.2712369
[6,]  0.9899981  1.0211666 -1.0916166  0.9721620
[7,] -1.6689545  0.2618148 -1.0774920 -0.4962599
[8,] -0.3919911  0.1678641  0.5198690  0.7932334

2. Split the matrix into 4 of 2×4 matrices

>smat<-lapply(.splitIndices(nrow(mat),4),function(i)
mat[i,])
>smat
[[1]]
           [,1]       [,2]       [,3]       [,4]
[1,] -0.3718508  0.8075626 -0.1145767  1.2152244
[2,]  1.2414776 -1.7983161  0.8113792 -0.0577753

[[2]]
           [,1]       [,2]       [,3]       [,4]
[1,]  0.2291987 -1.8194346  0.5902288 -0.5519079
[2,] -0.6056088  0.7028118 -1.0299552  1.4069104

[[3]]
          [,1]     [,2]      [,3]       [,4]
[1,] 0.5006542 1.246920 -1.526618 -0.2712369
[2,] 0.9899981 1.021167 -1.091617  0.9721620

[[4]]
           [,1]      [,2]      [,3]       [,4]
[1,] -1.6689545 0.2618148 -1.077492 -0.4962599
[2,] -0.3919911 0.1678641  0.519869  0.7932334

3. Send each matrix to slave CPUs

> mpi.scatter.Robj2slave(smat)
> mpi.remote.exec(smat)
$slave1
[,1] [,2] [,3] [,4]
[1,] -0.3718508 0.8075626 -0.1145767 1.2152244
[2,] 1.2414776 -1.7983161 0.8113792 -0.0577753

$slave2
[,1] [,2] [,3] [,4]
[1,] 0.2291987 -1.8194346 0.5902288 -0.5519079
[2,] -0.6056088 0.7028118 -1.0299552 1.4069104

$slave3
[,1] [,2] [,3] [,4]
[1,] 0.5006542 1.246920 -1.526618 -0.2712369
[2,] 0.9899981 1.021167 -1.091617 0.9721620

$slave4
[,1] [,2] [,3] [,4]
[1,] -1.6689545 0.2618148 -1.077492 -0.4962599
[2,] -0.3919911 0.1678641 0.519869 0.7932334

4. Close mpi

>mpi.close.Rslaves()

Note: The results may vary as the seed was not set for this code.

MPI Tutorial for R (Rmpi)

In the previous two posts, I introduced what MPI is and how to install MPI for R programing language. Rmpi provides an interface necessary to use MPI for parallel computing using R. Rmpi is maintained by Hao Yu at University of Western Ontario and it has been around for about a decade now. Although it doesn’t have all commands found in original MPI for C/Fortran, quite a few functions have been added and it has most of basic functions for normal operations. The manual for Rmpi is provided here.

In this post, I am going to cover a few basic commands/functions for MPI using R.

Spawning Slave CPUs

In MPI term, master is the main CPU that sends messages to dependent CPUs called slaves to complete some tasks . When you spawn slaves using mpi.spawn.Rslaves(), first it gets the number of available CPUs by default setting (depending on your system). You can use nslave option to define the specific number of CPUs you want to use for MPI. You can use higher number than actual CPUs available in your system, but you will not get any benefit from doing it. It behaves as if it has the number of CPUs, but actual computation is done by available CPUs.

Screen Shot 2013-11-23 at 11.48.54 AM

Lets Execute A Command Using Slaves

There are several commands to execute codes in slaves. mpi.remote.exec() and mpi.bcast.cmd() are examples. The syntax for mpi.remote.exec() is

>mpi.remote.exec(cmd, …, simplify = TRUE, comm =1, ret =TRUE)

where cmd is a command to be executed on slaves, … is used as argument which will be used for the cmd, simplify is logical argument whether the results to be a dataframe if possible, comm is a communication number (usually 1), and ret is the logical value whether if you want results from executed code from slaves.  If you use  mpi.bcast.cmd() command to execute the following code, the slaves will execute the command but there will be no return values from them.

Let’s ask each slave to give back the slave number.

>mpi.remote.exec(paste("I am",mpi
.comm.rank(),"of",mpi.comm.size()))

$slave1
[1] "I am 1 of 11"

$slave2
[1] "I am 2 of 11"

........

$slave10
[1] "I am 10 of 11"

As you can see mpi.comm.rank() and mpi.comm.size() give the slave CPU number and total size of spawned slaves. The diagram below shows how this command is executed.

command-execution
Another example: asking each slave CPUs to compute the sum of 1 through their rank.

> mpi.remote.exec(sum(1:mpi.comm.rank()))
  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1  1  3  6 10 15 21 28 36 45  55

Measure Time to Compute

To see if your codes need to be paralleled, one can measure the time to compute the task. In R, proc.time() command returns three values and you can use this function to determine the time to compute.
1) user time: the CPU time charged for the execution of the user instructions of the calling process
2) system time: the CPU time charged for execution by the system on behalf of the calling process
3) elapsed time: the time since you logged in current account

Scalability Is Important

Increasing the number of CPUs doesn’t necessarily increase the performance. The overhead, an extra time needed to access the CPUs will increase with more CPUs for parallel computing.  Here is an example of the performance  of a simple code which computes the mean of 1million random numbers for 400 times. The performance increases dramatically from 2 to 4 CPUs (I), then the performance increases more slowly from 4 to 15 CPUs (II). Using More than 16 CPUs  takes more time to compute than 15 CPUs (III) and had no benefit of  doing so. Note: this code was run under sub-optimal interconnect network to show the effect of overhead. Results may vary dependent on your system. Under optimal condition, the time to compute should be halved if you double the number of CPUs.

library('Rmpi')
mpi.spawn.Rslaves(nslaves=4)
ptm<-proc.time() 
mpi.iparReplicate(400, mean(rnorm(1000000)))
print(proc.time() - ptm)

performance_n_CPU

If you use a large number of CPUs for computation, the overhead may significantly affect the overall performance.  So it is important to test your scripts on different numbers of CPUs for the optimal performance. The figure below is the actual performance by reserving 24 CPUs from a large computer cluster. All 24CPUs have high speed interconnect network, therefore performance doubles when number of CPUs doubled (e.g. 3->6 or 4->8). However using more than 10CPUs has no benefit of doing so.

optimal_performance

The commands I covered in this posts are all corrective call, which means that all slaves in a communicator are called for execution. I would like to cover more MPI commands to control individual slave in another post. 

There are three more commands before finishing today’s post. These are mpi.finalize(), mpi.exit() and mpi.quit(). mpi.finalize() should be called to clean all MPI states at the end of the script.  mpi.exit() will not only call mpi.finilize() but also detach the Rmpi library. If mpi.exit() is called, you need to relaunch R to load Rmpi library. mpi.quit() will quit MPI and leave R altogether.

Regular Expression Tutorial 2: Commands in R

The second part of the tutorial for regular expression will cover common commands used in R together with regular expression. Once you know how to write a regular expression to match a string, you may want to manipulate strings such as deletion or replacing. Here is the list of string matching &manipulation commands commonly used with regular expressions in R. These commands also appear in many other languages.

Command          Function
grep( )          Return index of the object 
                 where reg exp found the string
grepl( )         Return logical values for reg exp 
                 matching 
regexpr( )       Return the first position of found
                 string by reg exp
gregexpr( )      Return all positions of found string
                 by regexp
sub( )           Substitute a pattern with a given string
                 (first occurrence only)
gsub( )          Globally substitute a pattern with a 
                 given string (all occurrences) 
substr( )        Return the substring in the giving 
                 character positions (start and stop)
                 in given string
strsplit( )      Split the input string into parts 
                 based on another string (character)
regexec( )       Return the first position of matched 
                 pattern in a given string
regmatches ( )   Extract or replace matched substrings
                 from match data obtained by gregexpr,
                 or regexec

Find & Display Matching string: grep

grep(pattern,vector) 
>x<-c("abc","bcd","cde","def")
>grep("bc",x)
[1] 1 2

The first one is grep() command, which was originally created in Unix system. Its name came from globally search a regular expression and print. You see “bc” appears in the first two entries of x. grep() function returns indexes of the matched string. If you want to show the matched entries (not index),  use value option  or  use square brackets.

>grep("bc",x,value=TRUE)
[1] "abc" "bcd"
>x[grep("bc",x)] 
[1] "abc" "bcd"

Show Matched Pattern Using Find & Replace

If you want to get only the matched pattern, it is kind of awkward but you can use the output above and remove the unmatched part (In linux, you just use grep -o).

First, sub function’s syntax is

sub("matching_string","replacing_string", input_vector)

This function works like “find and replace”. Using this to remove unmatched part.

> sub(".*(bc).*","\\1",grep("bc",x,value=TRUE))
[1] "bc" "bc"

Remember .* means any character with any length and \\1 means the matched string in the first parenthesis. In this case, you see only “bc”, but if you use regular expression for pattern, you will see different kind of matches found in the string.

Remove Matched String

If you want to return indexes of unmatched string, add invert option.

> grep("bc",x,invert=TRUE)
[1] 3 4

Combining with value option, you can remove matched string from the vector

> grep("bc",x,invert=TRUE, value=TRUE)
[1] "cde" "def"

If the search is not case sensitive,

> grep("BC",x,ignore.case=TRUE)
[1] 1 2

If you want to get logical returns for matches,

> grepl("bc",x)
[1]  TRUE  TRUE FALSE FALSE

Manipulating String with Matched String Position

To get the first position of the matched pattern in the string, regexpr() is used.

>y<-"Waikiki"
>regexpr("ki",y)
[1] 4
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE

Since the first match occurs at 4th character in y, the first value returned is 4. If there is no match it will return -1.

If you want to get this value only,

> regexpr("ki",y)[1]
[1] 4

You see that regexpr() returns two attributes “match.length” and “useBytes”. These value can be accessed by

> attr(regexpr("ki",y),"match.length")
[1] 2
> attr(regexpr("ki",y),"useBytes")
[1] TRUE

If you want to get positions for all matches, use gregexpr()

> gregexpr("ki",y)
[[1]]
[1] 4 6
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE

To show the only values of positions, you need to use length function. It is a bit awkward but can be done.

>z<-gregexpr("ki",y)
> z[[1]][1:length(z[[1]])]
[1] 4 6

regexec() command works very similarly to regexpr(), however if there is parenthesized matching conditions, it will show both matched string position and the position of parenthesized matched string.

> regexec("kik",y)
[[1]]
[1] 4
attr(,"match.length")
[1] 3
> regexec("k(ik)",y)
[[1]]
[1] 4 5
attr(,"match.length")
[1] 3 2

To extract a substring from an input string, use substr()

substr(x,start, end)
>x<-"abcdef" 
>substr(x,3,5)
[1] "cde"

This function can also replace a substring in a string.

>substr(x,3,4)<-"XX
[1] "abXXef"

Another Way to Show Matched Strings Using regmatches()

I showed one way to list the matched string using sub() and grep() , you can do the same thing with regmatches together with regexpr() or regexec().
First, regexpr() gives you the position of the found string and the length of the mtached string in the input, you pass this information on to regmatches().  It will show all the matched strings from the input string. regexec() will show both matched substrings and matched substrings in the parenthesis.

> a<-"Mississippi contains a palindrome ississi."
> b<-gregexpr(".(ss)",a)
> c<-regexec(".(ss)",a)

> regmatches(a,b)
[[1]]
[1] "iss" "iss" "iss" "iss"

> regmatches(a,c)
[[1]]
[1] "iss" "ss"

The syntax of regmatches() is

regmatches(input, position&length)

Therefore, if you put position and length information of matched strings obtained from either gregexpr() or regexec() will be used to extract the matched string from the input. Note that regexec takes only the first match, you see only “iss” and “ss”.

Split Strings with Common Separator Using strplit Function

Suppose you have a date string “11/03/2031” and want to extract the numbers “11”, “03” and “2013”. Since the numbers are separated by the common character “/”, you can use strsplit function to do the job.

> strsplit("11/03/2013","/")
[[1]]
[1] "11"   "03"   "2013"

If you use “” for separator you can extract each character.

> strsplit("11/03/2013","")
[[1]]
 [1] "1" "1" "/" "0" "3" "/" "2" "0" "1" "3"

One thing you want to remember is when string starts with a separator, strsplit puts an empty character in the vector first.

> strsplit(".a.b.c","\\.")
[[1]]
[1] ""  "a" "b" "c"

If dot (.) is a separator, you need two backslashes for regular expression.

%d bloggers like this: