Archive | November 2013

MPI Tutorial for R (Rmpi)

In the previous two posts, I introduced MPI and showed how to install it for the R programming language. Rmpi provides the interface needed to use MPI for parallel computing in R. Rmpi is maintained by Hao Yu at the University of Western Ontario and has been around for about a decade. Although it doesn’t have every command found in the original MPI for C/Fortran, quite a few functions have been added over time, and it covers most of the basic functions needed for normal operation. The manual for Rmpi is provided here.

In this post, I am going to cover a few basic commands/functions for MPI using R.

Spawning Slave CPUs

In MPI terms, the master is the main CPU, which sends messages to dependent CPUs called slaves to complete tasks. When you spawn slaves using mpi.spawn.Rslaves(), it first gets the number of available CPUs from the default setting (which depends on your system). You can use the nslaves option to set the specific number of CPUs you want to use for MPI. You can ask for more slaves than the CPUs actually available in your system, but you will not get any benefit from doing so: R behaves as if it had that many CPUs, but the actual computation is still done by the available CPUs.


Let’s Execute a Command Using Slaves

There are several commands to execute code on the slaves; mpi.remote.exec() and mpi.bcast.cmd() are two examples. The syntax for mpi.remote.exec() is

>mpi.remote.exec(cmd, ..., simplify = TRUE, comm = 1, ret = TRUE)

where cmd is the command to be executed on the slaves, ... holds the arguments passed to cmd, simplify is a logical value indicating whether the results should be returned as a data frame if possible, comm is the communicator number (usually 1), and ret is a logical value indicating whether you want the results of the executed code returned from the slaves. If you use mpi.bcast.cmd() instead, the slaves will execute the command but return no values.

Let’s ask each slave to give back the slave number.

>mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))

$slave1
[1] "I am 1 of 11"

$slave2
[1] "I am 2 of 11"

........

$slave10
[1] "I am 10 of 11"

As you can see, mpi.comm.rank() gives the rank (number) of each slave CPU, and mpi.comm.size() gives the total number of processes in the communicator: here 11, the 10 spawned slaves plus the master. The diagram below shows how this command is executed.

[Figure: command execution]

Another example: asking each slave CPU to compute the sum of 1 through its rank.

> mpi.remote.exec(sum(1:mpi.comm.rank()))
  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1  1  3  6 10 15 21 28 36 45  55

Measure Time to Compute

To see whether your code needs to be parallelized, you can measure how long the task takes to compute. In R, proc.time() returns three values, and you can use the difference between two calls to determine the computing time.
1) user time: the CPU time charged for the execution of the user instructions of the calling process
2) system time: the CPU time charged for execution by the system on behalf of the calling process
3) elapsed time: the real (wall-clock) time since the R process was started
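
As a minimal sketch of this timing pattern (the helper name time_task and the workload are illustrative assumptions, not part of Rmpi), the difference between two proc.time() calls gives the time spent in between:

```r
# Hypothetical helper: time an arbitrary task with proc.time().
time_task <- function(task) {
  start  <- proc.time()           # snapshot before the work
  result <- task()                # run the work
  timing <- proc.time() - start   # user, system and elapsed seconds
  list(result = result, timing = timing)
}

# Illustrative workload: the mean of 100,000 random numbers, 20 times.
out <- time_task(function() replicate(20, mean(rnorm(100000))))
print(out$timing)
print(out$timing[["elapsed"]])    # wall-clock seconds for the task
```

Base R also provides system.time(expr), which wraps the same idea in a single call.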

Scalability Is Important

Increasing the number of CPUs doesn’t necessarily increase performance. The overhead, the extra time needed to access the CPUs, grows as more CPUs are used for parallel computing. Here is an example of the performance of a simple code that computes the mean of 1 million random numbers 400 times. The performance increases dramatically from 2 to 4 CPUs (I), then increases more slowly from 4 to 15 CPUs (II). Using more than 16 CPUs takes more time to compute than 15 CPUs (III), so there is no benefit in doing so. Note: this code was run over a sub-optimal interconnect network to show the effect of overhead. Results may vary depending on your system. Under optimal conditions, the time to compute should be halved if you double the number of CPUs.

library('Rmpi')
mpi.spawn.Rslaves(nslaves=4)
ptm<-proc.time() 
mpi.iparReplicate(400, mean(rnorm(1000000)))
print(proc.time() - ptm)

[Figure: performance versus number of CPUs]

If you use a large number of CPUs for computation, the overhead may significantly affect the overall performance, so it is important to test your scripts on different numbers of CPUs to find the optimal performance. The figure below shows the actual performance obtained by reserving 24 CPUs on a large computer cluster. All 24 CPUs share a high-speed interconnect network, so performance doubles when the number of CPUs is doubled (e.g. 3->6 or 4->8). However, using more than 10 CPUs brings no further benefit.

[Figure: optimal performance]
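
One way to judge scalability from your own measurements is to turn elapsed times into speedup and efficiency. The sketch below uses made-up timings purely for illustration (they are not the measurements discussed in this post); replace them with the times you record on your system.

```r
# Hypothetical elapsed times (seconds) for the same job on 1, 2, 4 and 8 CPUs.
# These numbers are invented for illustration only.
n_cpu   <- c(1, 2, 4, 8)
elapsed <- c(80, 42, 24, 18)

speedup    <- elapsed[1] / elapsed   # how many times faster than 1 CPU
efficiency <- speedup / n_cpu        # 1 would mean perfect scaling

print(data.frame(n_cpu, elapsed,
                 speedup    = round(speedup, 2),
                 efficiency = round(efficiency, 2)))
```

When efficiency keeps dropping as CPUs are added, the overhead is eating the gain, and a smaller number of CPUs is probably the better choice.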

The commands I covered in this post are all collective calls, which means that all slaves in a communicator are called for execution. I would like to cover more MPI commands for controlling individual slaves in another post.

There are three more commands to cover before finishing today’s post: mpi.finalize(), mpi.exit() and mpi.quit(). mpi.finalize() should be called to clean all MPI states at the end of the script. mpi.exit() not only calls mpi.finalize() but also detaches the Rmpi library. Once mpi.exit() has been called, you need to relaunch R before you can load the Rmpi library again. mpi.quit() will quit MPI and leave R altogether.
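
To put the pieces together, here is a minimal sketch of one complete session. It assumes Rmpi and a working MPI installation; the function name run_rmpi_session is hypothetical, and mpi.close.Rslaves() is an Rmpi function that was not covered above. The code is wrapped in a function so that nothing is spawned until you call it.

```r
# Minimal sketch of one Rmpi session (assumes Rmpi and MPI are installed).
run_rmpi_session <- function(n_slaves = 2) {
  if (!requireNamespace("Rmpi", quietly = TRUE)) {
    message("Rmpi is not installed; nothing to run.")
    return(invisible(NULL))
  }
  library(Rmpi)
  mpi.spawn.Rslaves(nslaves = n_slaves)       # start the slaves
  ranks <- mpi.remote.exec(mpi.comm.rank())   # collective call on all slaves
  mpi.close.Rslaves()                         # shut the slaves down
  ranks
}

# At the very end of a script, after the last MPI call, finish with ONE of:
#   mpi.finalize()  # clean up MPI states, stay in R
#   mpi.exit()      # also detach Rmpi (restart R to load it again)
#   mpi.quit()      # clean up MPI and quit R
```

Calling run_rmpi_session(4) in an R session with Rmpi available should return the rank reported by each of the four slaves.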

Installing Rmpi (MPI for R) on Mac and Windows

MPI is the message passing interface for parallel computing that I described in the previous post. MPI libraries are usually written in C or Fortran. Fortunately, you don’t need to know these programming languages to use MPI. In this post, I am going to show step by step how to install MPI for use in R.

Installing Rmpi on Mac

Rmpi is a package to run MPI in R. Assuming you have already installed R, this site tells you how to install Rmpi for Mac OS X. There are several steps.

1) Get Xcode, the Command Line Tools, and Homebrew. Homebrew lets you install non-Apple programs from the command line. If your system doesn’t have these, you will need to spend some time installing them. Make sure you see the message below before moving on to the next step.

Your system is ready to brew.

If you are running OS X 10.6.8 or earlier, the system may not be compatible with the most recent Xcode. In this case, I recommend upgrading your system to Mavericks, which is free (right now at least). Then you can install Xcode, followed by the Command Line Tools for Mavericks.

Once the Command Line Tools are installed, install Homebrew the same way as described there (steps 2 and 3).

2) Install gfortran and open-mpi

This part is pretty simple. Open a command line window, then type the following commands to install gfortran and Open MPI.

>brew install gfortran
>brew install open-mpi

3) Install the Rmpi package in R
First download Rmpi. Then install it in R64 (use the local package directory for installation).

> install.packages("Rmpi",type="source")
Installing package(s) into ‘/Library/Frameworks/R.framework/Versions/2.15/Resources/library’
……………………

If you have trouble compiling, your compiler may be obsolete. When I installed Mavericks on my old MacBook, my compiler couldn’t build Rmpi from the source package. I needed to install MacPorts for Mavericks and then re-install Open MPI.

If it still doesn’t work, try an earlier version of Rmpi (e.g. 0.6-3) from here. Place the tgz file on your desktop, then type “R” at the command line. Then type

>install.packages(file.choose(), repos = NULL, 
type = "source")

4) Load Rmpi in R

>library(Rmpi)

Installing Rmpi on Windows

Installation of Rmpi for Windows is straightforward once all requirements are met. It is necessary to install the MPICH2 program to host parallel computing.

1) Install MPICH2 from here (select unofficial binary package for windows)
2) Change the PATH
3) Start hosting MPI
4) Download Rmpi and install on R
5) Start using Rmpi

What I am showing here is very similar to this website, but I am adding a little more information for each step.
If the installation of MPICH2 is successful, you should see an MPICH2 folder in Program Files.

[Screenshot: MPICH2 installation on Windows]

2) To change the PATH for MPICH2, right-click “My Computer” and select “Properties”. Click “Advanced system settings” on the left. Click “Environment Variables”, highlight “PATH”, and click Edit. In the small window, add “;C:\Program Files\MPICH2\bin” to the end of the character string that is already there. Hit OK.

[Screenshot: environment variable change]

3) To start hosting, you need to run the following commands as administrator. To do this in Windows 7 or Vista, go to Run on the menu, type cmd, then press CTRL + SHIFT + ENTER. You will see a little window like this.

[Screenshot: administrator command line]

Then type the following command.

>smpd -install
>mpiexec -remove

It will remove existing MPICH2 (if any) and install the program. The -remove option will remove existing accounts and passwords.

[Screenshot: MPICH2 installation]

Now, register an account and password with the -register option.

>mpiexec -register

It will ask you to type an account name and password. You need a valid Windows user name and password; otherwise it will fail.
Then use the -validate option to test whether this process was successful. You should see SUCCESS.

4) Download the Rmpi package here. Select the most recent version at the bottom. Then go to R and install the package from the package archive (using the downloaded zip file). Click Browse and select the file you downloaded.

[Screenshot: installing Rmpi in R]

5) Load Rmpi package. Now it is ready to run MPI.

>library(Rmpi)

Testing If Rmpi Is Running Successfully

Let’s see whether you set up Rmpi correctly. Try typing the following:

> mpi.spawn.Rslaves()

It may take a few moments before it returns; however, if it takes too long, something went wrong during the installation process. On a desktop/laptop computer, the number of slaves spawned is the number of threads available on your machine.
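
If you want to know how many threads your machine reports before spawning, one option is the parallel package that ships with R. This is only a rough cross-check, under the assumption that your MPI installation uses its default settings; the number Rmpi actually spawns depends on how MPI is configured.

```r
# parallel ships with R; detectCores() reports logical cores (threads),
# or NA when the number cannot be determined.
library(parallel)
n_threads <- detectCores()
print(n_threads)
```

You can compare this number with the number of slaves reported by mpi.spawn.Rslaves().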

In the next post, I will discuss more about how to use the MPI installed on R.

Disclaimer: The installation process described here was tested on several computers, both Windows and Mac; however, I cannot guarantee that the process described above works on all computers. I am not responsible for anything that happens to your computer. Please back up all your data before you try.

Parallel Computing: Introduction to MPI

What is MPI?

MPI stands for message passing interface, which enables parallel computing by passing messages between multiple processors. Basically, MPI is a library of routines, usually written in C or Fortran, that makes it possible to run a program on multiple processors. However, there are several memory and multi-CPU architectures. Most desktop/laptop computers these days are multi-core (meaning multiple CPUs) with shared memory.

[Figure: shared-memory model]

In this model, each CPU has access to the shared memory, so you can place a data set in the shared memory and divide the work among multiple CPUs. To run a program on this kind of shared-memory model, you can use OpenMP (which is different from OpenMPI). I am not going to discuss OpenMP here, but maybe in a future post.

Another type of CPU-memory architecture is the distributed-memory model. In this model, each CPU has its own memory, and other CPUs cannot access it directly.

[Figure: distributed-memory model]

Advantages of the distributed-memory model are

1) CPUs don’t have to race; no waiting or synchronization is necessary.
2) Memory addresses can be unified, which makes it easier to keep track of the address space.
3) It is easier to design the machine.

However, cluster computers are designed more like a hybrid structure: each node has a shared-memory structure, but memory is not shared or accessible between nodes.

[Figure: hybrid model]

Since I am more interested in high-speed computing using cluster computers, MPI is the way to go for implementing parallel computing.

What is OpenMPI?

MPI was originally developed by researchers from both academia and industry to standardize a portable message passing system. The OpenMPI project is an open-source, freely available implementation for the distributed-memory model, and its software is completely free to use (it is distributed under a BSD license, which also permits commercial use)!

Other MPIs

There were three earlier MPI implementations developed by different groups: FT-MPI by the University of Tennessee, LA-MPI by Los Alamos National Laboratory and LAM/MPI by Indiana University. Each had its own unique features. OpenMPI evolved by taking the best of each; it is now updated much more frequently than these three and has become a standard implementation of MPI.

How can I use it?

MPI is written in C or Fortran, and its library is made up of ~200 routines. Fortunately, the library can be used from many languages such as C/C++, Fortran, Java, Python, MATLAB and R. The details of the implementation are described on the OpenMPI website @ http://www.open-mpi.org/

Simply download MPI for your programming language and install it on your computer. You need compilers for C/C++ or Fortran. If you are using Mac or Linux, simply run configure, make and make install. For Windows, use Cygwin and install as in a Linux environment.

It is best to install it on your local computer (desktop/laptop) and test your code there before using a cluster, because you can execute MPI code for multiple threads even on a single-core computer. Debugging can be more straightforward.

In the next post, I will demonstrate installation and running MPI using R.

Regular Expression Tutorial 2: Commands in R

The second part of the regular expression tutorial covers commands commonly used with regular expressions in R. Once you know how to write a regular expression to match a string, you may want to manipulate strings, for example by deleting or replacing parts of them. Here is the list of string matching & manipulation commands commonly used with regular expressions in R. Similar commands also appear in many other languages.

Command          Function
grep( )          Return index of the object 
                 where reg exp found the string
grepl( )         Return logical values for reg exp 
                 matching 
regexpr( )       Return the first position of found
                 string by reg exp
gregexpr( )      Return all positions of found string
                 by regexp
sub( )           Substitute a pattern with a given string
                 (first occurrence only)
gsub( )          Globally substitute a pattern with a 
                 given string (all occurrences) 
substr( )        Return the substring at the given
                 character positions (start and stop)
                 in a given string
strsplit( )      Split the input string into parts
                 based on another string (character)
regexec( )       Return the first position of matched
                 pattern in a given string
regmatches( )    Extract or replace matched substrings
                 from match data obtained by regexpr,
                 gregexpr, or regexec

Find & Display Matching string: grep

grep(pattern,vector) 
>x<-c("abc","bcd","cde","def")
>grep("bc",x)
[1] 1 2

The first one is the grep() command, which was originally created for Unix systems. Its name comes from “globally search a regular expression and print”. You can see that “bc” appears in the first two entries of x. The grep() function returns the indexes of the matching entries. If you want to show the matched entries themselves (not their indexes), use the value option or square brackets.

>grep("bc",x,value=TRUE)
[1] "abc" "bcd"
>x[grep("bc",x)] 
[1] "abc" "bcd"

Show Matched Pattern Using Find & Replace

If you want to get only the matched pattern, it is kind of awkward, but you can take the output above and remove the unmatched part (in Linux, you would just use grep -o).

First, the syntax of the sub function is

sub("matching_string","replacing_string", input_vector)

This function works like “find and replace”. Use it to remove the unmatched part.

> sub(".*(bc).*","\\1",grep("bc",x,value=TRUE))
[1] "bc" "bc"

Remember that .* matches any characters of any length and \\1 refers to the string matched inside the first pair of parentheses. In this case, you see only “bc”, but if you use a regular expression for the pattern, you will see the different kinds of matches found in the strings.
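
As a small illustration with an actual regular expression, the sketch below pulls a four-digit year out of file names. The file names are made up for this example.

```r
# Illustrative input (invented for this example): file names with a year.
files <- c("report_2011.txt", "notes.txt", "data_2013.csv")

# Keep only the entries containing four consecutive digits, then replace
# the whole string with the captured group.
hits  <- grep("[0-9]{4}", files, value = TRUE)
years <- sub(".*([0-9]{4}).*", "\\1", hits)
print(years)
```

Because grep() filters first, the entry without a year never reaches sub(), which would otherwise return it unchanged.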

Remove Matched String

If you want to return indexes of unmatched string, add invert option.

> grep("bc",x,invert=TRUE)
[1] 3 4

Combining this with the value option, you can remove the matched strings from the vector.

> grep("bc",x,invert=TRUE, value=TRUE)
[1] "cde" "def"

If the search is not case sensitive,

> grep("BC",x,ignore.case=TRUE)
[1] 1 2

If you want to get logical returns for matches,

> grepl("bc",x)
[1]  TRUE  TRUE FALSE FALSE

Manipulating String with Matched String Position

To get the first position of the matched pattern in the string, regexpr() is used.

>y<-"Waikiki"
>regexpr("ki",y)
[1] 4
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE

Since the first match occurs at 4th character in y, the first value returned is 4. If there is no match it will return -1.

If you want to get this value only,

> regexpr("ki",y)[1]
[1] 4

You can see that regexpr() returns two attributes, “match.length” and “useBytes”. These values can be accessed by

> attr(regexpr("ki",y),"match.length")
[1] 2
> attr(regexpr("ki",y),"useBytes")
[1] TRUE
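
regexpr() also works on a whole vector at once, returning one position per element and -1 where there is no match. A minimal sketch, using an illustrative vector that is not part of the examples above:

```r
# Illustrative vector (invented for this example).
places <- c("Waikiki", "Kailua", "Waimanalo")

pos <- as.integer(regexpr("ki", places))   # as.integer() drops the attributes
print(pos)

# Keep only the entries in which the pattern was found.
print(places[pos > 0])
```

Only "Waikiki" contains "ki", so the other two entries get -1 and are filtered out.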

If you want to get positions for all matches, use gregexpr()

> gregexpr("ki",y)
[[1]]
[1] 4 6
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE

To show only the position values, you can index the result using the length function. It is a bit awkward but can be done.

>z<-gregexpr("ki",y)
> z[[1]][1:length(z[[1]])]
[1] 4 6
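
A shorter alternative, offered here as a suggestion rather than the only way, is to strip the attributes with as.integer():

```r
y <- "Waikiki"
z <- gregexpr("ki", y)

# as.integer() returns the bare positions without the match.length and
# useBytes attributes.
positions <- as.integer(z[[1]])
print(positions)
```

The double brackets are still needed because gregexpr() returns a list with one element per input string.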

The regexec() command works very similarly to regexpr(); however, if the pattern contains parenthesized sub-expressions, it returns the position of the whole match as well as the positions of the parenthesized matches.

> regexec("kik",y)
[[1]]
[1] 4
attr(,"match.length")
[1] 3
> regexec("k(ik)",y)
[[1]]
[1] 4 5
attr(,"match.length")
[1] 3 2

To extract a substring from an input string, use substr()

substr(x,start, end)
>x<-"abcdef" 
>substr(x,3,5)
[1] "cde"

This function can also replace a substring in a string.

>substr(x,3,4)<-"XX"
>x
[1] "abXXef"

Another Way to Show Matched Strings Using regmatches()

I showed one way to list the matched strings using sub() and grep(); you can do the same thing with regmatches() together with regexpr(), gregexpr() or regexec().
First, regexpr() or gregexpr() gives you the positions and lengths of the matched strings in the input, and you pass this information on to regmatches(), which extracts the matched strings from the input string. With gregexpr() you get all the matches; with regexec() you get both the matched substring and the substrings matched inside the parentheses.

> a<-"Mississippi contains a palindrome ississi."
> b<-gregexpr(".(ss)",a)
> c<-regexec(".(ss)",a)

> regmatches(a,b)
[[1]]
[1] "iss" "iss" "iss" "iss"

> regmatches(a,c)
[[1]]
[1] "iss" "ss"

The syntax of regmatches() is

regmatches(input, position&length)

Therefore, the position and length information of the matched strings obtained from gregexpr() or regexec() is used to extract the matched strings from the input. Note that regexec() takes only the first match, so you see only “iss” and “ss”.
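
The parenthesized matches from regexec() are handy for pulling a string apart into named pieces. A minimal sketch, with an input string invented for illustration:

```r
# Illustrative input (invented for this example).
entry <- "date=11/03/2013"

# Two parenthesized groups: the key before "=" and the value after it.
m     <- regexec("([a-z]+)=(.*)", entry)
parts <- regmatches(entry, m)[[1]]
print(parts)

key   <- parts[2]   # first parenthesized group
value <- parts[3]   # second parenthesized group
```

The first element is always the whole match, so the parenthesized groups start at index 2.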

Split Strings with Common Separator Using strsplit Function

Suppose you have a date string “11/03/2013” and want to extract the numbers “11”, “03” and “2013”. Since the numbers are separated by the common character “/”, you can use the strsplit function to do the job.

> strsplit("11/03/2013","/")
[[1]]
[1] "11"   "03"   "2013"

If you use “” for separator you can extract each character.

> strsplit("11/03/2013","")
[[1]]
 [1] "1" "1" "/" "0" "3" "/" "2" "0" "1" "3"

One thing to remember is that when the string starts with a separator, strsplit puts an empty string first in the vector.

> strsplit(".a.b.c","\\.")
[[1]]
[1] ""  "a" "b" "c"

If a dot (.) is the separator, you need to escape it with two backslashes, because an unescaped dot is a regular expression that matches any character.
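
Two more things worth knowing, sketched below with inputs invented for illustration: strsplit() accepts a whole vector of strings, and fixed = TRUE makes it treat the separator literally so that no escaping is needed.

```r
# Illustrative inputs (invented for this example).
dates <- c("11/03/2013", "11/23/2013")

# strsplit() is vectorized: one list element per input string.
parts <- strsplit(dates, "/")

# Stack the pieces into a character matrix with one row per date.
mat <- do.call(rbind, parts)
print(mat)

# fixed = TRUE treats "." as a plain character, not a regular expression.
letters_only <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
print(letters_only)
```

Note that the pieces are still character strings; use as.integer() if you need numbers.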
