Archive | File Transfer

Transfer Large Files from the Web to a Cluster

When you find large files on the web and need to transfer them to a computer cluster, what do you usually do? You can download the files to your desktop first, then transfer them to the cluster using Globus online. It works, but it is a cumbersome two-step transfer that requires space on your desktop, and you need to delete the local copies afterwards. In this post, I am going to present two methods to simplify this process.

Mount Your Cluster Account on Mac Desktop

Install FUSE plug-in

We are going to use SSHFS to mount drives. To do so, you need both the FUSE plug-in (OSXFUSE) and SSHFS. Both programs can be obtained here. Once you have installed them, you will see an icon for OSXFUSE in the System Preferences window.

[Image: OSXFUSE icon in System Preferences]

Create a Directory to Mount the Drive

You can create a directory anywhere you want. Let’s create a folder called “Cluster” on your desktop.

[Image: the Cluster folder on the desktop]

Run a Command in Bash

Open terminal, and type

>sshfs -h

If the plug-in is properly installed, you should see

general options:
-o opt,[opt...] mount options
-h --help print help
-V --version print version
.....
.....

no mount point

Then type the following to mount

>sshfs your_cluster_account@abcd.edu:/path/to/cluster/directory /path/to/Desktop/directory

Make sure you have the right path for both the cluster and the local directory. If all is correct, you will be prompted for a password.

If the drive is successfully mounted, the folder you created on the desktop will change its appearance.

[Image: the Cluster folder after mounting]

Caution: Always mount the drive on an empty folder. Anything already in the folder is hidden and cannot be accessed while another drive is mounted on it. For example, if you mount a drive on Mac HD, you will likely lose access to most of your files and crash the system.
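
The empty-folder rule can be enforced with a small wrapper around sshfs. This is only a sketch: is_empty_dir and mount_if_empty are hypothetical helper names, not part of SSHFS, and the account and paths are the same placeholders used above.

```shell
#!/usr/bin/env bash

# Succeeds only if $1 is an existing directory with no entries
# (hidden files count as entries).
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Mount $1 (remote spec) on $2 (local folder), but only if the folder is empty.
mount_if_empty() {
    local remote="$1" mount_point="$2"
    if ! is_empty_dir "$mount_point"; then
        echo "Refusing to mount: $mount_point is missing or not empty" >&2
        return 1
    fi
    sshfs "$remote" "$mount_point"
}
```

After loading these functions into your shell, you would call mount_if_empty your_cluster_account@abcd.edu:/path/to/cluster/directory /path/to/Desktop/directory. Note that a hidden file in the folder (for example .DS_Store) also makes the check fail.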

Try Saving Something

Now save the file you want by clicking “Save Link As…” in your browser and selecting the Cluster folder on your desktop. The file will be saved directly to your cluster account. The image below is an example of downloading compressed fastq files from the Illumina website.

[Image: saving a file to the Cluster folder]

Remember to mount the drive on an empty directory; files already in the directory are not accessible while the drive is mounted. If you accidentally mount in the wrong place, you will need to unmount the drive.

Unmount the Drive

If you mounted the drive on the wrong directory, or the mounted drive stops responding, you can unmount the drive and remount it. To unmount, type

 >umount /path/to/the/mounted/directory

Alternatively, Use a Web Browser on Your Cluster to Save

It is not uncommon for Unix and Unix-like clusters to have web browsers these days. I don’t usually use them to browse the internet because they are rather slow. However, saving files from the web directly to your cluster account this way requires no software installation or disk mounting. Some people may prefer this.

Ask Your Admin Which Browser Is Available

There are a few common web browsers for Linux systems. Here is a list of top 10 browsers. At my institution, Firefox is available on the cluster. To use it, I log in using ssh -X and then type firefox. A Firefox window will pop up. Once the window is open, you can use it just like the browser on your local machine, only slower. Because the browser is running on the cluster, you can save files from the web directly to your account.

[Screenshot: Firefox running on the cluster]

Both methods work pretty well for transferring large files from the web. In my case, the second method was faster than the first. You can try both and decide which one you like.

Addendum on Mar 10th, 2014

One of my coworkers suggested a Mac application for mounting drives called MacFusion. It is easy to set up. The only things that need to be preinstalled are, again, these two programs.

1) FUSE for Mac OS X
2) SSHFS

Download from here

MacFusion download

[Screenshot: MacFusion]

The setup needs pretty much the same information as above, and most people will have no problem using it if the methods above worked.

Command Line Interface of Globus Online

This week I am exploring the command line mode of Globus online. The web-based Globus online is very easy to use and has enough features for normal use. However, it has some limitations; for example, you cannot change file names when you transfer. The command line mode allows finer control over the details of a transfer. As another example, you can give each transfer a name, so you can keep track of each task more easily.

I am going to do this in a Linux terminal. If you are in a Windows environment, one way to get a command line is to install Cygwin. This software provides a Unix/Linux-like environment, so that you can run commands similar to those on Unix/Linux. If you are already using Linux, you do not need to do anything; just open a terminal and start there.

After installing Cygwin on a Windows machine, go to the Cygwin directory (usually C:\cygwin) and edit the Cygwin.bat file.

@echo off

C:
chdir C:\cygwin\bin
set CYGWIN=binmode ntsec
bash --login -i

After editing, save the file and double-click Cygwin.bat to run the program. It will open the command line terminal shown below. Then type

cygrunsrv -h

If Cygwin is successfully installed, you will see options for cygrunsrv command.

[Screenshot: cygrunsrv -h output in the Cygwin terminal]

CONNECTING USING SSH

OK, from here on I will be doing everything in a Linux terminal. A lot of details are provided here (intro), here (getting started) and here (beyond basics), so please refer to these sites if you need more information. I am also assuming you already have a user ID and several endpoints activated for Globus online. The first time, you need to generate SSH keys.

>ssh-keygen -t rsa -b 2048

It will generate a key pair: the private key in a file called id_rsa and the public key in id_rsa.pub.
It will also ask you to enter a passphrase. Please remember what you type in.

Generating public/private rsa key pair.
Enter file in which to save the key (/home/user_name/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):

Open the id_rsa.pub file and copy its entire content. Then go to Manage Identities. Click “Add SSH Public Key”. Enter an alias (name) and paste the key into the SSH Public Key box. Click “Add SSH Key”.

[Screenshot: adding the SSH public key]

Now go back to the Linux terminal and try connecting to Globus online. The format for connecting is

ssh globus_username@cli.globusonline.org

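You may see an error message

Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

In my case, this error was fixed by changing the permission of the key file to “read only”. (ssh also refuses to use a private key file, id_rsa, that other users can read, so check its permissions too.)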

 chmod 400 /path/id_rsa.pub

Try the ssh command again. This time, after you enter the passphrase you specified above, you will hopefully see a welcome message.

Welcome to globusonline.org, user_name. 
Type 'help' for help.
$

Now your machine is connected to globus online. Note that you see a ‘$’ on command line prompt.

TRANSFERRING FILES

Let’s try transferring a file using the command line. The basic format for a transfer is

transfer -- user_name#endpoint1/path/to/source/file user_name#endpoint2/path/to/destination/dir

Once the command is executed you will see a message.

Task ID: 26116978-2703-11e3-99f8-12313d2005b7
Created transfer task with 1 file(s)

There are a number of options you can use. To see all the options, go here or type transfer -help.

If you want to change the file name during the transfer, append the new name to the destination path (for example, user_name#endpoint2/path/to/destination/dir/new_file_name).

RENEWING CREDENTIALS

You cannot transfer files if your credentials for an endpoint have expired. If that is the case, you need to renew them. Type

endpoint-activate

This will prompt you to enter a username and password for each endpoint you have access to.

If you want to activate a specific endpoint, type

endpoint-activate -m myproxy_server

It will prompt you to enter a username and password for this proxy server.

OTHER COMMANDS

There are a number of other commands you can use on the command line. mkdir creates a new directory, rename changes the name of a file or directory, and ls shows the content of a remote server. Please refer here for more details.

If you want to quit command line mode of globus online, simply type,

quit

Transferring Files to Your Own Server with Globus Online

I have benefited a lot from Globus online, as I have many files to search for mass spec every day on the computer cluster at my institution. In this post, I want to explore how to set up your own server to send files back and forth from your desktop PC. This is useful in general for sending relatively large files from one place to another.

First, I am assuming you have a server computer you have full access to. In my case, I have a server at home running Ubuntu 10.04.02. You need to find the right distribution of the globus-connect-multiuser program. You can see the list here. The installation instructions for globus connect multiuser are written here; please use them as guidance.

In my case, I couldn’t see one for Ubuntu, so I asked the Globus team. They told me I should use “globus-repository-5.2-stable-lucid_0.0.3_all.deb”. Here are the steps to configure the server for globus online multiuser.

1) Download the package
>sudo curl -LOs http://www.globus.org/ftppub/gt5/5.2/stable/installers/repo/globus-repository-5.2-stable-lucid_0.0.3_all.deb

2) Install the repository package (Debian-based distribution)
>sudo dpkg -i globus-repository-5.2-stable-lucid_0.0.3_all.deb

3) Update the package list
>sudo aptitude update

4) Install globus-connect-multiuser
>sudo aptitude -y install globus-connect-multiuser

5) Update the configuration file. This file is in the /etc/ directory. To modify it, you need root permission
>sudo vim /etc/globus-connect-multiuser.conf

[Screenshot: globus-connect-multiuser.conf]

There are quite a few things you need to change in order to get it to work. What I am going to show here is a minimum setting. For more detailed settings, please consult Globus online customer service.
First, you need to change the following lines (the L numbers are line numbers in the configuration file). Note: you need to remove the %, the s and the semicolon (;) from the lines you configure.

L11  User = user_name_you_use_to_log_in_globus_online
L16  Password = your_password_for_globus_online
L22  Endpoint = same_as_User
L29  Name = server (whatever you want to call your server)
L103 Server = XXX.XXX.XXX.XXX  (the server’s IP address)
L112 ServerBehindNAT = True
L193 server = XXX.XXX.XXX.XXX  (the server’s IP address)

6) Run the installed program. This will take a few moments to take effect
>sudo globus-connect-multiuser-setup

7) Check if essential ports are open (LISTEN). Type sudo lsof -i

[Screenshot: output of sudo lsof -i]
Pay attention to the far right column, which shows the status of the ports currently in use on your server. You can see that port 7512 is open (LISTEN) for MyProxy, and gsiftp is also open (LISTEN). If you want to know the port number for gsiftp, you can look it up in the services file.

>vim /etc/services
[Screenshot: gsiftp entry in /etc/services]
This shows only part of the file, but you can see that port 2811 is used for gsiftp. Now the ports are open for globus connect multiuser, but you need to make sure they are also accessible (open) from a remote computer. This site makes it easy to test whether certain ports on your server are actually open. Simply type in the IP address and port number (7512 and 2811). If it says the ports are closed, you should check whether port forwarding is correctly set on your router.
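
If you have shell access to a machine outside your network, the same check can be sketched in bash instead of using a website. This is an assumption-laden sketch, not part of Globus: port_open and check_globus_ports are hypothetical helper names, and it relies on bash's /dev/tcp feature and on the GNU timeout command being installed (typical on Linux).

```shell
#!/usr/bin/env bash

# Succeeds if a TCP connection to host $1, port $2 can be opened within 5 seconds.
port_open() {
    local host="$1" port="$2"
    # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT.
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Report the state of the MyProxy (7512) and gsiftp (2811) ports on host $1.
check_globus_ports() {
    local host="$1" port
    for port in 7512 2811; do
        if port_open "$host" "$port"; then
            echo "$port open"
        else
            echo "$port closed"
        fi
    done
}
```

Run check_globus_ports XXX.XXX.XXX.XXX from the remote machine. Running it on the server itself only tells you that the services are listening, not that the router forwards the ports.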

8) Go to the globus online website. Log in, go to Manage Data, and click manage endpoints. Here you are going to add your server.

[Screenshot: add endpoint form]

Enter Endpoint Name: username#server
Choose MyProxy for Identity Providers. The hostname should be the same IP address used above.
Leave Server DN empty.
Server Domain should be the same IP address used above.
Keep the default server port: 2811
Hit the Create Endpoint button. Then click the activate tab and hit the activate now button. It will ask you to enter User name, Passphrase, Server DN and Credential Lifetime. Enter the username and passphrase you use to log in to your Linux server. You can leave the Server DN empty and put a number (e.g. 24) for the credential lifetime. You will then see an error message. Copy the text after MYPROXY_SERVER_DN=, go to Server and paste it into the Server DN field (without the double quotation marks). Hit Save.
[Screenshot: activation error showing MYPROXY_SERVER_DN]
Try entering the Linux user ID and passphrase, then activate again. This time it should be activated.
[Screenshot: endpoint activated]
Now your server is activated and ready to transfer files. Go to Manage Data and click start transfer. Then enter the endpoint for your server and click Go. You need to enter your user ID and passphrase for the Linux server and the credential lifetime again.

[Screenshot: start transfer page]

Once everything is successfully configured, you should see the directory in the window. Now you can start transferring your files. Essentially this sets up an FTP server, but you can transfer files at a much faster speed.

Initially I had a problem transferring files. I saw the directory structures on both sides, but when I initiated a transfer, the transferred files had no contents. I could create and delete files and directories, but transferring files was unsuccessful. If you encounter a similar problem, think about these possibilities.

1) Linux firewall and/or your router firewall is blocking
2) Port forwarding is not set up correctly on your router

If a firewall is blocking certain ports, it may cause trouble sending files. Remember, globus-connect-multiuser uses ports 50000-51000 by default. In my case, 2) was the problem. My router has a port forwarding setting, but it separates specific port forwarding from port range forwarding. Once I fixed it, transfers worked flawlessly.
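
For possibility 1), here is a sketch of the Linux firewall side. It assumes your server uses ufw (the firewall front end shipped with Ubuntu) and that all three port groups mentioned in this post use TCP; globus_ufw_rules is a hypothetical helper name. The function only prints the commands so you can review them first. Router port forwarding is a separate setting that it does not touch.

```shell
#!/usr/bin/env bash

# Print (do not run) ufw commands for gsiftp (2811), MyProxy (7512)
# and the default data port range (50000-51000).
globus_ufw_rules() {
    local rule
    for rule in 2811/tcp 7512/tcp 50000:51000/tcp; do
        echo "sudo ufw allow $rule"
    done
}
```

Calling globus_ufw_rules lists the three commands; once you are happy with them, you can run them one by one or pipe the output to sh.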

If activation of the server, connection to the server and port settings are all done correctly, globus connect allows transferring files between your PC and the server. If you look at port usage, you will see that a new connection is established.

[Screenshot: port usage after initiating a transfer]

Solution to Transferring Extremely Large Files

Bandwidth is the communication speed (bit rate) at which you can move data to and from computer resources. It is usually expressed in bit/s, kbit/s, Mbit/s or Gbit/s (these are bits, not bytes: 8 bits = 1 byte). If you have home internet access, you usually pay for the speed. In the case of Time Warner Cable, you pay $19 for 1Mbps, $29 for 3Mbps, $34 for 15Mbps per month and so on. One Mbps is 1 megabit per second, which is 125 kbyte/s, and this usually defines the maximum transfer speed of the data.

As I work at a university, I have a pretty fast connection (100Mbps = 12.5 Mbyte/s). But I hardly ever get such a speed for any transfer. For example, if I use SCP, the transfer rate is usually 80-200 kbyte/s, which is about 1-2% of the maximum speed. Even with the Globus online service, I get about 1-1.5 Mbyte/s. At this rate, transferring a 1GB file takes about 10 min. Not so bad, but if I want to transfer 100GB, it will take 16+ hours. I don’t know if this is tolerable for you; it depends on your situation. Certainly, if you want to move 1TB, this will take about 160 hours = 6.7 days, and that assumes the transfer runs flawlessly. Obviously, in this case you would rather send a 1TB drive by FedEx overnight than use Globus online.
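
The arithmetic behind these estimates can be redone for your own numbers with a small sketch like the one below. The function names are made up for illustration, and it assumes 1 GB = 1000 MB and a transfer rate that never drops.

```shell
#!/usr/bin/env bash

# Convert a link speed in Mbit/s to Mbyte/s (8 bits = 1 byte).
mbps_to_mbytes() {
    awk -v mbps="$1" 'BEGIN { printf "%.1f\n", mbps / 8 }'
}

# Hours needed to move $1 gigabytes at $2 megabytes per second.
transfer_hours() {
    awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / rate / 3600 }'
}

mbps_to_mbytes 100       # prints 12.5
transfer_hours 100 1.5   # prints 18.5
transfer_hours 1000 1.5  # prints 185.2
```

At exactly 1.5 Mbyte/s the results come out a little above the round figures in the text, which correspond to roughly 10 minutes per GB (about 1.7 Mbyte/s).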

Why can we use only a fraction of bandwidth for data transfer?

The problem with using FTP or HTTP for data transfer is that you are relying on TCP (Transmission Control Protocol). TCP, or TCP/IP, provides reliable, ordered, error-checked delivery over the net, but it can be a very slow protocol. There is an inherent problem in transferring data over long distances using TCP: packets are lost more frequently and acknowledgements take longer to come back, and therefore the speed gets slower.

The sender of TCP packets has to receive acknowledgements from the receiver before it can send more than a limited window of data. When an acknowledgement is not received, the sender assumes the network is congested and slows down the transfer to avoid congestion (even if there is no congestion).

Solution to the slow transfer

Aspera is a company that provides a solution to the bottleneck of using TCP for transfer. According to Aspera, their program is independent of network delay and suffers little from packet loss even over long distances (e.g. between continents). The technology is called fasp, a new protocol for large data transfer. With this protocol, even at 10% packet loss, it reportedly achieves 90% utilization of bandwidth with minimal redundant data transfer.

It is really fast

They claim the transfer speed is up to 1000 times that of standard FTP. The benchmark on their website is shown below.

[Image: Aspera benchmark]

As you can see, with their program you can transfer large files at blazing speed. With the fastest bandwidth, a 100GB file transfer takes only 1.4 min. Wow!!

Is it really true?

I tried downloading a few files using the Aspera program. Using their program to send your own files is not free, but downloading files from a site that uses Aspera can be free (if the site doesn’t charge for downloads). The Clinical Proteomic Tumor Analysis Consortium (CPTAC) has a collection of proteomic research data that can be freely downloaded from their website. When you download files from their website, you can use the Aspera plug-in for free.

[Screenshot: CPTAC website]

It is pretty fast

Downloading 700MB of files (in total) took about 2 min. You can see that, with my bandwidth of 100Mbps, it uses 43% of the total capacity for downloading. 43Mbps is 5.4 Mbyte/s, which is 50-100 times faster than FTP and about 5 times faster than Globus online. At this speed, I can download a 100GB file in about 5 hours. It doesn’t seem to be able to use 90% of the bandwidth I have, but it is still significantly faster. If I had a gigabit per second connection, this should take less than 1 hr. I can certainly see the advantage of using their program for very large file transfers. In fact, large companies such as Netflix use Aspera, as they need to transfer large amounts of data every day.

[Screenshot: Aspera transfer in progress]

How much does it cost?

It is not free, unfortunately. Good service doesn’t come for free. I searched the web for pricing and found that it is charged by the hour. I guess if you are a frequent user, you need to be able to justify the cost. But if you have a very fast connection and lots of data to transfer, this could be the solution.

[Image: Aspera pricing]

Big Data Transfer Using Globus Online

One of the major problems in dealing with large data is transferring files over the network. Mass spec files are fairly large, often exceeding a gigabyte. Computation of mass spec results may take some time, but file transfer can take longer than the computation in some cases. If a mass spec file is transferred while it is being acquired, or as soon as the mass spec run is over, the total transfer time is similar to the mass spec run time (I am assuming that your network connection is faster than the rate at which mass spec raw data is generated). For a few file transfers, SCP/FTP is OK, but large file transfers are better done with Globus online.

[Screenshot: Globus online website]

What this site does is transfer files (especially large ones) across the internet quickly and reliably. You can also share files with multiple users. Operation is as simple as Dropbox.

First, create an account for yourself at Globus online by clicking “sign up now”. After verifying your account by email, you can immediately start using the service. There are things you can do for free, such as:

1) File transfer and synchronization to/from servers

2) Create private and public endpoints

3) Access to shared endpoints created by others

Things that are NOT free ($7/mo or $70/yr):

1) Peer-to-peer file transfer and sharing

2) Create and manage shared endpoints

As far as I understand, if your university has a server that is already signed up for Globus, you can create an endpoint on your computer and start transferring files between the registered server and your PC without any charge.

These are the steps to use Globus online for file transfer.

1) Sign up for a new account
2) Log on to Globus online, go to “Manage Data” at the top of the page, and select “manage endpoints”.
3) Click “add Globus Connect”.
4) Type an endpoint name (e.g. myPC) and click “generate setup key”.
5) Select the operating system you are using (Mac, Linux or Windows).
6) Install Globus online on your computer (automatic).
7) Open the Globus Connect application.
8) Copy the setup key and paste it into the box in the Globus Connect app. Hit OK.
9) Click “start transfer” under the “Manage Data” tab at the top.
[Screenshot: start transfer page]

10) Type your endpoint name (account_name#computer_name) in the Endpoint box (either left or right). Then click “Go”.
11) Type the server endpoint name in the other Endpoint box. Then click “Go”.

Once you see the file structures on both sides, you can start transferring files. Select multiple files by holding down the CTRL key, then hit the arrowhead to send the files.

Once a transfer has started, you can see the job on the transfer activity page.

The real advantage of transferring files using Globus online is that not only do you get faster transfers over the internet, you also don’t have to log in to the computer you want to transfer files from. This means that if you have three computers (A, B and C), you can direct a file transfer from A to B, or vice versa, using computer C.

You can also quickly synchronize a directory by checking the option “only transfer new or changed files”. You can further choose how new or changed files are defined.

[Screenshot: transfer options]

I also put a link to Globus online manual here.

Note: I initially had trouble sending files from my computer and found that my computer was in a part of the University Hospital network that restricts the ports used by Globus online. Since it was not possible to ask the network admin to open a port for Globus online, I had to use a VPN to work around the issue.

Overall, I am pretty satisfied with the speed of transfer. Globus online is ~5x faster than SCP.
