Bioinformatics tools for VCF files


With the ever growing abundance of Next Generation Sequencing (NGS) data there continues to be a challenge faced by the research community to not only standardize best practices and analysis methods, but to embrace the foray of open-source tools and file formats. Having been engaged since almost the beginning of this NGS data outburst with Illumina’s read length extending from 30bp to beyond 250bp, for the file format category SAM, BAM and VCF have become well accepted formats to some extent. I say well accepted because a few versions ago CLCbio still reported variants in a csv file format, but maybe I shouldn’t digress.

So in part due to the 1000 Genome Project and Hapmap consortium, formats like VCF are largely accepted as output format as most open-source tools report variants as VCF reports. This has allowed development of tools / parsers to use the file formats as a standard and…

View original post 215 more words

bigWig files

Genomics data are so abundant nowadays. The chances are someone has conducted similar experiments to yours. In fact, we should research hard before even thinking about running new samples because the last thing you want to know is that someone has published the data before you did.

Maybe one day your boss says he found an article xxx.  He says they did an interesting experiment and asked you to find out how the genes of interest behaved in their dataset.

So you went to the article. However, the article has no supplemental excel data sheet. You look hard and read the article to find out if they reported formatted data somewhere.

Only thing you see is GEO accession number of the study. So you search on google or went to GEO website. Then type the GSE accession number. You scroll down the page and found data are available on ftp server. But wait… it says bigWig file, what is this?

What is bigWig file?

On UCSC website it says following “The bigWig format is for display of dense, continuous data that will be displayed in the Genome Browser as a graph.”.

Not clear? bigWig files are indexed binary files which contain alignment data to the genome. It is similar to bed file which contains location of the genome where the read aligned to. Because it is indexed, it can be used to quickly find the location you want.

Visualize pile-up data on UCSF Genome Browser

To actually look at bigWig data, you don’t need programming skills. If you are using bigWig files found on GEO website, you DON’T need to DOWNLOAD the files. This is the key point-, UCSF GB can find the gene of interest very quickly and efficiently from bigWig files. Because it is indexed, the search is fast and no need to upload the entire file.

Step1: Get the URL for FTP server

When you download bigWig data, you click the FTP link on the GEO website. Screen Shot 2014-12-26 at 11.07.02 PM

You will be asked if you want to log in as guest or registered user. Log in as guest.  It will pop up a window with a folder which contains SRA files associated with the study. Open the folder, then right click (control + click for mac) and select “Get Info”.

Screen Shot 2014-12-26 at 11.13.38 PM

Now you copy the server URL.

Screen Shot 2014-12-26 at 11.26.43 PM

 Step2: Create A Track File

Track file is an essential file that will be used in the UCSC genome browser to hold the key information about the SRA files you want to visualize. To create, you open a text editor and type the following minimum information: track type & description, bigDataUrl. track type is “bigwig” and a concise and understandable sample name should be chosen for name. For description, you can add more detailed sample information, but you can leave as the same as sample name. Finally, bigDataUrl should be the ftp server address. See the example below.

Screen Shot 2014-12-29 at 12.13.44 PM

If you want to know more about track file. There is a  detailed tutorial for generating your own session here.

Step3: Upload A Track File

Once you created a track file, you need to upload it on the Genome Browser.

First, pick the right genome you want to use. In this example, I chose “insect” in clade, “D. melanogaster” in genome, and “BDGP R5/dm3” in assembly.

There are two ways to upload the file, either copy and paste the track file info into the “PASTE URLs” box or click “Choose File”, then “Submit”.  If there is no error in your file, you will see uploaded information in the window like below.

Screen Shot 2014-12-29 at 12.19.02 PM

Here I successfully uploaded the data for 6 samples. Then click “go to genome browser”.

Screen Shot 2014-12-29 at 12.21.11 PM

 Since I didn’t add any specific position of the genome for visualization, it uses the default position. You can change it by typing the actual position of the genome or simply your gene name of interest. I also note that there are so many things you can display, however many of them may not be necessary. Go change the items that you don’t want to see by changing the status of the item to “hide”, then click “refresh”.

I also want to point out that I used “dense” instead of “full” to show the aligned reads in the screen. To see the pile-up, just change it to “full” for the samples you want to see.

Step 4: Sharing it with your collaborators

What you see here can be shared with anybody you like. I would create an account if you don’t have any. Then log in from here.

Once you log-in, there are several options to share the session with someone else.

Screen Shot 2014-12-29 at 12.30.21 PM

 The first thing to save the session. Type name and click “submit”

Screen Shot 2014-12-29 at 12.32.38 PM

Now, you click “Email” to send the link by email. Alternatively, save the current setting to a local file. Then give the file to someone you want to share with. Once he/she gets the file, he/she goes to the same site, and just upload the file in the section “Use settings from a local file”. I prefer the second method, as sometimes when I used the old link, the results were not identical to the original ones.

My favorite commands Part3: ‘sweep’ function in R

In the last two posts, I described useful commands either to expand or reduce your data. In R, “apply” function is used to apply a function you specify to the data frame. For example, you can use “mean” function to calculate mean values for each column and row.


> data
 [,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12

> colMean<-apply(data,2,mean)
> colMean
[1] 5 6 7 8

> rowMean<-apply(data,1,mean)
> rowMean
[1] 2.5 6.5 10.5

You see that to apply mean function by column-wise manner, you use “2” in the second argument For row-wise mean, you use “1”.

The third argument in “apply” can be a user-defined one, as long as it takes the right data.  For example,

for (i in x){t=t*i}

This function multiplies all the numbers in x and return the value.

> multCol<-apply(data,2,myfunc)
> multCol
[1] 45 120 231 384

“apply” function is useful to this kind of simple column-wise or row-wise calculation, however it would be cumbersome if you want to add different values for each column/row.

sweep function in R

R has a convenient function to apply different values to data in different columns/rows. For example, you want to subtract “3”, “4”,”5″ ,”6″ from each value in the first, 2nd, 3rd and the last column. You can do this by simply applying sweep function.

> sweep(data,2,c(3,4,5,6),”-“)
[,1] [,2] [,3] [,4]
[1,] -2 0 2 4
[2,] -1 1 3 5
[3,] 0 2 4 6

Basic syntax of sweep function is like this

Screen Shot 2014-08-12 at 10.19.59 AM

sweep takes four arguments to work. The first one is the data you want to modify. The second one is the margin, if it is 1 then function is applied in row-wise manner. If it is 2, it is applied in column-wise manner. This is the same as “apply” function. The third argument is a vector with the same length as column or row. If the second argument is “1′, then the vector has to have the same length as row. If it is “2”, the length of vector needs to be the same length as column. The last argument is a binary operator, such as  – ,  + , * , / , < and >.

Standardizing data matrix using sweep

Gene expression data usually have large ranges and it is sometimes useful to standardize the data to see which genes go up or down more dramatically than others. The mathematics is pretty simple:

Screen Shot 2014-08-12 at 11.12.13 AM

You have normalized expression data first,  calculate median expression for each gene and then subtract the value and divide the results by median absolute deviation. Since we are subtracting median expression which differs for each gene, sweep can come into play here. By standardizing, all genes are centered to their median expression values and expression change is also normalized by dividing with median absolute deviation. In this way, gene expression changes are not biased by expression levels.

Here is the code:

standardize <- function(z) {
rowmed <- apply(z, 1, median)
rowmad <- apply(z, 1, mad)  # median absolute deviation
rv <- sweep(z, 1, rowmed,"-")  #subtracting median expression
rv <- sweep(rv, 1, rowmad, "/")  # dividing by median absolute deviation

Test this function

      sample1   sample2    sample3   sample4
gene1 -1.011736 -0.3372454 0.3372454 1.011736
gene2 -1.011736 -0.3372454 0.3372454 1.011736
gene3 -1.011736 -0.3372454 0.3372454 1.01173

There is a pretty similar function called “scale” in R, but it is slightly different from this function. If “scale” function is used, you will get:

> scale(t(data),scale=T)
             gene1      gene2      gene3
sample1 -1.1618950 -1.1618950 -1.1618950
sample2 -0.3872983 -0.3872983 -0.3872983
sample3  0.3872983  0.3872983  0.3872983
sample4  1.1618950  1.1618950  1.1618950

Note that you need to transpose the data because “scale” function works column wise-manner. You see that values are not the same because “scale” function uses standard deviation instead of median absolute deviation. To get identical results, you use a vector with median absolute deviation in the second argument for “scale” function.

> scale(t(data),scale=apply(data,1,mad))
gene1 gene2 gene3
sample1 -1.0117361 -1.0117361 -1.0117361
sample2 -0.3372454 -0.3372454 -0.3372454
sample3 0.3372454 0.3372454 0.3372454
sample4 1.0117361 1.0117361 1.0117361

My favorite commands Part2: Data reduction using ddply

In the previous post, I showed how to split the cell by a delimiter and copy other cells to expand the spread sheet. Sometimes you want to do the opposite, which is to reduce the number of rows which contain common IDs. For example, I have a spreadsheet that contains gene ID and length and GC content of 3’UTR for rat. As you know there are many genes with various 3’UTRs, therefore you have multiple rows for each gene. Now I may want to make a summary of 3’UTR length and GC content by averaging. Maybe I want to create a column with ID, min length, avg length, maximum length and so on. In this post, I am going to show you a simple example of how to reduce data by applying function to multiple rows with common IDs.

Screen Shot 2014-07-27 at 8.45.40 AM

You see that first column entrez ID is an identifier which appears multiple times in the data. The data were sorted, but it doesn’t have to be. The goal of simple task is to create a new data frame with ID, average of length (2nd column) and GC content (3rd column).

Screen Shot 2014-07-27 at 8.45.45 AM

How do you accomplish this task? There is a very convenient command in R called ddply. Let’s install and load package called plyr.


The help page of ddply command shows

ddply(.data, .variables, .fun = NULL, ..., .progress =
 "none",.inform = FALSE, .drop = TRUE, .parallel = FALSE,
 .paropts = NULL)

They key is the first three arguments. The first one is the dataframe you want to  work with. The second one is the column name to get rows with the same information. In the example above, that would be “entrez ID”. The third argument, function, is the most critical one. The command will first look at the top cell of the column specified by the second parameter. Then it will find all rows with the same information. For example, assumethe following  ddply function is called,


If data1 is the dataframe of the example on the top of page, ddply will grab the first entry in the second argument “entreZID”, which is “24244”. Then it will look for other rows that also contain “24244” in entrezID. In this example, there are two rows with the same “entrezID”, it will send a dataframe with theses two rows to the function of the third argument. Screen Shot 2014-07-27 at 4.02.09 PM

So you need to create a function which takes a dataframe like this  and returns a dataframe with average data below. Screen Shot 2014-07-27 at 4.04.24 PM

Let’s create a function called calavg


This simple function takes a dataframe and applies column-wide  mean function. The results are stored in a dataframe “avg”, which will be returned upon completion. So once you call ddply script, you should get:

> ddply(data1,.(entrezID),calavg)
entrezID length_3utr GC_3utr
1 24244 1789.5000 0.4593695
2 24346 157.0000 0.4076433
3 24356 3394.5000 0.4298199
4 100912619 114.3333 0.4249794

Writing a proper function is the most critical and it is not hard to implement more complicated methods for data reduction. For those who want to explore more about ddply, please refer below.

My favorite commands Part1: Split and copy using AWK

I haven’t updated my blog for while…… I have been pretty busy since the beginning of 2014 but that was my excuse, shame on me.  Some people told me my blog is useful and that gave me motivation to update. Let’s do it!!

Before I started extensively using R or linux command line  as a main tool for manipulating data, my main data analysis software has been microsoft excel or access program. They do work for many things, but certain things are more difficult to implement.  For example,  you have data like this

Screen Shot 2014-07-18 at 1.12.14 PM

These are real data I got for 3’UTR sequences from Rat. Notice that you have 4 columns and the last column has Entrez ID where two cells contain multiple IDs separated by semicolon. The task is to split the Entrez ID column for individual IDs and create new row(s) with the first three columns. So the final data looks like this.

Screen Shot 2014-07-18 at 1.08.32 PM

Do you see what I want to accomplish? Colored parts are identical, notice that Entrez ID column has only single entry now. Before, IDs in a single cell were separated by “;”. After split, each entrez ID has its own row. This can be done in two command line scripts in linux/mac terminal.

1) Add “;” between the third and fourth column.  More precisely, after the third delimiter.

>awk -F “\t” ‘{print $1″\t”$2″\t”$3″\t;”$4}’ rat_3utr.xls > temp

2) Split Entrez IDs and copy the first three columns

>awk -F “;” ‘{print $1″ “$2}{for (i=3;i<=NF;i++)print $1” “$i}’ temp > rat_3utr_single_ID.xls

3) remove temp file

>rm temp

 Syntax of AWK using -F option

AWK is extremely useful command to manipulate text. AWK is available in both linux and mac command line without installation.  What it does is you specify the pattern of string, and action(s) provided on matched strings. With -F option, you specify delimiter {e.g. \t for tab}, then the awk command will split the each line from the input file  into the parts separated by the delimiter. The parts are stored in $1, $2, $3 … in the order. The total number of parts are stored in NF. Here is the basic syntax of AWK with -F option

awk -F delimiter” ’{action 1}{action 2}{…}{action n}’ input_file

So 1) is essentially splitting the line  into 4 parts separated by tab, stored the parts in $1 ~ $4. Then, print the first 3 parts followed by “;” and the last part ($4). The second line is similar, but this time, “;” is the delimiter so $1 contains three columns and $2 is the first Entrez ID. You repeat this printing process until it reaches to the last ID.

In order for these scripts to work, you need

1) the column you have multiple IDs are the far right.

2) The data are tab delimited (you can have any delimiter you like, just specify the right one).

3) Use for loop or something in the first command if you have lots of columns

The script should work for many IDs in the last column and complete the task very fast.


%d bloggers like this: