Archive | Relational Databases RSS for this section

Connecting Any Number of Tables using SQL in R

In the previous post, I showed how to create a relational database using sqldf package in R. The example was with three tables but I would like to expand this program for any number of tables. Users can place as many (csv) files as they want in one directory, then the program reads all files, and connect them to create a database. I think this makes the program more practical as the number of files to connect varies dependent on the situation.

The input csv file in this example is derived from IDPicker. Note that gene accession on the left has only unique entries (no duplicates).
Screen Shot 2013-10-17 at 11.00.53 PM

Place multiple csv files in one directory.
Screen Shot 2013-10-17 at 11.03.43 PM

Then run a program to make a file looking like below. You see Gene ID (accession) and description followed by data from each file. If data are missing, you see NA.
Screen Shot 2013-10-17 at 11.07.31 PM

The structure & process of the program is

1) Install and read sqldf library (before running the program)
2) Specify directory with setwd() function
3) Read all tables (.csv) into R
4) Extract key (accession) from each table and create a table (key_table) containing all keys from the files using union
5) Connect a first table to key_table
6) Connect the second table to 5)
7) Loop connecting files until all files are connected
8) Save and return a connected table

I think it is better to use this program as function, and this function takes a directory as parameter. So you will use this function like

data<-connectAll("/home/data/exp1")

It will be also useful if the created table is saved automatically in the same directory. I will put this parameter as well (default will be FALSE).

data<-connectAll("/home/data/exp1",save=TRUE)

1) Install and read sqldf library (please see previous post for dependent packages for sqldf).

install.packages(sqldf)
library(sqldf)

2) Specify directory with setwd() function
The program takes an argument from the function call and change the directory. So it  starts like below.

connectAll<-function(dir, save=FALSE){
setwd(dir)
.......
}

3) Get file names in the directory, then read all tables (.csv) into R

files<-list.files(pattern=".csv")
rdf<-lapply(files, read.csv)

4) Extract key (gene IDs) & description from each table and create a table (key_table) containing all keys from the files using “union” in SQL. Note that you need to store each table in a new dataframe vector called x. If you put this statement inside of sqldf, you get an error. Then for loop will add unique entry of gene IDs until it reaches to the end. “length()” command takes care of the number of looping.

ftable<-key_table
for(i in 2:length(files)){
  x<-as.data.frame(rdf[i])
  ftable<-sqldf("select Accession, Description
  from key_table union select Accession
  , Description from x")
  }

5) Connect a first table to key_table
6) Connect the second table to 5)
7) Loop connecting files until all files are connected

ftable<-key_table

for(i in 1:length(files)){
  x<-as.data.frame(rdf[i])
  colnames(x)[colnames(x)=="Filtered.Spectra"]<-
    "Filtered_Spectra"
  ftable<-sqldf("select ftable.*,x.Filtered_Spectra
     FROM (ftable left join x ON ftable.Accession 
     = x.Accession)")
  colnames(ftable)[colnames(ftable)==
    "Filtered_Spectra"]<-paste(gsub("([A-Za-z0-9]+).csv"
    ,"\\1",files[i]),"_Filtered_Spectra",sep="")
  }

Here, I picked “filtered.spectra” column to show in the output. The sqldf doesn’t like having  a dot “.” in the name because it is confusing…. SQL uses “.” to specify column from tables, so having “.” in the name of the column doesn’t work. The last line adds file name for each column because it would be difficult to look at the table if all columns have the same name “filtered_spectra”. I selected “filtered_spectra”, but user can change or add to it depending on what they want to see.

8) Save file and return connected table. The name of the file will be “final.csv”.

if(save==TRUE){
#Replace all commas to spaces in description column
ftable$Description<-gsub(","," ",ftable$Description)     
write.table(ftable, file="final.csv",quote=FALSE,sep
  =",",row.names=FALSE) 
}
return(ftable)

This function is easy to expand more functionality. For example, you can add another argument to pick columns you want to show in the final table. This program was written for protein ID, but it can be done at ion levels with a charge state.

If you have to analyze data with database software (e.g. microsoft access) on daily basis, and your work can be done using the program like this, it will cut a significant amount of your time.

Final code is here.
Test csv files are here.

Advertisements

Create A Quick Relational Database For Data Analysis Using R

Mass spec analysis pipeline often requires combining all data from a single experiment and create a relational database for final visualization. Analysis programs such as PeptideShaker or IDPicker produce ID files for individual samples, but it doesn’t have a feature to combine different conditions and compare them at the same time. So I am going to show how you can combine multiple datafile using R.

To summarize the process

1) Save result files (.csv or .txt) in one directory

2) Load all files into R

3) Create a table that contains ID and Description without redundancy

4) Connect all data files to 3)

To create a relational database in R, there are several required packages. The most important one is called “sqldf” packages. However, there are some other packages in order to use this package. In my mac, these are required to use sqldf. Note that to use sqldf, you also need to install XQuarts.2.7.4 (or some sort) on your mac.

>install.packages(“DBI”)
>install.packages(“gsubfn”)
>install.packages(“proto”)
>install.packages(“chron”)
>install.packages(“RSQLite”)
>install.packages(“RSQLite.extfuns”)
>install.packages(“sqldf”)

Then load these packages

>library(“DBI”)
>library(“gubfn”)
>library(“proto”)
>library(“chron”)
>library(“RSQLite”)
>library(“RSQLite.extfuns”)
>library(“sqldf”)

Set  up working directory (I am going to  use R-test on desktop)
>setwd(“~/Desktop/R-test”)

Read data file (.csv) into R. For .txt file use an appropriate separator, usually tab delimited (sep=”\t”).

>files<-list.files(pattern=”*.csv”)
>files # show filenames in the working directory
>rdf<-lapply(files, read.csv) # files are loaded into rdf

rdf is a list composed of multiple tables loaded from the working directory.
For simplicity, we have three files in the directory. You can do step for many more files using for-loop or apply function. In this example,the data files contain three columns, Accession, Description and Score. For example,

table1
Accession Description Score
1                   Gene A          100
2                   Gene B           50
3                   Gene C          75

table2
Accession Description Score
2                   Gene B          42
5                   Gene E          88
3                   Gene C          37

table3
Accession Description Score
1                   Gene A          100
3                   Gene C           50
6                   Gene F          22

Note: There shouldn’t be any duplicates in accession in each table. If it does, the final list will be a longer one with many duplicates.

Convert the three tables in the list to data frame and stored in table 1, 2 and 3.

>table1<-as.data.frame(rdf[1])
>table2<-as.data.frame(rdf[2])
>table3<-as.data.frame(rdf[3])

Create a table that contains only Accession and Description using union

> unionall<-sqldf(“select Accession, Description from table1 union select Accession, Description from table2 union select Accession, Description from table3”)

Finally, merge all three files and accession+description table.

>mergeall<-sqldf(“select unionall.*, table1.Score, table2.Score, table3.Score FROM ((unionall LEFT JOIN table1 ON unionall.Accession = table1.Accession) LEFT JOIN table2 ON unionall.Accession = table2.Accession) LEFT JOIN table3 ON unionall.Accession = table3.Accession”)

The mergeall table should look like this

Accession    Description   Score Score Score
1                   Gene A         100    NA      100
2                   Gene B          50      42       NA
3                   Gene C          75      37       50
5                   Gene E          NA     88       NA
6                   Gene F          NA     NA      22

You see that each accession number only appears once. If some genes are not identified in particular samples, they will be left as NA.

%d bloggers like this: