Connecting Any Number of Tables using SQL in R

In the previous post, I showed how to create a relational database using sqldf package in R. The example was with three tables but I would like to expand this program for any number of tables. Users can place as many (csv) files as they want in one directory, then the program reads all files, and connect them to create a database. I think this makes the program more practical as the number of files to connect varies dependent on the situation.

The input csv file in this example is derived from IDPicker. Note that gene accession on the left has only unique entries (no duplicates).
Screen Shot 2013-10-17 at 11.00.53 PM

Place multiple csv files in one directory.
Screen Shot 2013-10-17 at 11.03.43 PM

Then run a program to make a file looking like below. You see Gene ID (accession) and description followed by data from each file. If data are missing, you see NA.
Screen Shot 2013-10-17 at 11.07.31 PM

The structure & process of the program is

1) Install and read sqldf library (before running the program)
2) Specify directory with setwd() function
3) Read all tables (.csv) into R
4) Extract key (accession) from each table and create a table (key_table) containing all keys from the files using union
5) Connect a first table to key_table
6) Connect the second table to 5)
7) Loop connecting files until all files are connected
8) Save and return a connected table

I think it is better to use this program as function, and this function takes a directory as parameter. So you will use this function like

data<-connectAll("/home/data/exp1")

It will be also useful if the created table is saved automatically in the same directory. I will put this parameter as well (default will be FALSE).

data<-connectAll("/home/data/exp1",save=TRUE)

1) Install and read sqldf library (please see previous post for dependent packages for sqldf).

install.packages(sqldf)
library(sqldf)

2) Specify directory with setwd() function
The program takes an argument from the function call and change the directory. So it  starts like below.

connectAll<-function(dir, save=FALSE){
setwd(dir)
.......
}

3) Get file names in the directory, then read all tables (.csv) into R

files<-list.files(pattern=".csv")
rdf<-lapply(files, read.csv)

4) Extract key (gene IDs) & description from each table and create a table (key_table) containing all keys from the files using “union” in SQL. Note that you need to store each table in a new dataframe vector called x. If you put this statement inside of sqldf, you get an error. Then for loop will add unique entry of gene IDs until it reaches to the end. “length()” command takes care of the number of looping.

ftable<-key_table
for(i in 2:length(files)){
  x<-as.data.frame(rdf[i])
  ftable<-sqldf("select Accession, Description
  from key_table union select Accession
  , Description from x")
  }

5) Connect a first table to key_table
6) Connect the second table to 5)
7) Loop connecting files until all files are connected

ftable<-key_table

for(i in 1:length(files)){
  x<-as.data.frame(rdf[i])
  colnames(x)[colnames(x)=="Filtered.Spectra"]<-
    "Filtered_Spectra"
  ftable<-sqldf("select ftable.*,x.Filtered_Spectra
     FROM (ftable left join x ON ftable.Accession 
     = x.Accession)")
  colnames(ftable)[colnames(ftable)==
    "Filtered_Spectra"]<-paste(gsub("([A-Za-z0-9]+).csv"
    ,"\\1",files[i]),"_Filtered_Spectra",sep="")
  }

Here, I picked “filtered.spectra” column to show in the output. The sqldf doesn’t like having  a dot “.” in the name because it is confusing…. SQL uses “.” to specify column from tables, so having “.” in the name of the column doesn’t work. The last line adds file name for each column because it would be difficult to look at the table if all columns have the same name “filtered_spectra”. I selected “filtered_spectra”, but user can change or add to it depending on what they want to see.

8) Save file and return connected table. The name of the file will be “final.csv”.

if(save==TRUE){
#Replace all commas to spaces in description column
ftable$Description<-gsub(","," ",ftable$Description)     
write.table(ftable, file="final.csv",quote=FALSE,sep
  =",",row.names=FALSE) 
}
return(ftable)

This function is easy to expand more functionality. For example, you can add another argument to pick columns you want to show in the final table. This program was written for protein ID, but it can be done at ion levels with a charge state.

If you have to analyze data with database software (e.g. microsoft access) on daily basis, and your work can be done using the program like this, it will cut a significant amount of your time.

Final code is here.
Test csv files are here.

About bioinfomagician

Bioinformatic Scientist @ UCLA

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: