Wednesday, October 31, 2012

Gene Names Are Broken

With the completion of the human genome project and the advent of next generation sequencing technologies, the wealth of information about our genes is growing at a rapid pace.  Figuring out the roles of these genes is a complex undertaking in and of itself.  However, we make things much harder for ourselves by giving genes multiple symbols or symbols that have another meaning.  Try finding all abstracts in PUBMED that discuss the gene KIT.  Not so easy (especially since the searches are case insensitive and kit may refer to many other things).  Some of the gene names are as amusing as they are ridiculous (see this blog for some interesting ones, such as pokemon or sonic hedgehog - the approved gene name).

Now, I'm not the first to notice this, and there is actually a committee called the HUGO Gene Nomenclature Committee (HGNC) to "assign unique gene symbols and names to over 33,500 human loci".  Too bad some of these symbols are utterly useless.  The symbols may be unique with respect to other gene symbols, but they are far from unique and distinguishing.  

Here are a few of my (least) favorites, there are numerous other examples:
Edit (Nov 5, 2012):  Here are a few more wonderful examples

And the worse one ever, drumroll . . . . 
T :  Yes, there is a gene with the approved symbol of T (I pity the fool). Good luck finding any information about that gene.

Here is a breakdown of the lengths of the gene symbols:
# of Names    Length of Name
           1                1
         31                2
       615                3
     3560                4
     6296                5
     4699                6
     2468                7
     1143                8
      216                 9
        18               10
          1               11
          0               12
          1               13

Why does this matter?  There is too much information out there for a single person or army of people to sit down and wade through.  There is more of a need for automated methods to assist in culling and processing the information.  But when it is a challenging problem just to find the terms we are interested in, we are starting down a difficult road before we even get into the car.  

Often times an abstract or paper will use the gene symbol rather than the full, laborious gene name, and these "official gene symbols" are too nondescript to be useful in an automated search.  As the rate of information about genes out-paces our abilities to manually curate it, useful information might be lost or false conclusions may be drawn due to the ambiguity of our naming conventions.  

Monday, October 15, 2012

Fun with Bacteria

Recently, I've taken an interest in metagenomics, which involves identifying micro-organisms directly from environmental samples (such as the ocean, soil, human gut, and even the belly button).  The identification of the organisms can be accomplished by reading the DNA within the environmental sample and determining which type of organism the DNA came from.  The advantage of this approach is that you can study micro-organisms that cannot be easily cultured in the laboratory, but the disadvantage is that all the DNA is mixed together, and it is a challenge to identify which species are represented.

One approach for identifying different varieties of bacteria is to look at a specific 1,500 length sequence referred to as 16S ribosomal RNA.  This sequence is ubiquitous in bacteria but various enough that different species of bacteria have slightly different sequences.  These sequences can be captured from a sample and read, and used as barcodes to identify what types and the abundances of bacteria are present.

I thought it would be an interesting exercise to plot the dissimilarities of these sequences.  Sequences that are closer to each other would indicate species that have diverged from each other more recently.  I was curious to see the relationships among these different bacteria species.  Data was obtained from the Human Oral Microbiome Database.   I used multidimensional scaling to plot the data in a low dimensional space that could be visualized.  The metric I used to calculate distances between sequences was the Levenshtein (edit) distance, which is the minimum number of edits (substitutions, deletions and insertions) to transfer one sequence into the other.

I color-coded the data points by the taxonomic ranks (Domain, Phylum, Class, Order, Family, Genus).  As one would expect, species that are in similar ranks tend to be closer together on the plots.  There seems to be three main clusters: (1) the Bacteroidetes, (2) the Spirochaetes, and (3) everything else.  The everything else group seems to contain clusters too, just not as separated as the main 3.  From the 3D plot, you can see that the Proteobacteria (Betaproteobacteria and Gammaproteobacteria) separate relatively well from cluster 3.

Here is the MDS plot colored by Class (2D):

and also 3D:

Also, here is the same plot colored by some of the other taxonomic ranks


* plots generated in Partek Genomics Suite

R code used to calculate multi-dimensional scaling:

ribo <- read.table("data.txt", header=TRUE)
fit <- cmdscale(ribo, eig=TRUE, k=3)
x <- fit$points[,1]
y <- fit$points[,2]
z <- fit$points[,3]

Wednesday, October 3, 2012

Re-plumbing the High-Throughput Sequencing Pipeline

The clogged pipe - when pipelines become a tangled mess

A common task among bioinformatics working on next-generation sequencing is the creation of pipelines - the piecing together of tasks that take the data from its raw form (typically short sequencing reads) to information that can be interpreted by the investigator (such as identification of disease causing genetic variants).  It is natural to think of this abstractly as data flowing through a pipeline.

This idea of a pipeline becomes less and less tenable when the analyses become more complex and more numerous.  For example, lets say you want to create two typical pipelines: one that calls variants and one that calculates gene expression, such as shown below:

So to call variants, you would run the top pipeline, and to calculate gene expression, you would run the bottom pipeline.  Seems simple enough, right?  Let's take a look at the computation that occurs underneath the hood:

The thing to note here is that the first two out of three steps in these pipelines are the same.  Creating separate pipelines that have substantial overlap lead to the following issues:

  • Redundant running of computationally intensive tasks
  • Redundant efforts in creating new pipelines
One could argue that these effects can be minimized by checking whether or not a file exists before executing a step.  This would prevent tasks from needlessly being re-run.  However, the burden still falls upon the pipeline creator to describe every step of the pipeline from start to finish and implement the logic needed to determine when to run a step, which leads to redundant code across overlapping pipelines.

The pipe dream - minimizing unnecessary work

All is not lost.  An alternative and more general approach to pipelines is to use a dependency tree.  Each step has a defined set of input files and output files and knows how to create the output based on the inputs.  If one of those inputs does not exist, then the step at hand says "go off and create that input and don't come back until you do!", and this precedes in a recursive manner.  An analogy to this would be my feeble attempts at cooking spaghetti.  My output is spaghetti and my inputs are noodles, ground beef, and Prego tomato sauce.  Of course, there are a multitude of complex steps involved in making the Prego sauce but that does not concern me - I only care that it is available at the grocery store (a better analogy would be if the Prego sauce was only made when I demanded them to make it!).  The two pipeline example would look like the following using a dependency tree:

   The advantages are that
  • Tasks are only run as-needed (lazy evaluation)
  • Bioinformaticians can focus on an isolated step in which a desired output is created from a set of input files and not worry about the steps needed to create those inputs.

The plumber - Makefiles to the rescue!

Unfortunately, I am not the first to discover the benefits of dependency trees, or even dependency trees within the context of bioinformatics.  The Unix Makefile system is a concrete implementation of the idea of dependency trees.  People familiar with programming ought to be familiar with Makefiles, but I would venture to guess not many have thought about its usefulness beyond compiling code.  The Makefile system allows you to specify the dependency tree by stating the inputs and output of each task and the recipe for creating that output.  The Makefile system will determine if the inputs are available and will create them if they are not, implicitly managing that logic for you.

Here are two interesting reads describing this type of approach in the context of bioinformatics:
In my next post, I'll talk about the actual implementation I use for running high-throughput sequencing analyses.  Akin to how Macgyver can fix anything with paper clips and rubber bands, I'll show you how to run analyses using nothing more than a Makefile and light-weight shell scripts that wrap informatics tools such as Samtools, the Integrative Genome Viewer, and BWA.  The advantages of this approach is that you can
  • Run an analysis simply by declaring which file(s) you desire (such as make foo.bam)
  • Install on a cluster and allow the Makefile to queue up jobs to run in parallel
  • Restart an analysis after a failure, without re-doing successfully completed intermediate steps.
This is something I have been developing over the past year and have been using successfully on a daily bases to submit jobs to the Sun Grid Engine on a several thousand node cluster.  I would be happy to hear any suggestions, feedback, or comments on this or future blogs.