Wednesday, October 3, 2012

Re-plumbing the High-Throughput Sequencing Pipeline

The clogged pipe - when pipelines become a tangled mess

A common task among bioinformaticians working on next-generation sequencing is the creation of pipelines - the piecing together of tasks that take the data from its raw form (typically short sequencing reads) to information that can be interpreted by the investigator (such as the identification of disease-causing genetic variants).  It is natural to think of this abstractly as data flowing through a pipeline.

This idea of a pipeline becomes less and less tenable as analyses become more complex and more numerous.  For example, let's say you want to create two typical pipelines: one that calls variants and one that calculates gene expression, as shown below:



So to call variants, you would run the top pipeline, and to calculate gene expression, you would run the bottom pipeline.  Seems simple enough, right?  Let's take a look at the computation that occurs under the hood:


The thing to note here is that the first two of the three steps in these pipelines are the same.  Creating separate pipelines that have substantial overlap leads to the following issues:

  • Redundant running of computationally intensive tasks
  • Redundant efforts in creating new pipelines
One could argue that these effects can be minimized by checking whether a file exists before executing a step.  This would prevent tasks from being needlessly re-run.  However, the burden still falls upon the pipeline creator to describe every step of the pipeline from start to finish and to implement the logic needed to determine when to run each step, which leads to redundant code across overlapping pipelines, as in the sketch below.
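To make this concrete, here is a hypothetical hand-rolled pipeline script of the kind I am arguing against.  The file names and wrapper scripts (align_reads.sh and friends) are purely illustrative placeholders - the point is that every step must repeat the same existence-checking boilerplate, and a second, overlapping pipeline would have to copy the first two blocks verbatim:

#!/bin/bash
# Naive variant-calling pipeline: each step checks for its own output
# before running.  (The wrapper scripts are illustrative placeholders.)
if [ ! -e sample.sam ]; then
    ./align_reads.sh ref.fa sample.fastq > sample.sam
fi
if [ ! -e sample.sorted.bam ]; then
    ./sort_alignment.sh sample.sam sample.sorted.bam
fi
if [ ! -e sample.vcf ]; then
    ./call_variants.sh ref.fa sample.sorted.bam > sample.vcf
fi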

The pipe dream - minimizing unnecessary work

All is not lost.  An alternative and more general approach to pipelines is to use a dependency tree.  Each step has a defined set of input files and output files and knows how to create the output from the inputs.  If one of those inputs does not exist, then the step at hand says "go off and create that input and don't come back until you do!", and this proceeds in a recursive manner.  An analogy would be my feeble attempts at cooking spaghetti.  My output is spaghetti and my inputs are noodles, ground beef, and Prego tomato sauce.  Of course, there are a multitude of complex steps involved in making the Prego sauce, but that does not concern me - I only care that it is available at the grocery store (a better analogy would be if the Prego sauce were only made when I demanded it!).  The two-pipeline example would look like the following using a dependency tree:

The advantages are that:
  • Tasks are only run as needed (lazy evaluation)
  • Bioinformaticians can focus on an isolated step in which a desired output is created from a set of input files, without worrying about the steps needed to create those inputs

The plumber - Makefiles to the rescue!

Unfortunately, I am not the first to discover the benefits of dependency trees, or even dependency trees within the context of bioinformatics.  The Unix Makefile system is a concrete implementation of the idea of a dependency tree.  People familiar with programming ought to be familiar with Makefiles, but I would venture to guess that not many have thought about their usefulness beyond compiling code.  A Makefile specifies the dependency tree by stating the inputs and output of each task along with the recipe for creating that output.  Make then determines whether the inputs are available and creates them if they are not, implicitly managing that logic for you.
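As a rough sketch (and only a sketch - the file names and wrapper scripts are illustrative stand-ins for whatever your steps actually run), the two-pipeline example from above might be written as a Makefile like this:

# Makefile sketch of the two overlapping pipelines.
# Each rule states "output: inputs" followed by the recipe for creating
# the output; recipe lines must begin with a tab character.

# Delete a target if its recipe fails, so partial outputs don't linger.
.DELETE_ON_ERROR:

# Step 1: align reads to the reference (shared by both pipelines)
sample.sam: sample.fastq ref.fa
	./align_reads.sh ref.fa sample.fastq > sample.sam

# Step 2: sort the alignment (also shared)
sample.sorted.bam: sample.sam
	./sort_alignment.sh sample.sam sample.sorted.bam

# Pipeline 1 endpoint: call variants
sample.vcf: sample.sorted.bam ref.fa
	./call_variants.sh ref.fa sample.sorted.bam > sample.vcf

# Pipeline 2 endpoint: calculate gene expression
sample.counts.txt: sample.sorted.bam genes.gtf
	./count_expression.sh genes.gtf sample.sorted.bam > sample.counts.txt

Asking for sample.vcf builds only the alignment, sorting, and variant-calling steps; asking for sample.counts.txt afterwards reuses the already-built sample.sorted.bam and runs only the expression step.  The shared steps are written exactly once.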

Here are two interesting reads describing this type of approach in the context of bioinformatics:
In my next post, I'll talk about the actual implementation I use for running high-throughput sequencing analyses.  Akin to how MacGyver can fix anything with paper clips and rubber bands, I'll show you how to run analyses using nothing more than a Makefile and lightweight shell scripts that wrap informatics tools such as Samtools, the Integrative Genomics Viewer, and BWA.  The advantages of this approach are that you can
  • Run an analysis simply by declaring which file(s) you desire (such as make foo.bam; see the example invocations after this list)
  • Install it on a cluster and allow the Makefile to queue up jobs to run in parallel
  • Restart an analysis after a failure, without re-doing successfully completed intermediate steps
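To give a feel for the workflow (using the file names from the hypothetical Makefile sketch above), day-to-day use looks something like this:

# Ask for the end product by name; Make walks the dependency tree
# and runs only the steps whose outputs are missing or out of date.
make sample.vcf

# Build both endpoints, running independent steps in parallel (up to 4 jobs):
make -j 4 sample.vcf sample.counts.txt

# If a step fails, fix the problem and simply re-run; successfully
# completed intermediates such as sample.sorted.bam are not redone.
make sample.vcf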
This is something I have been developing over the past year and have been using successfully on a daily basis to submit jobs to the Sun Grid Engine on a several-thousand-node cluster.  I would be happy to hear any suggestions, feedback, or comments on this or future posts.

thanks!
Justin

