Sunday, September 4, 2011

Technical Replicates Continued

Last week I looked at the theoretical side of comparing two technical replicates from next-generation sequencing experiments.  If all the variation can be accounted for by the randomness of sampling, then the read counts of a gene across technical replicates should follow a Poisson distribution.  By plotting the variance stabilized transformation for Poisson data, one can get a qualitative feel for whether the data is indeed Poisson distributed - the plot should be homoscedastic (equal variance across the entire range of values) with a variance within the expected range.

This week I decided to try the transformations and plots out on real data and see what I get.  I used the data from the Marioni paper that discusses reproducibility among technical replicates.  The data consists of technical replicates of both liver and kidney at various concentrations.  I used their available aligned data, and counted (using BEDTools) the number of reads that fell within each gene from the Hg18 knownGene annotation (I can provide some instructions and scripts for this if anyone is interested).

Raw Counts
Below is a comparison of the theoretical plots generated last time versus the plots of two Kidney technical replicates with the same concentrations for each sample:

Theoretical Raw Counts

Actual Raw Counts

Looks pretty similar to me.  As expected, variance grows with larger read counts (for Poisson distributed variables, the mean is equal to the variance).  There are more points plotted in the theoretical plot, but the main thing to look at is the shape of the plots.

Anscombe Transform
Here I transform the data using the Anscombe Transform, which stabilizes variance for Poisson data.  This transform will make the variance approximately equal to 1 across the entire range of values.  The different shades of red represent the standard deviations:

white = +/- 1 stdev
light red = +/- 2 stdev
medium red = +/- 3 stdev
dark red = beyond +/- 3 stdev

Variance Stabilization of Theoretical Counts using the Anscombe transform

Variance Stabilization of Actual Counts using the Anscombe transform

ArcSine Transform

Variance Stabilization of Theoretical Proportions using ArcSine transformation

Variance Stabilization of Actual Proportions using ArcSine transformation

It seems that the theoretical plots match up quite nicely with the actual plots.  The points are between +/- 3 standard deviations (represented as the lighter shades of red), and the variance is pretty much uniform across the range of values (note: the variance seems to decrease slightly for larger values, but I am not sure if it just appears that way since the points are less dense at higher values).

Let's compare the above plots to the log-log plot below:

This log-log plot also shows good correlation between the two samples, but it would be difficult to conclude whether or not the variability of read counts between samples can be accounted for by randomness of sampling alone.  In the plots above, on the other hand, checking whether the points fall within the expected variance range and whether the variance is uniform across all values gives a good indication of whether or not the data fit a Poisson distribution.

Technical Replicates with Different Number of Total Counts

I also compared technical replicates of Kidney that had different concentrations.  In the above example, the concentrations were the same and the total numbers of reads were roughly the same, so transforming either the raw counts or the proportions would work.  Here, however, the total read counts are not the same.  This is where proportions come in handy: we can compare the relative proportions of read counts rather than the absolute values.

Below is a plot of the raw counts with the Anscombe transform.  Notice that it veers off of the y=x line, as expected for samples with a different number of total reads.

Instead, we can look at proportions and use the arcsine transformation to stabilize variance:

It still looks like it veers off of the y=x line slightly, but it seems to work a little better than comparing the raw counts.

Kidney versus Liver

And just for fun, let's compare the kidney sample to the liver sample.  Quite a lot of differences between the two!

Sunday, August 28, 2011

The Use of Log-Log plots to show technical reproducibility of NGS data

Reproducibility, the ability for the same input to produce the same output, is an important concept in the sciences.  One of the claimed advantages of next-generation sequencing over its microarray predecessor is the high correlation among technical replicates.  One common way to show the reproducibility between two replicates is a log-log plot of RPKM values (RPKM normalizes number of reads by transcript length and mapped reads).  I would like to propose an alternative to the log-log plot that may have a clearer interpretation and may be more appropriate for read count data.

Log-Log RPKM plots
Below are some links showing the log-log RPKM plots between two samples.  The log-log plot is created by counting the number of reads that are mapped to the genes in Sample A and Sample B.  These reads are then transformed by first creating an RPKM value (normalize by transcript length and number of mapped reads) and taking the log of those RPKM values.  The logged RPKM gene values of Sample B vs. Sample A are plotted.

Links to Pictures of Log-Log RPKM plots between Technical Replicates
Figure 2A
Figure 4B

MA Plots
My guess is that the RPKM values are logged because that is what is done to microarray expression values.  A common plot that is used to compare two microarray samples is the MA plot, which plots the difference between logged expression values against the mean of the logged expression values.  It would seem reasonable to try this same plot for Next Generation Sequencing data, and you can find such plots (e.g. Figure 5).

The tear-drop effect
The log-log RPKM plots share a common shape - they start out wide and get thinner and thinner as you move along the x-axis.  To me, it sort of looks like a leaf or tear-drop.  One might try to conclude from this plot that the variance is smaller for genes with a higher number of reads.  Judging by a talk given at a recent Illumina User Group meeting, a common question is why the variance is seemingly higher for low expression values.  To throw in some statistical terms for good measure, people are expecting a homoscedastic plot (same variance across the entire range of values) instead of a heteroscedastic plot (differing variance).

My contention is that it is not possible to draw a conclusion about how the variance relates to expression levels from the log-log RPKM plot.  The tear-drop shape may just be an artifact of logging the RPKM values.  Lots of data look really good (i.e., fit a line) when you log them.  For example, say you have the data points (1, 10) and (10 thousand, 100 thousand).  Logging the values (base 10) gives (0, 1) and (4, 5), which show up as being equally distant from the y=x line.

If the data looked homoscedastic after logging the values, then the data would have a log-normal distribution.  This distribution seems to be a good fit for microarray data (see Durbin).  However, the tear-drop shape of the log-log RPKM counts may be the result of not using the right model - it has been shown (Marioni) that the variation across technical replicates can be captured using a Poisson model.  So why not use an appropriate transformation based on the Poisson model?

Variance Stabilization for Count data
Luckily, there is a transformation you can apply to Poisson generated count data which will stabilize the variance across all values of the data: the square root.  This will magically make the variance about 1/4, irrespective of the mean (although the approximation is more accurate for higher means).  A slightly more complex transformation is the Anscombe transform, which makes the variance approximately 1 and is a good approximation for means larger than 4.

Simulated Data
Normal Data
Below is a plot of homoscedastic data.  I randomly chose a mean "mu" between 0 and 20.  I then generated y1 ~ N(mu, 1) and y2 ~ N(mu, 1) and plotted (y1, y2).

The distance a point is from the line y=x is

(y2-y1) ~ N(0, 2).

Note that the difference of two normally distributed variables has a normal distribution with mean equal to the difference of the two means and a variance equal to the sum of the two variances.  The red bands in the above picture represent +/- 1 standard deviation, +/- 2 standard deviations, and +/- 3 standard deviations, where the standard deviation is sqrt(2).

Distance from the point to the line is (y2-y1), which will have a distribution of N(0,2) about the y=x line.

Poisson Data (untransformed)
Below is some raw Poisson data (no transformation).  In Poisson data, the mean is equal to the variance so the spread of points increases for larger values of x and y.

Poisson Data (Log Transformed)
Here is a log-transformed plot of the Poisson values.  Notice how the variance gets smaller for increasing values of x and y.  This is because the log transform is too aggressive for the Poisson distribution.

Poisson (Anscombe Transform)
With the Anscombe Transform, the variance is stabilized and the plot looks more homoscedastic.  Deviations from homoscedasticity would indicate that the data does not fit the Poisson distribution very well.

Arcsine Transformation of Proportions:
When comparing two next generation samples, one might be more interested in whether or not the proportion (reads / total reads) is the same across samples rather than the number of raw reads.  This is similar to dividing by total mapped reads in the RPKM calculation.  A variance stabilization method for proportions is the arcsine of the square root of the proportion.  The variance of n/N after the transformation is approximately equal to 1/(4N).

Let Y_i be the total number of mapped reads in sample i and let y_gi be the number of reads that map to gene g in sample i.  Given two samples, we would like to see if the proportion of reads that map to gene g in sample 1 (y_g1/Y1) equals the proportion of reads that map to gene g in sample 2 (y_g2/Y2).

Apply the variance stabilization transformation to get
x_g1 = arcsine(sqrt(y_g1/Y1));
x_g2 = arcsine(sqrt(y_g2/Y2));

Then, (x_g2 - x_g1) ~ N(0, (1/Y1 + 1/Y2)/4).
Below is a plot where the counts y_g1 and y_g2 were generated from a Poisson distribution with parameters mu and (Y2/Y1)*mu, respectively.  In this case, Y1 = 1 million, Y2 = 2 million, and the standard deviation is about sqrt((1/Y1 + 1/Y2)/4) = 0.0006.  The plot looks homoscedastic, as expected.

The log transform at first seems like an appropriate transformation for comparing next generation count data, just as microarray expression values were logged before comparing two samples.  However, the reason for logging the data was that it stabilizes the variance of expression values from microarray platforms.  Technical replicates of NGS data follow a Poisson distribution (note that biological replicates do not in general follow a Poisson - we are just talking about technical replicates here), and so one should use an appropriate variance stabilization transformation (VST) for a Poisson distribution.  If the data is truly Poisson distributed, then a plot of the VST of Sample 1 versus Sample 2 should be homoscedastic (variance is the same across all values of the data).  If the plot has differing variance, then this provides qualitative evidence that the data is not Poisson distributed, and there may be some "extra-Poisson" factors or "over-dispersion" to consider.  Since it may make more sense to compare the proportions of gene counts to total reads rather than raw gene counts, one may also apply the same technique to proportions, using a transformation suitable for proportion data.

#include <iostream>
#include <cstdlib>
#include <cmath>

using namespace std;

/* Random Number Generation */

//Generate Poisson values with parameter lambda (Knuth's method, in log space)
int gen_poisson(double lambda){
  double L = -lambda;
  int k = 0;
  double p = 0;

  do{
    k++;
    p += log(drand48());
  }while(p > L);

  return (k-1);
}

//Polar form of the Box-Muller transformation to generate standard Normal random numbers
void gen_normal(double & y1, double & y2){
  double x1, x2, w;

  do {
    x1 = 2.0 * drand48() - 1.0;
    x2 = 2.0 * drand48() - 1.0;
    w = x1 * x1 + x2 * x2;
  } while( w >= 1.0);

  w = sqrt( (-2.0 * log( w ) ) / w );
  y1 = x1 * w;
  y2 = x2 * w;
}

//Generate Gaussian numbers with the given mean and standard deviation.
void gen_gaussian(double mean, double sd, double & y1, double & y2){
  gen_normal(y1, y2);
  y1 = mean + sd*y1;
  y2 = mean + sd*y2;
}

/* Transformations */

//Anscombe transform (note 3.0/8.0 -- the integer division 3/8 would give 0)
double vst_poisson(double k){
  return 2.0*sqrt(k + 3.0/8.0);
}

//Log transform
double vst_exp(double val){
  return log(val);
}

//Binomial VST
double vst_binomial(double n, double N){
  return asin(sqrt(n/N));
}

/*Print Data points*/

//Print normal data values.
void printNormal(){
  for(int i = 0; i < 10000; i++){
    double y1, y2;
    double mean = 20.0*drand48();
    gen_gaussian(mean, 1, y1, y2);
    cout << mean << " " << y1 << " " << y2 << endl;
  }
}

//Print Raw Poisson counts
void printPoisson(){
  for(int i = 0; i < 10000; i++){
    double mean = 300.0*drand48();
    cout << mean << " " << gen_poisson(mean) << " " << gen_poisson(mean) << endl;
  }
}

//Print variance stabilized Poisson data points.
void printVSTPoisson(){
  for(int i = 0; i < 10000; i++){
    double mean = 300.0*drand48();
    cout << vst_poisson(mean) << " " << vst_poisson((double)gen_poisson(mean)) << " " << vst_poisson((double)gen_poisson(mean)) << endl;
  }
}

//Print logged Poisson data points.
void printLogPoisson(){
  for(int i = 0; i < 10000; i++){
    double mean = 300.0*drand48();
    cout << vst_exp(mean) << " " << vst_exp((double)gen_poisson(mean)) << " " << vst_exp((double)gen_poisson(mean)) << endl;
  }
}

//Print variance stabilized Proportion data points.
void printVSTBinomial(){
  double Y1 = 1000000;
  double Y2 = 2000000;

  for(int i = 0; i < 10000; i++){
    double mean = 20000.0*drand48();
    cout << vst_binomial(mean, Y1) << " " << vst_binomial((double)gen_poisson(mean), Y1) << " " << vst_binomial((double)gen_poisson(mean*2.0), Y2) << endl;
  }
}

int main(int argc, const char * argv[]){
  //Call the print function for the plot you want to generate:
  //printNormal();
  //printPoisson();
  //printLogPoisson();
  //printVSTBinomial();
  printVSTPoisson();
  return 0;
}

Sunday, August 7, 2011

Sort FASTQ file by sequence

This tutorial will show you how to sort a FASTQ file by sequence. This is useful when you want to remove/count duplicate sequences for quality analysis or reduce the workload of the alignment stage. The BioStar post has some good suggestions for a similar task, although most would require the file to fit into memory, which is not a safe assumption for large next-gen data.

The FASTQ sort works by piping together 3 steps: (1) "linearize" the FASTQ files, (2) sort the lines using Unix sort, and (3) "de-linearize" the lines back into a sorted FASTQ file. You will need a Perl interpreter and the Unix sort command.

Linearize the FASTQ file

This Perl script will take every 4 lines in the file and print them on one line, separated by tabs. There are a couple of assumptions: (1) the FASTQ sequence and quality fields do not wrap onto more than one line, so that every 4 lines is indeed one FASTQ record, and (2) there are no tabs in the FASTQ file, since we are using the tab as our delimiter.

#!/usr/bin/perl -w

use strict;

my $count = 0;
while(my $line = <STDIN>){
    chomp($line);
    print $line;
    $count = ($count + 1) % 4;
    if($count == 0){
        print "\n";
    }else{
        print "\t";
    }
}
Save the file as mergelines.pl

Sort the sequences

We will use the Unix command

sort --stable -t $'\t' -k2,2

This says to sort the lines based on the second field (the sequence) of the tab-delimited file. I use the --stable flag so that records with the exact same sequence stay in the same relative order as in the original file. This is probably not necessary, and if removed, the sort will most likely run faster. You can also specify the -u flag if you want only unique sequences.

De-linearize the file

After sorting, the FASTQ records need to be written back to the FASTQ format. The following script will split each line into four lines based on the tab delimiters.

#!/usr/bin/perl -w

use strict;

while(my $line = <STDIN>){
    chomp($line);
    my @vals = split(/\t/, $line);
    print $vals[0]."\n";
    print $vals[1]."\n";
    print $vals[2]."\n";
    print $vals[3]."\n";
}
Save the file as splitlines.pl

Run the program

The program works by streaming the input FASTQ file into the mergelines perl script, sorting the lines based on the sequence field, and splitting the lines using the splitlines perl script.

Assuming your FASTQ file is called test.fq, it would work as follows

cat test.fq | perl mergelines.pl | sort --stable -t $'\t' -k2,2 | perl splitlines.pl > sorted.fq


> head -12 e_coli_1000.fq

cat e_coli_1000.fq | perl mergelines.pl | sort --stable -t $'\t' -k2,2 | perl splitlines.pl > sorted.fq

> head -12 sorted.fq

Bonus! Sorting SAM files with Unix

The next few lines show how to sort a SAM file based on sequence, read name, or genomic position.  The header (lines beginning with "@") is separated out, the remainder of the file is sorted, and the header and body are concatenated back together to form the sorted SAM file.

#Edit the next two lines with the name of the input SAM file and the desired name of the sorted output SAM file.
SAM=input.sam
OUTPUT=sorted.sam

#Sort by sequence
awk '/^@@*/' $SAM > header.txt
awk '!/^@@*/' $SAM | sort --stable -t $'\t' -k10,10 | cat header.txt - > $OUTPUT

#Sort by read name
awk '/^@@*/' $SAM > header.txt
awk '!/^@@*/' $SAM | sort --stable -t $'\t' -k1,1 | cat header.txt - > $OUTPUT

#Sort by genomic position
awk '/^@@*/' $SAM > header.txt
awk '!/^@@*/' $SAM | sort --stable -t $'\t' -k3,3 -k4,4n | cat header.txt - > $OUTPUT

Saturday, August 6, 2011

Side-by-side windows on a Mac

I have a friend who uses a Windows machine at work, and an Apple laptop at home. She told me about a new feature in Windows 7 that allows you to snap windows to the left half or the right half of the screen, giving you the ability to view two windows side-by-side. Being a proud Apple user, I told her that I could give her a similar feature on a Mac. Below are my instructions for Snow Leopard 10.6.7 (I will update these instructions for Lion if need be), using a couple of AppleScripts and the Automator. I found this link very helpful when figuring out how to create a keyboard shortcut of an AppleScript.

Desired Outcome:

By holding down the key combination Shift-Command-Comma or Shift-Command-Period, you will be able to snap the active window to the left half or to the right half of the screen, respectively. This will allow you to show two windows side-by-side, which is nice for comparing two documents (or for having the baseball game scores and your work document open side-by-side).

Step 1 - Open Automator

Automator is located in Macintosh HD > Applications. It has a cute little icon of a robot holding a pipe (representing workflows, I guess. Or maybe it is a bazooka). Double-click that little guy.

Step 2 - Select "Service"

The first thing you see should be a dialog for choosing a template. Choose "Service" (the gear icon) and click Choose.

Step 3 - Set the Input

We want to be able to launch the AppleScripts no matter where we are or which application we are currently running. Therefore, you need to set "Service receives selected" to "no input" in the drop-down combo-box. You may leave the other combo-box as "any application".

Step 4 - Insert the AppleScript

In the search box with the magnifying glass, type "Run AppleScript". This should give you one hit. Drag that action into the large gray workspace which says "Drag actions or files here to build your workflow".

Delete the line that says (* Your script goes here*), and copy and paste the following code into that section:

tell application "Finder"
set screen_resolution to bounds of window of desktop
set screen_width to item 3 of screen_resolution
set screen_height to item 4 of screen_resolution
end tell

tell application "System Events"
set frontmostApplication to name of the first process whose frontmost is true
end tell

tell application frontmostApplication
set the bounds of the first window to {0, 0, screen_width / 2, screen_height}
end tell

You can test it out by hitting the green play button. It should snap your current window (which in this case is Automator) to the left half of your screen. If you want to leave some borders, you can modify the bounds of the windows in the script.

Step 5 - Save as Service.

Choose File > Save as ...
Give it a name, such as Tile_Left

Step 6 - Create a Tile_Right script

Go to File > New.
Repeat steps 2 through 5. This time, you will insert the following code:

tell application "Finder"
set screen_resolution to bounds of window of desktop
set screen_width to item 3 of screen_resolution
set screen_height to item 4 of screen_resolution
end tell

tell application "System Events"
set frontmostApplication to name of the first process whose frontmost is true
end tell

tell application frontmostApplication
set the bounds of the first window to {screen_width / 2, 0, screen_width, screen_height}
end tell

Save the file as a service. I called it Tile_Right.

After this, you may quit Automator. Almost done!

Step 7 - Assign keyboard shortcuts to Tile_Left and Tile_Right

Open System Preferences (go to the Apple icon in the top left corner and select "System Preferences...") and choose Keyboard > Keyboard Shortcuts > Services.

In the right window, scroll down until you see your new services. Mine show up at the bottom under General. I called mine "Tile_Left" and "Tile_Right". Make sure the check marks are on for both of them.

To assign a keyboard shortcut, highlight the action by clicking it once. Then click once on the right side of the highlighted region, where the key commands will go (I know this is a confusing explanation - see the picture for some guidance). Enter the key combination you would like to use. Note that some key combinations cannot be used or are already taken. I mapped Tile_Left to Shift-Command-Comma and Tile_Right to Shift-Command-Period. Just hold down those three keys and the symbols should show up to the right of the action.

Setting up keyboard shortcuts for AppleScript actions

Step 8 - Try it out

Try out the new keyboard shortcuts. The active window should snap to either the left or the right half of the screen. Post some replies if you are having trouble, and I will try to answer them. Hope you like it!

Sunday, July 10, 2011

BAMseek - large file viewer

Okay, time to get down to brass tacks (don't worry, even as a native English speaker, I don't know what that means).  The first thing I will talk about (shameless plug) is a piece of software I have been working on as a hobby called BAMseek.  Bioinformatics has a lot of file formats - check out here to see a few.  With the large data sizes produced by next-generation sequencing, efficient and compact storage of the data is required.  This makes the files difficult for most people to view because of their size (tens or even hundreds of gigabytes) and their binary format.  I created a file viewer for these large, binary files to let people browse inside those gigantic, cryptic files.  Right now, BAMseek supports SAM/BAM, VCF, FASTQ, and SFF.  It indexes the file into pages of 1000 lines each and lets you jump to any page in the file.  Let me know how the tool works for you.

First Post!

So, I decided to start a blog.  As the name suggests, I want to write about interesting things I see in the field of bioinformatics - tools, software, algorithms, papers, . . .  It is mainly a place where I can get down on paper (well, uh, electronic paper) some of the things I am currently thinking about.  I hope others can find this useful too.  I am a software developer who has been working for the past 3 years on next-generation sequencing data.  So let me know if you find this site useful or not, or just want to say "hi".  Thanks for visiting and enjoy your day!