Sunday, August 28, 2011

The Use of Log-Log plots to show technical reproducibility of NGS data

Reproducibility, the ability for the same input to produce the same output, is an important concept in the sciences.  One of the claimed advantages of next-generation sequencing over its microarray predecessor is the high correlation among technical replicates.  One common way to show the reproducibility between two replicates is a log-log plot of RPKM values (RPKM normalizes number of reads by transcript length and mapped reads).  I would like to propose an alternative to the log-log plot that may have a clearer interpretation and may be more appropriate for read count data.

Log-Log RPKM plots
Below are some links showing the log-log RPKM plots between two samples.  The log-log plot is created by counting the number of reads that are mapped to the genes in Sample A and Sample B.  These reads are then transformed by first creating an RPKM value (normalize by transcript length and number of mapped reads) and taking the log of those RPKM values.  The logged RPKM gene values of Sample B vs. Sample A are plotted.

Links to Pictures of Log-Log RPKM plots between Technical Replicates
Figure 2A
FIgure 4B

MA Plots
My guess is that the RPKM values are logged because that is what is done to microarray expression values.  A common plot that is used to compare two microarray samples is the MA plot, which plots the difference between logged expression values against the mean of the logged expression values.  It would seem reasonable to try this same plot for Next Generation Sequencing data, and you can find such plots (e.g. Figure 5).

The tear-drop effect
The log-log RPKM plots share a common shape - they start out wide and get thinner and thinner as you move along the x-axis.  To me, it sort of looks like a leaf or tear-drop.  One might try to conclude from this plot that the variance is smaller for genes with higher number of reads.  Judging by a talk given in a recent Illumina User Group meeting, a common question is why the variance is seemingly higher for low expression values.  To throw in some statistical terms for good measure, people are expecting a homoscedastic plot (same variance across entire range of values) instead of a heteroscedastic plot (differing variance).

My contention is that it is not possible to draw a conclusion about how the variance relates to expression levels from the Log-Log RPKM plot.  The tear-drop shape may just be an artifact of logging the RPKM values.  Lots of data look really good (i.e. fits a line) when you log it.  For example, let's say you have the data points (1,10) and (10 thousand, 100 thousand).  Logging the values (base 10) gives (0,1) and (4,5), which show up as being equally distant from the y=x line.

If the data looked homoscedastic after logging the values, then the data would have a log-normal distribution.  This distribution seems to be a good fit for microarray data (see Durbin).  However, the tear-drop shape of the log-log RPKM counts may be the result of not using the right model - it has been shown (Marioni) that the variation across technical replicates can be captured using a Poisson model.  So why not use an appropriate transformation based on the Poisson model???

Variance Stabilization for Count data
Luckily, there is a transformation you can apply to Poisson generated count data which will stabilize the variance across all values of the data: the square root.  This will magically make the variance about 1/4, irrespective of the mean (although the approximation is more accurate for higher means).  A slightly more complex transformation is the Anscombe transform, which will make the variance approximately 1 and is a good approximation for mean larger than 4.

Simulated Data
Normal Data
Below is a plot of homoscedastic data.  I randomly chose a mean "mu" between 0 and 20.  I then generated y1 ~ N(mu, 1) and y2 ~ N(mu, 1) and plotted (y1, y2).

The distance a point is from the line y=x is

(y2-y1) ~ N(0, 2).

Note that the difference of two normally distributed variables has a normal distribution with mean equal to the difference of the two means and a variance equal to the sum of the two variances.  The red bands in the above picture represent +/- 1 standard deviation, +/- 2 standard deviations, and +/- 3 standard deviations, where the standard deviation is sqrt(2).

Distance from the point to the line is (y2-y1), which will have a distribution of N(0,2) about the y=x line.

Poisson Data (untransformed)
Below is some raw Poisson data (no transformation).  In Poisson data, the mean is equal to the variance so the spread of points increases for larger values of x and y.

Poisson Data (Log Transformed)
Here is a log-transfromed plot of the Poisson values.  Notice how the variance gets smaller for increasing values of x and y.  This is because the log-transform is too aggressive for the Poisson distribution.

Poisson (Anscombe Transform)
With the Anscombe Transform, the variance is stabilized and the plot looks more homoscedastic.  Deviations from homoscedasticity would indicate that the data does not fit the Poisson distribution very well.

Arcsine Transformation of Proportions:
When comparing two next generation samples, one might be more interested in whether or not the proportion of (reads / total reads) is the same across samples rather than the number of raw reads.  This is similar to dividing by total mapped reads in the RPKM calculation.  A variance stabilization method for proportions is the arcsine of the square-root of the proportion.  The variance of n/N after the transformation is approximately equal to  1/(4N).

Let Y_i be the total number of mapped reads in sample i and let y_gi be the number of reads that map to gene g in sample i.  Given two samples, we would like to see if the portion of reads that map to gene g in sample 1 (y_g1/Y1) equals the proportion of reads that map to gene g in sample 2 (y_g2/Y2).

Apply the variance stabilization transformation to get
x_g1 = arcsine(sqrt(y_g1/Y1));
x_g2 = arcsine(sqrt(y_g2/Y2));

Then, (x2 - x1) ~ N(0, (1/Y1 + 1/Y2)/4).
Below is plot where counts y_g1 and y_g2 were generated from a Poisson distribution with parameters mu and (Y2/Y1)*mu, respectively.  In this case, Y2 = 2 million, Y1 = 1 million, and standard deviation is about  sqrt((1/Y1 + 1/Y2)/4) = 0.0006.  The plot looks homoscedastic as expected.

The log transform at first seems like an appropriate transformation to compare Next Generation count data, just as microarray expression values were logged before comparing two samples.  However, the reason for logging the data was that it stabilizes variance of expression values from microarray platforms.  Technical replicates of NGS data follows a Poisson distribution (note that biological replicates do not in general follow a Poisson - we are just talking about technical replicates here), and so one should use an appropriate variance stabilization transformation (VST) for a Poisson distribution.  If the data is truly Poisson distributed, then a plot of the VST of Sample 1 versus Sample 2 should be homoskedastic (variance is the same across all values of the data).  If the plot has differing variance, then this would provide qualitative evidence that the data is not Poisson distributed, and there may be some "extra-Poisson" factors or "over-dispersion" to consider.  Since it may make more sense to compare the proportions of gene counts to total reads rather than raw gene counts, one may also apply the same technique to transform proportions, except using a transformation suitable for proportion data.

#include <iostream>
#include <cstdlib>
#include <cmath>

using namespace std;

/* Random Number Generation */

//Generate Poisson values with parameter lambda
int gen_poisson(double lambda){
  double L = -lambda;
  int k = 0;
  double p = 0;

    p += log(drand48());
  }while(p > L);

  return (k-1);

//Polar form of the Box-Muller transformation to generate standard Normal random numbers
void gen_normal(double & y1, double & y2){
  double x1, x2, w;

  do {
    x1 = 2.0 * drand48() - 1.0;
    x2 = 2.0 * drand48() - 1.0;
    w = x1 * x1 + x2 * x2;
  } while( w >= 1.0);

  w = sqrt( (-2.0 * log( w ) ) / w );
  y1 = x1 * w;
  y2 = x2 * w;

//Generate Gaussian numbers from a standard normal distribution.
void gen_gaussian(double mean, double sd, double & y1, double & y2){
  gen_normal(y1, y2);
  y1 = mean + sd*y1;
  y2 = mean + sd*y2;

/* Transformations */

//Anscombe transform
double vst_poisson(double k){
  return 2.0*sqrt(k + 3/8);

//Log transform
double vst_exp(double val){
  return log(val);

//Binomail VST
double vst_binomial(double n, double N){
  return asin(sqrt(n/N));

/*Print Data points*/

//Print normal data values.
void printNormal(){
  for(int i = 0; i < 10000; i++){
    double y1, y2;
    double mean = 20.0*drand48();
    gen_gaussian(mean, 1, y1, y2);
    cout << mean << " " << y1 << " " << y2 << endl;

//Print Raw Poisson counts
void printPoisson(){
  for(int i = 0; i < 10000; i++){
    double mean = 300.0*drand48();
    cout << mean << " " << gen_poisson(mean) << " " << gen_poisson(mean) << endl;



//Print variance stabilized Poisson data points.
void printVSTPoisson(){
  for(int i = 0; i < 10000; i++){
    double mean = 300.0*drand48();
    cout << vst_poisson(mean) << " " << vst_poisson((double)gen_poisson(mean)) << " " << vst_poisson((double)gen_poisson(mean)) << endl;



//Print logged Poisson data points.
void printLogPoisson(){
  for(int i = 0; i < 10000; i++){
    double mean = 300.0*drand48();
    cout << vst_exp(mean) << " " << vst_exp((double)gen_poisson(mean)) << " " << vst_exp((double)gen_poisson(mean)) << endl;



//Print variance stabilized Proportion data points.
void printVSTBinomial(){
  double Y1 = 1000000;
  double Y2 = 2000000;

  for(int i = 0; i < 10000; i++){
    double mean = 20000.0*drand48();
    cout << vst_binomial(mean, Y1) << " " << vst_binomial((double)gen_poisson(mean), Y1) << " " << vst_binomial((double)gen_poisson(mean*2.0), Y2) << endl;


int main(int argc, const char * argv[]){

  return 0;


  1. I think the use of sqrt or Anscombe transformation is very useful and should become the industry standard. Are you planning a to write a short technical note about that? I would encourage to do that, may be in Bioinformatics, or in any of the Genomics journals.


  2. Very nice post.

    Thanks a lot for sharing.