Last week I looked at the theoretical side of comparing two technical replicates from next-generation sequencing experiments. If all the variation can be accounted for by the randomness of sampling, then the read counts of a gene across technical replicates should follow a Poisson distribution. By plotting the variance stabilized transformation for Poisson data, one can get a qualitative feel if the data is indeed Poisson distributed - the plot should be homoscedastic (equal variance across the entire range of values) with a variance within the expected range.

This week I decided to try the transformations and plots out on real data and see what I get. I used the data from the Marioni paper that discusses reproducibility among technical replicates. The data consists of technical replicates of both liver and kidney at various concentrations. I used their available aligned data, and counted (using BEDTools) the number of reads that fell within each gene from the Hg18 knownGene annotation (I can provide some instructions and scripts of this if anyone is interested).

**Raw Counts**

Below is a comparison of the theoretical plots generated last time versus the plots of two Kidney technical replicates with the same concentrations for each sample:

**Theoretical**Raw Counts

**Actual**Raw Counts

Looks pretty similar to me. As expected, variance grows with larger read counts (for Poisson distributed variables, the mean is equal to the variance). There are more points plotted in the the theoretical plot, but the main thing to look at is the shape of the plots.

**Anscombe Transform**

Here I transform the data using the Anscombe Transform, which stabilizes variance for Poisson data. This transform will make the variance approximately equal to 1 across the entire range of values. The different shades of red represent the standard deviations:

white = +/- 1 stdev

light red = +/- 2 stdev

medium red = +/- 3 stdev

dark red > +/- 3 stdev).

Variance Stabilization of

**Theoretical**Counts using the Anscombe transform

Variance Stabilization of

**Actual**Counts using the Anscombe transform

**ArcSine Transform**

Variance Stabilization of

**Theoretical**Proportions using ArcSine transformation

Variance Stabilization of

**Actual**Proportions using ArcSine transformation

It seems that the theoretical plots match up quite nicely with the actual plots. The points are between +/- 3 standard deviations (represented as the lighter shades of red), and the variance is pretty much uniform across the range of values (note: the variance seems to decrease slightly for larger values, but I am not sure if it just appears that way since the points are less dense at higher values).

Lets compare the above plots to the log-log plot below:

This log-log plot also shows that there is good correlation between the two samples, but it would be difficult to conclude whether or not the variability of read counts between samples can be accounted for by randomness of sampling alone. On the other hand, in the above plots, checking whether or not the points are within the expected variance range and if the variance is uniform across all values is a good indication if the data fit a Poisson distribution or not.

**Technical Replicates with Different Number of Total Counts**

I also compared technical replicates of Kidney that had different concentrations. In the above example, the concentrations were the same and the total number of reads are roughly the same. Therefore, looking at the transformation of the raw counts or the proportions could both be used. However, here we see an example where total read counts are not the same. This is where looking at proportions comes in handy, where we can look at the relative proportion of read counts rather than absolute values.

Below is a plot of the raw counts with the Anscombe transform. Notice that it veers off of the y=x line, as expected for samples with a different number of total reads.

Instead, we can look at proportions and use the arcsine transformation to stabilize variance:

It still looks like it veers off of the y=x line slightly, but it seems to work a little better than comparing the raw counts.

**Kidney versus Liver**

And just for fun, let's compare the kidney sample to the liver sample. Quite a lot of differences between the two!