Monday, October 15, 2012

Fun with Bacteria

Recently, I've taken an interest in metagenomics, which involves identifying micro-organisms directly from environmental samples (such as the ocean, soil, human gut, and even the belly button).  The identification of the organisms can be accomplished by reading the DNA within the environmental sample and determining which type of organism the DNA came from.  The advantage of this approach is that you can study micro-organisms that cannot be easily cultured in the laboratory, but the disadvantage is that all the DNA is mixed together, and it is a challenge to identify which species are represented.

One approach for identifying different varieties of bacteria is to look at a specific sequence of roughly 1,500 nucleotides, the gene for 16S ribosomal RNA.  This sequence is ubiquitous in bacteria but variable enough that different species have slightly different versions of it.  These sequences can be captured from a sample, read, and used as barcodes to identify which types of bacteria are present and in what abundance.

I thought it would be an interesting exercise to plot the dissimilarities of these sequences.  Sequences that are closer to each other would indicate species that have diverged from each other more recently.  I was curious to see the relationships among these different bacterial species.  Data was obtained from the Human Oral Microbiome Database.  I used multidimensional scaling (MDS) to plot the data in a low-dimensional space that could be visualized.  The metric I used to calculate distances between sequences was the Levenshtein (edit) distance, which is the minimum number of edits (substitutions, deletions and insertions) needed to transform one sequence into the other.
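As a rough illustration of that pipeline, here is a minimal sketch in base R.  The four short sequences are made up for the example (they are not from the Human Oral Microbiome Database), and adist() from the utils package stands in for whatever tool was actually used to compute the edit distances.

```r
# Toy sequences, invented for illustration only
seqs <- c("ACGTACGT", "ACGTTCGT", "ACGACGT", "TTTTACGG")

# Pairwise Levenshtein (edit) distances; adist() returns a square matrix
d <- adist(seqs)

# Classical multidimensional scaling of the distance matrix into 2 dimensions
fit <- cmdscale(d, eig = TRUE, k = 2)

# One row of coordinates per sequence
print(d)
print(fit$points)
```

The first sequence differs from the second by one substitution and from the third by one deletion, so both of those distances are 1.  Real 16S sequences are about 1,500 bases long, and computing all pairwise edit distances for a large set can take a while.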

I color-coded the data points by taxonomic rank (Domain, Phylum, Class, Order, Family, Genus).  As one would expect, species that share a rank tend to be closer together on the plots.  There seem to be three main clusters: (1) the Bacteroidetes, (2) the Spirochaetes, and (3) everything else.  The "everything else" group seems to contain clusters too, just not as well separated as the main three.  From the 3D plot, you can see that the Proteobacteria (Betaproteobacteria and Gammaproteobacteria) separate relatively well from the rest of cluster 3.

Here is the MDS plot colored by Class (2D):

and also 3D:

Also, here is the same plot colored by some of the other taxonomic ranks:


* plots generated in Partek Genomics Suite

R code used to calculate multi-dimensional scaling:

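# data.txt is assumed to hold the square matrix of pairwise edit distances
ribo <- read.table("data.txt", header=TRUE)
# classical MDS on the distance matrix, keeping three dimensions
fit <- cmdscale(as.matrix(ribo), eig=TRUE, k=3)
# coordinates of each sequence along the three MDS axes
x <- fit$points[,1]
y <- fit$points[,2]
z <- fit$points[,3]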
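
The plots in this post were made in Partek Genomics Suite, but a similar 2D view can be sketched in base R.  The snippet below is only an illustration: the coordinates and the Class labels are hypothetical stand-ins, not values from the actual analysis.

```r
# Hypothetical stand-ins: toy MDS coordinates and made-up Class assignments
coords <- matrix(c(-2.1,  0.3,
                   -1.8, -0.4,
                    1.9,  0.5,
                    2.0, -0.4), ncol = 2, byrow = TRUE)
tax_class <- factor(c("Bacteroidia", "Bacteroidia",
                      "Spirochaetia", "Spirochaetia"))

# Map each Class level to an integer color code
cols <- as.integer(tax_class)

# Scatter plot of the first two MDS dimensions, colored by Class
plot(coords[, 1], coords[, 2], col = cols, pch = 19,
     xlab = "MDS dimension 1", ylab = "MDS dimension 2")
legend("topright", legend = levels(tax_class),
       col = seq_along(levels(tax_class)), pch = 19)
```

With real data, coords would be fit$points and tax_class would come from the taxonomy attached to each sequence.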
