Sunday, August 7, 2011

Sort FASTQ file by sequence

This tutorial will show you how to sort a FASTQ file by sequence. This is useful when you want to remove/count duplicate sequences for quality analysis or reduce the workload of the alignment stage. The BioStar post has some good suggestions for a similar task, although most would require the file to fit into memory, which is not a safe assumption for large next-gen data.

The FASTQ sort works by piping together 3 steps: (1) "linearize" the FASTQ files, (2) sort the lines using Unix sort, and (3) "de-linearize" the lines back into a sorted FASTQ file. You will need a Perl interpreter and the Unix sort command.

Linearize the FASTQ file

This perl script will take every 4 lines in the file and print them on one line, separated with tabs. There are a couple of assumptions: (1) The FASTQ sequence and quality fields do not wrap around to more than one line so that every 4 lines is indeed a FASTQ record, and (2) there are no tabs in the FASTQ files since we are using that as our delimiter.


#!/usr/bin/perl -w

#mergelines.pl

use strict;

my $count = 0;
while(my $line = <STDIN>){
    chomp($line);
    print $line;
    $count = ($count + 1) % 4;
    
    if($count == 0){
        print "\n"
    }else{
        print "\t";
    }
}



Save the file as mergelines.pl

Sort the sequences

We will use the Unix command

sort --stable -t $'\t' -k2,2

This says to sort the lines based on the second field (sequences) of the tab delimited file. I use the flag --stable so that records with the exact same sequence will be in the same relative order as they are in the original file. This is probably not necessary and if removed, the sort will most likely run faster. You can also specify the "-u" flag if you want only unique sequences.

De-linearize the file

After sorting, the FASTQ records need to be written back to the FASTQ format. The following script will split each line into four lines based on the tab delimiters.


#!/usr/bin/perl -w

#splitlines.pl


use strict;

my $count = 0;
while(my $line = <STDIN>){
    chomp($line);
    my @vals = split(/\t/, $line);
    
    print $vals[0]."\n";
    print $vals[1]."\n";
    print $vals[2]."\n";
    print $vals[3]."\n";    
}

Save the file as splitlines.pl

Run the program

The program works by streaming the input FASTQ file into the mergelines perl script, sorting the lines based on the sequence field, and splitting the lines using the splitlines perl script.

Assuming your FASTQ file is called test.fq, it would work as follows

cat test.fq | perl mergelines.pl | sort --stable -t $'\t' -k2,2 | perl splitlines.pl


Example


> head -12 e_coli_1000.fq
@r0
GAACGATACCCACCCAACTATCGCCATTCCAGCAT
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r1
CCGAACTGGATGTCTCATGGGATAAAAATCATCCG
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r2
TCAAAATTGTTATAGTATAACACTGTTGCTTTATG
+
EDCCCBAAAA@@@@?>===<;;9:99987776554

cat e_coli_1000.fq | perl mergelines.pl | sort --stable -t $'\t' -k2,2 | perl splitlines.pl > sorted.fq


> head -12 sorted.fq
@r97
AAAAACCAGTTTAAGTTGGTTATTTCTAATCGCTT
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r602
AAAAGTGAAGCCGAAAAACGCGTAATNAGGNCAAA
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r699
AAAATAATCCATGACATAATGCCCACTGTTTGNTC
+
EDCCCBAAAA@@@@?>===<;;9:99987776554

Bonus! Sorting SAM files with Unix

The next few lines show how to sort a SAM file based on sequence, read name, or genomic position.  The header (lines beginning wit "@") is removed, the remainder of the file is sorted, and the header and body are concatenated back together to form the sorted SAM file.

#Edit the next two lines with the name of the input SAM file and the desired name of the sorted output SAM file.
SAM=input.sam
OUTPUT=output.sam

#Sort by sequence
awk '/^@@*/' $SAM > header.txt
awk '!/^@@*/' $SAM | sort --stable -t $'\t' -k10,10 | cat header.txt - > $OUTPUT

#Sort by read name
awk '/^@@*/' $SAM > header.txt
awk '!/^@@*/' $SAM | sort --stable -t $'\t' -k1,1 | cat header.txt - > $OUTPUT

#Sort by genomic position
awk '/^@@*/' $SAM > header.txt
awk '!/^@@*/' $SAM | sort --stable -t $'\t' -k3,3 -k4,4n | cat header.txt - > $OUTPUT


3 comments:

  1. You can do away with the two Perl scripts and use the Unix commands "paste" and "tr" instead. It would work like this:

    paste - - - - < my.fastq | sort --stable -t $'\t' -k2,2 | tr '\t' '\n'

    The "paste" will put the 4 lines of each FASTQ record into a 4-column tab-delimited format. The "tr" converts tabs back to newlines and the standard FASTQ format. These will be much, much faster than your Perl script.

    In addition, your version of "sort" may support parallelisation and the allocation of more memory per process e.g. on my big-memory machine with 64 cores I could do this:

    paste - - - - < my.fastq | sort --parallel 20 --buffer-size 5G --stable -t $'\t' -k2,2 | tr '\t' '\n'

    ReplyDelete
    Replies
    1. In fact, I have a blog post about this here: http://nathanhaigh.github.io/linux/2014/11/14/Unix-paste/

      Delete
  2. Hi. What is the time for sorting a 100MB file?

    ReplyDelete