The FASTQ sort works by piping together 3 steps: (1) "linearize" the FASTQ files, (2) sort the lines using Unix sort, and (3) "de-linearize" the lines back into a sorted FASTQ file. You will need a Perl interpreter and the Unix sort command.
Linearize the FASTQ file
This perl script will take every 4 lines in the file and print them on one line, separated with tabs. There are a couple of assumptions: (1) The FASTQ sequence and quality fields do not wrap around to more than one line so that every 4 lines is indeed a FASTQ record, and (2) there are no tabs in the FASTQ files since we are using that as our delimiter.#!/usr/bin/perl -w
#mergelines.pl
use strict;
my $count = 0;
while(my $line = <STDIN>){
chomp($line);
print $line;
$count = ($count + 1) % 4;
if($count == 0){
print "\n";
}else{
print "\t";
}
}
Save the file as mergelines.pl
Sort the sequences
We will use the Unix commandsort --stable -t $'\t' -k2,2
This says to sort the lines based on the second field (sequences) of the tab delimited file. I use the flag --stable so that records with the exact same sequence will be in the same relative order as they are in the original file. This is probably not necessary and if removed, the sort will most likely run faster. You can also specify the "-u" flag if you want only unique sequences.
De-linearize the file
After sorting, the FASTQ records need to be written back to the FASTQ format. The following script will split each line into four lines based on the tab delimiters.#!/usr/bin/perl -w
#splitlines.pl
use strict;
my $count = 0;
while(my $line = <STDIN>){
chomp($line);
my @vals = split(/\t/, $line);
print $vals[0]."\n";
print $vals[1]."\n";
print $vals[2]."\n";
print $vals[3]."\n";
}
Save the file as splitlines.pl
Run the program
The program works by streaming the input FASTQ file into the mergelines perl script, sorting the lines based on the sequence field, and splitting the lines using the splitlines perl script.Assuming your FASTQ file is called test.fq, it would work as follows
cat test.fq | perl mergelines.pl | sort --stable -t $'\t' -k2,2 | perl splitlines.pl
Example
> head -12 e_coli_1000.fq
@r0
GAACGATACCCACCCAACTATCGCCATTCCAGCAT
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r1
CCGAACTGGATGTCTCATGGGATAAAAATCATCCG
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r2
TCAAAATTGTTATAGTATAACACTGTTGCTTTATG
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
cat e_coli_1000.fq | perl mergelines.pl | sort --stable -t $'\t' -k2,2 | perl splitlines.pl > sorted.fq
> head -12 sorted.fq
@r97
AAAAACCAGTTTAAGTTGGTTATTTCTAATCGCTT
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r602
AAAAGTGAAGCCGAAAAACGCGTAATNAGGNCAAA
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r699
AAAATAATCCATGACATAATGCCCACTGTTTGNTC
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
Bonus! Sorting SAM files with Unix
The next few lines show how to sort a SAM file based on sequence, read name, or genomic position. The header (lines beginning wit "@") is removed, the remainder of the file is sorted, and the header and body are concatenated back together to form the sorted SAM file.#Edit the next two lines with the name of the input SAM file and the desired name of the sorted output SAM file.
SAM=input.sam
OUTPUT=output.sam
#Sort by sequence
awk '/^@@*/' $SAM > header.txt
awk '!/^@@*/' $SAM | sort --stable -t $'\t' -k10,10 | cat header.txt - > $OUTPUT
#Sort by read name
awk '/^@@*/' $SAM > header.txt
awk '!/^@@*/' $SAM | sort --stable -t $'\t' -k1,1 | cat header.txt - > $OUTPUT
#Sort by genomic position
awk '/^@@*/' $SAM > header.txt
awk '!/^@@*/' $SAM | sort --stable -t $'\t' -k3,3 -k4,4n | cat header.txt - > $OUTPUT
You can do away with the two Perl scripts and use the Unix commands "paste" and "tr" instead. It would work like this:
ReplyDeletepaste - - - - < my.fastq | sort --stable -t $'\t' -k2,2 | tr '\t' '\n'
The "paste" will put the 4 lines of each FASTQ record into a 4-column tab-delimited format. The "tr" converts tabs back to newlines and the standard FASTQ format. These will be much, much faster than your Perl script.
In addition, your version of "sort" may support parallelisation and the allocation of more memory per process e.g. on my big-memory machine with 64 cores I could do this:
paste - - - - < my.fastq | sort --parallel 20 --buffer-size 5G --stable -t $'\t' -k2,2 | tr '\t' '\n'
In fact, I have a blog post about this here: http://nathanhaigh.github.io/linux/2014/11/14/Unix-paste/
DeleteHi. What is the time for sorting a 100MB file?
ReplyDelete