Comments on Bridgecrest Bioinformatics: Sort FASTQ file by sequence
Feed author: Justin
Feed updated: 2023-10-20T06:28:04.206-07:00

Guru (2015-11-12T02:08:02.474-08:00):
Hi. What is the time for sorting a 100MB file?

Nathan Watson-Haigh (2015-02-10T15:23:54.014-08:00):
In fact, I have a blog post about this here: http://nathanhaigh.github.io/linux/2014/11/14/Unix-paste/

Nathan Watson-Haigh (2015-02-10T15:22:19.563-08:00):
You can do away with the two Perl scripts and use the Unix commands "paste" and "tr" instead. It would work like this:

    paste - - - - < my.fastq | sort --stable -t $'\t' -k2,2 | tr '\t' '\n'

The "paste" will put the 4 lines of each FASTQ record into a 4-column tab-delimited format. The "tr" converts tabs back to newlines and the standard FASTQ format. These will be much, much faster than your Perl script.

In addition, your version of "sort" may support parallelisation and the allocation of more memory per process e.g. on my big-memory machine with 64 cores I could do this:

    paste - - - - < my.fastq | sort --parallel 20 --buffer-size 5G --stable -t $'\t' -k2,2 | tr '\t' '\n'
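
The paste/sort/tr pipeline described in the comments can be wrapped in a small shell function. This is only a minimal sketch, not the blog author's script. The function name sort_fastq_by_sequence and the demo reads are made up for illustration. It assumes strict 4-line FASTQ records (no wrapped sequence or quality lines) and no tab characters inside any line. It uses "sort -s" (the short form of --stable) and builds the tab with printf so that it does not depend on bash-only quoting. LC_ALL=C is an added assumption that forces plain byte-order comparison.

```shell
# Hypothetical helper: sort a 4-line-per-record FASTQ file by its sequence line.
sort_fastq_by_sequence() {
    # A literal tab character, used as the field separator for sort.
    fq_tab=$(printf '\t')
    # 1. paste joins each group of 4 lines into one tab-separated row.
    # 2. sort orders the rows by column 2 (the sequence); -s keeps reads
    #    with identical sequences in their original input order.
    # 3. tr turns the tabs back into newlines, restoring the FASTQ layout.
    paste - - - - < "$1" \
        | LC_ALL=C sort -s -t "$fq_tab" -k2,2 \
        | tr '\t' '\n'
}

# Demo on three made-up reads (illustrative data only).
demo_fastq=$(mktemp)
printf '%s\n' \
    '@r1' 'TTGA' '+' 'IIII' \
    '@r2' 'ACGT' '+' 'IIII' \
    '@r3' 'GGCA' '+' 'IIII' > "$demo_fastq"
# Prints the records in the order r2 (ACGT), r3 (GGCA), r1 (TTGA).
sort_fastq_by_sequence "$demo_fastq"
rm -f "$demo_fastq"
```

On the timing question above: no figure is given in the thread, but in bash you could measure it on your own data with something like "time sort_fastq_by_sequence my.fastq > sorted.fastq" (sorted.fastq is a placeholder output name).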