High Performance Algorithms for K-mer Counting and Genomic Read Overlap Finding
MetadataShow full item record
Advancements in genomics are enabling a deeper understanding of how human body works and bringing us a new era of personalized healthcare. Genome sequencing and genome assembly are important procedures that facilitate genomic understanding whose quality affect nearly all downstream analysis. Modern high throughput sequencers could generate a few billion DNA subsequences with high accuracy in one experiment, proposing challenges to existing genome assembly pipelines that could only processes millions of reads per week. For the de Bruijn graph (DBG) based pipelines whose running time do not increase much with the number of input reads, efficient k-mer counting that is stable under the presence of repeats can serve as a useful heuristic for genome reconstruction. For Overlap-Layout-Concensus (OLC) pipelines, efficient sequence overlap finding is the first and most computationally intensive step in the process that must be improved to accommodate the huge amount of reads. To address these problems, this work presents a new de novo k-mer counting method that utilizes read pairs double matching, and a new approach to constructing the sequence overlap graphs based on locality sensitive hashing (LSH). We provide theoretical performance estimation of the latter method, followed by experimental verifications on simulated read data (large dataset contains about 1.25 billion reads). Our approach is the first overlap finder that could construct billion-scale overlap graphs. The experiment results for the de novo k-mer counting algorithm indicate it can provide a quite robust upper bound of k-mer occurrences. The experiment results for the overlap finding method suggest that our approach is at least 24 times faster (on the billion node size overlapping graph) and more flexible than one of the most influential overlap finders currently available, Minimap.