Efficient methods for read mapping
MetadataShow full item record
DNA sequencing is the mainstay of biological and medical research. Modern sequencing machines can read millions of DNA fragments, sampling the underlying genomes at high-throughput. Mapping the resulting reads to a reference genome is typically the first step in sequencing data analysis. The problem has many variants as the reads can be short or long with a low or high error rate for different sequencing technologies, and the reference can be a single genome or a graph representation of multiple genomes. Therefore, it is crucial to develop efficient computational methods for these different problem classes. Moreover, continually declining sequencing costs and increasing throughput pose challenges to the previously developed methods and tools that cannot handle the growing volume of sequencing data. This dissertation seeks to advance the state-of-the-art in the established field of read mapping by proposing more efficient and scalable read mapping methods as well as tackling emerging new problem areas. Specifically, we design ultra-fast methods to map two types of reads: short reads for high-throughput chromatin profiling and nanopore raw reads for targeted sequencing in real-time. In tune with the characteristics of these types of reads, our methods can scale to larger sequencing data sets or map more reads correctly compared with the state-of-the-art mapping software. Furthermore, we propose two algorithms for aligning sequences to graphs, which is the foundation of mapping reads to graph-based reference genomes. One algorithm improves the time complexity of existing sequence to graph alignment algorithms for linear or affine gap penalty. The other algorithm provides good empirical performance in the case of the edit distance metric. Finally, we mathematically formulate the problem of validating paired-end read constraints when mapping sequences to graphs, and propose an exact algorithm that is also fast enough for practical use.