• Login
    View Item 
    •   SMARTech Home
    • Undergraduate Research Opportunities Program (UROP)
    • Undergraduate Research Option Theses
    • View Item
    •   SMARTech Home
    • Undergraduate Research Opportunities Program (UROP)
    • Undergraduate Research Option Theses
    • View Item
    JavaScript is disabled for your browser. Some features of this site may not work without it.

    High Performance Algorithms for K-mer Counting and Genomic Read Overlap Finding

    Thumbnail
    View/Open
    ZHU-UNDERGRADUATERESEARCHOPTIONTHESIS-2017.pdf (615.4Kb)
    Date
    2017-12
    Author
    Zhu, Shaowei
    Metadata
    Show full item record
    Abstract
    Advancements in genomics are enabling a deeper understanding of how human body works and bringing us a new era of personalized healthcare. Genome sequencing and genome assembly are important procedures that facilitate genomic understanding whose quality affect nearly all downstream analysis. Modern high throughput sequencers could generate a few billion DNA subsequences with high accuracy in one experiment, proposing challenges to existing genome assembly pipelines that could only processes millions of reads per week. For the de Bruijn graph (DBG) based pipelines whose running time do not increase much with the number of input reads, efficient k-mer counting that is stable under the presence of repeats can serve as a useful heuristic for genome reconstruction. For Overlap-Layout-Concensus (OLC) pipelines, efficient sequence overlap finding is the first and most computationally intensive step in the process that must be improved to accommodate the huge amount of reads. To address these problems, this work presents a new de novo k-mer counting method that utilizes read pairs double matching, and a new approach to constructing the sequence overlap graphs based on locality sensitive hashing (LSH). We provide theoretical performance estimation of the latter method, followed by experimental verifications on simulated read data (large dataset contains about 1.25 billion reads). Our approach is the first overlap finder that could construct billion-scale overlap graphs. The experiment results for the de novo k-mer counting algorithm indicate it can provide a quite robust upper bound of k-mer occurrences. The experiment results for the overlap finding method suggest that our approach is at least 24 times faster (on the billion node size overlapping graph) and more flexible than one of the most influential overlap finders currently available, Minimap.
    URI
    http://hdl.handle.net/1853/60345
    Collections
    • School of Computer Science Undergraduate Research Option Theses [205]
    • Undergraduate Research Option Theses [862]

    Browse

    All of SMARTechCommunities & CollectionsDatesAuthorsTitlesSubjectsTypesThis CollectionDatesAuthorsTitlesSubjectsTypes

    My SMARTech

    Login

    Statistics

    View Usage StatisticsView Google Analytics Statistics
    facebook instagram twitter youtube
    • My Account
    • Contact us
    • Directory
    • Campus Map
    • Support/Give
    • Library Accessibility
      • About SMARTech
      • SMARTech Terms of Use
    Georgia Tech Library266 4th Street NW, Atlanta, GA 30332
    404.894.4500
    • Emergency Information
    • Legal and Privacy Information
    • Human Trafficking Notice
    • Accessibility
    • Accountability
    • Accreditation
    • Employment
    © 2020 Georgia Institute of Technology