Show simple item record

dc.contributor.advisorAluru, Srinivas
dc.contributor.authorFlick, Patrick
dc.date.accessioned2019-05-29T14:03:12Z
dc.date.available2019-05-29T14:03:12Z
dc.date.created2019-05
dc.date.issued2019-03-29
dc.date.submittedMay 2019
dc.identifier.urihttp://hdl.handle.net/1853/61257
dc.description.abstractMethods for processing and analyzing DNA and genomic data are built upon combinatorial graph and string algorithms. The advent of high-throughput DNA sequencing is enabling the generation of billions of reads per experiment. Classical and sequential algorithms can no longer deal with these growing data sizes - which for the last 10 years have greatly out-paced advances in processor speeds. Processing and analyzing state-of-the-art genomic data sets require the design of scalable and efficient parallel algorithms and the use of large computing clusters. Suffix arrays and trees are fundamental string data structures, which lie at the foundation of many string algorithms, with important applications in text processing, information retrieval, and computational biology. Conversely, the parallel construction of these indices is an actively studied problem. However, prior approaches lacked good worst-case run-time guarantees and exhibit poor scaling and overall performance. In this work, we present our distributed-memory parallel algorithms for indexing large datasets, including algorithms for the distributed construction of suffix arrays, LCP arrays, and suffix trees. We formulate a generalized version of the All-Nearest-Smaller-Values problem, provide an optimal distributed solution, and apply it to the distributed construction of suffix trees - yielding a work-optimal parallel algorithm. Our algorithms for distributed suffix array and suffix tree construction improve the state-of-the-art by simultaneously improving worst-case run-time bounds and achieving superior practical performance. Next, we introduce a novel distributed string index, the Distributed Enhanced Suffix Array (DESA) - based on the suffix and LCP arrays, the DESA consists of these and additional distributed data structures. The DESA is designed to allow efficient pattern search queries in distributed memory while requiring at most O(n/p) memory per process. We present efficient distributed-memory parallel algorithms for querying, as well as for the efficient construction of this distributed index. Finally, we present our work on distributed-memory algorithms for clustering de Bruijn graphs and its application to solving a grand challenge metagenomic dataset.
dc.format.mimetypeapplication/pdf
dc.language.isoen_US
dc.publisherGeorgia Institute of Technology
dc.subjectDistributed memory
dc.subjectSuffix array
dc.subjectSuffix tree
dc.subjectEnhanced suffix arrays
dc.subjectDe Bruijn graphs
dc.subjectDistributed graphs
dc.subjectConnected components
dc.titleParallel and scalable combinatorial string algorithms on distributed memory systems
dc.typeDissertation
dc.description.degreePh.D.
dc.contributor.departmentComputational Science and Engineering
thesis.degree.levelDoctoral
dc.contributor.committeeMemberCatalyurek, Umit V.
dc.contributor.committeeMemberVuduc, Richard W.
dc.contributor.committeeMemberYalamanchili, Sudhakar
dc.contributor.committeeMemberUcar, Bora
dc.date.updated2019-05-29T14:03:12Z


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record