Enhance the understanding of whole-genome evolution by designing, accelerating and parallelizing phylogenetic algorithms
MetadataShow full item record
The advent of new technology enhance the speed and reduce the cost for sequencing biological data. Making biological sense of this genomic data is a big challenge to the algorithm design as well as the high performance computing society. There are many problems in Bioinformatics, such as how new functional genes arise, why genes are organized into chromosomes, how species are connected through the evolutionary tree of life, or why arrangements are subject to change. Phylogenetic analyses have become essential to research on the evolutionary tree of life. It can help us to track the history of species and the relationship between different genes or genomes through millions of years. One of the fundamentals for phylogenetic construction is the computation of distances between genomes. Since there are much more complicated combinatoric patterns in rearrangement events, the distance computation is still a hot topic as much belongs to mathematics as to biology. For the distance computation with input of two genomes containing unequal gene contents (with insertions/deletions and duplications) the problem is especially hard. In this thesis, we will discuss about our contributions to the distance estimation for unequal gene order data. The problem of finding the median of three genomes is the key process in building the most parsimonious phylogenetic trees from genome rearrangement data. For genomes with unequal contents, to the best of our knowledge, there is no algorithm that can help to find the median. In this thesis, we make our contributions to the median computation in two aspects. 1) Algorithm engineering aspect, we harness the power of streaming graph analytics methods to implement an exact DCJ median algorithm which run as fast as the heuristic algorithm and can help construct a better phylogenetic tree. 2) Algorithmic aspect, we theoretically formulate the problem of finding median with input of genomes having unequal gene content, which leads to the design and implementation of an efficient Lin-Kernighan heuristic based median algorithm. Inferring phylogenies (evolutionary history) of a set of given species is the ultimate goal when the distance and median model are chosen. For more than a decade, biologists and computer scientists have studied how to infer phylogenies by the measurement of genome rearrangement events using gene order data. While evolution is not an inherently parsimonious process, maximum parsimony (MP) phylogenetic analysis has been supported by widely applied to the phylogeny inference to study the evolutionary patterns of genome rearrangements. There are generally two problems with the MP phylogenetic arose by genome rearrangement: One is, given a set of modern genomes, how to compute the topologies of the according phylogenetic tree; Another is, given the topology of a model tree, how to infer the gene orders of the ancestor species. To assemble a MP phylogenetic tree constructor, there are multiple NP hard problems involved, unfortunately, they organized as one problem on top of other problems. Which means, to solve a NP hard problem, we need to solve multiple NP hard sub-problems. For phylogenetic tree construction with the input of unequal content genomes, there are three layers of NP hard problems. In this thesis, we will mainly discuss about our contributions to the design and implementation of the software package DCJUC (Phylogeny Inference using DCJ model to cope with Unequal Content Genomes), that can help to achieve both of these two goals. Aside from the biological problems, another issue we need to concern is about the use of the power of parallel computing to assist accelerating algorithms to handle huge data sets, such as the high resolution gene order data. For one thing, all of the method to tackle with phylogenetic problems are based on branch and bound algorithms, which are quite irregular and unfriendly to parallel computing. To parallelize these algorithms, we need to properly enhance the efficiency for localized memory access and load balance methods to make sure that each thread can put their potentials into full play. For the other, there is a revolution taking place in computing with the availability of commodity graphical processors such as Nvidia GPU and with many-core CPUs such as Cray-XMT, or Intel Xeon Phi Coprocessor with 60 cores. These architectures provide a new way for us to achieve high performance at much lower cost. However, code running on these machines are not so easily programmed, and scientific computing is hard to tune well on them. We try to explore the potentials of these architectures to help us accelerate branch and bound based phylogenetic algorithms.