Prokaryotic Gene Start Prediction: Algorithms for Genomes and Metagenomes
Prokaryotic gene prediction is the task of finding genes in archaeal or bacterial DNA sequences. These genomes consist of alternating gene-coding and non-coding regions, meaning the task is solved by determining the start and end points of each gene in the DNA sequence, with gene-start prediction generally considered to be more difficult. The primary focus of this work is to improve gene-start prediction accuracy and our understanding of the biological translation-initiation mechanisms used to mark and determine gene-starts. There are two challenges that characterize this task. First, ground-truth, experimentally verified gene-starts are only available for a very small set of genes, and second, our knowledge of translation-initiation mechanisms is incomplete and quite often misleading. Three motivating questions arise from these challenges and are addressed in this work.
First, how can we predict gene-starts in a DNA sequence without relying on ground-truth data and without any prior biological knowledge of that species? I show how simplifying assumptions about translation-initiation mechanisms biased the design of existing gene-finder algorithms, hindering their predictive performance. I present GeneMarkS-2, an algorithm that relaxes those assumptions and learns more accurate representations of these mechanisms, thereby achieving more accurate predictions. Using it, I provide an updated view of the diversity of translation-initiation mechanisms across the prokaryotic domain. GeneMarkS-2 is now used by the National Center for Biotechnology Information (NCBI) to annotate its database of more than two hundred thousand prokaryotic genomes.
Second, how can we measure the accuracy of gene-start prediction without access to ground-truth data? I show that the accuracy of existing methods measured on the limited set of verified data does not generalize to the much larger and more diverse set of available genes. This proves that these benchmark sets of verified starts are not representative enough for this task. I describe an alternative method to boost prediction performance for genes outside the ground-truth set by effectively filtering low-certainty predictions. This is done by only selecting gene-start predictions that are corroborated by multiple, independent sources of evidence. As part of this approach, I propose StartLink, a new comparative-genomics method for gene-start prediction; that is, one that compares DNA fragments from multiple species rather than relying solely on a single genome.
Third, how can we predict gene-starts for metagenomes, i.e., cases where frequently only part of the DNA sequence is available? Here, I describe how the mechanisms for gene-start prediction developed for GeneMarkS-2 can be ported to metagenomes, which often have short DNA fragments that hinder the performance of predictive methods. I present MetaGeneMarkS, and show that it achieves accuracies on metagenomes close to those achieved by GeneMarkS-2 on fully sequenced DNA.
Several recurring themes appear throughout this work. Understanding the limits of our knowledge of translation-initiation mechanisms proves essential to designing better models and opens a field for new exploration of the diversity of these mechanisms. Furthermore, our unhealthy dependence on verified gene-starts for measuring performance has kept us, and continues to keep us, from accurately portraying the quality of our predictors, despite the >95% average accuracy levels measured on this set. It is therefore critical to restate that gene-start prediction is still an open problem.
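The idea of keeping only gene-start predictions corroborated by multiple independent sources can be illustrated with a minimal sketch. This is not the implementation used in this work. The function name `consensus_starts`, the `min_sources` parameter, the gene key (sequence ID, strand, stop coordinate), and the example coordinates are all hypothetical, and the sketch assumes that each predictor reports one start per gene and that genes can be matched across predictors by their stop coordinate.

```python
from typing import Dict, Tuple

# Hypothetical gene identifier: (sequence id, strand, stop coordinate).
# Assumption: predictors agree on where a gene ends, so the stop
# coordinate can be used to match the same gene across sources.
GeneKey = Tuple[str, str, int]


def consensus_starts(
    *sources: Dict[GeneKey, int], min_sources: int = 2
) -> Dict[GeneKey, int]:
    """Keep a gene only if at least `min_sources` sources predict it
    and every source that predicts it reports the same start."""
    kept: Dict[GeneKey, int] = {}
    all_keys = set().union(*sources)  # every gene seen by any source
    for key in all_keys:
        starts = [src[key] for src in sources if key in src]
        if len(starts) >= min_sources and len(set(starts)) == 1:
            kept[key] = starts[0]
    return kept


# Made-up predictions from two independent sources.
ab_initio = {
    ("contig1", "+", 900): 100,    # both sources agree on this start
    ("contig1", "-", 2000): 2900,  # sources disagree on this start
    ("contig1", "+", 5000): 4100,  # only one source predicts this gene
}
comparative = {
    ("contig1", "+", 900): 100,
    ("contig1", "-", 2000): 2870,
}

high_confidence = consensus_starts(ab_initio, comparative)
print(high_confidence)
```

In this toy input only the first gene survives the filter: the second is dropped because the two sources disagree, and the third because a single source predicts it. The trade-off is that such a filter raises confidence in the retained starts at the cost of leaving some genes without a prediction.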