Gene finding in eukaryotic genomes using external information and machine learning techniques
Burns, Paul D.
Gene finding in eukaryotic genomes is an essential part of a comprehensive approach to modern systems biology. Most existing methods rely on a combination of computational prediction and external information about gene structures from transcript sequences and comparative genomics. In the past, external sequence information consisted of a combination of full-length cDNA and expressed sequence tag (EST) sequences. The availability of RNA-seq data promises much improvement in the prediction of genes and gene isoforms. However, productive use of RNA-seq for gene prediction has been difficult because of challenges in mapping RNA-seq reads that span splice junctions and because of prevalent splicing noise in the cell. This work addresses this difficulty through the development of methods and the implementation of two new pipelines: (1) a novel pipeline for accurate mapping of RNA-seq reads to compact genomes and (2) a pipeline for prediction of genes in eukaryotic genomes using the RNA-seq spliced alignments. Machine learning methods are employed to overcome errors that arise when short RNA-seq reads are mapped across introns and then used to determine sequence model parameters for gene prediction. In addition to the development of these new methods, genome annotation work was performed on several plant genome projects.