New ab initio methods of small genome sequence interpretation
Mills, Ryan Edward
This thesis presents novel methods for analyzing short viral sequences and identifying biologically significant regions based on their statistical properties. The first section of this thesis describes an ab initio method for identifying genes in viral genomes of varying type, shape and size. This method uses statistical models of viral protein-coding and non-coding regions. We have created an interactive database summarizing the results of applying this method to the viral genomes currently available in GenBank. This database, called VIOLIN, provides access to the genes identified for each viral genome, allows for further analysis of these gene sequences and the translated proteins, and graphically displays the distribution of protein-coding potential in a viral genome. The next two sections of this thesis describe individual projects in which two specific viral genomes were analyzed with the new method. The first project was devoted to the recently sequenced Herpes B virus from Rhesus macaque. This genome was initially thought to lack an ortholog of the gamma-34.5 gene, which encodes a neurovirulence factor necessary for viability of its two close relatives, human herpes simplex viruses 1 and 2. The genome of Rhesus macaque Herpes B virus was annotated using the new gene finding procedure, and an in-depth analysis was conducted to find a gamma-34.5 ortholog using a variety of similarity search tools. A profound similarity in codon usage between B virus and its host was also identified, despite the large difference in their GC contents (74% and 51%, respectively). The last section of this thesis describes the analysis of the Mouse Cytomegalovirus (MCMV) genome by a combination of methods such as sequence segmentation, gene finding and protein identification by mass spectrometry. The MCMV genome is a challenging subject for statistical sequence analysis due to the heterogeneity of its protein-coding regions. Therefore, the MCMV genome was segmented based on its nucleotide composition, and each segment was then considered independently. A thorough analysis was conducted to identify previously unnoticed genes, incorrectly annotated genes and potential sequence errors causing frameshifts. All the findings were then corroborated by mass spectrometry analysis.
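As a rough illustration of segmenting a genome by nucleotide composition, the sketch below splits a sequence into fixed-size windows and merges neighbouring windows whose GC fractions are close. This is not the segmentation procedure used in the thesis, which the abstract does not specify. The function names `gc_fraction` and `segment_by_gc`, the window size and the tolerance are all hypothetical choices made for this example.

```python
def gc_fraction(seq):
    """Return the fraction of G and C nucleotides in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def segment_by_gc(seq, window=1000, tolerance=0.05):
    """Split seq into compositionally similar segments.

    Illustrative assumption: a window joins the current segment when its
    GC fraction differs from that segment's GC fraction by at most
    `tolerance`; otherwise it starts a new segment.
    Returns a list of (start, end, gc) tuples with end exclusive.
    """
    segments = []
    for pos in range(0, len(seq), window):
        chunk = seq[pos:pos + window]
        end = pos + len(chunk)
        gc = gc_fraction(chunk)
        if segments and abs(gc - segments[-1][2]) <= tolerance:
            # Extend the current segment and recompute its overall GC fraction.
            start = segments[-1][0]
            segments[-1] = (start, end, gc_fraction(seq[start:end]))
        else:
            segments.append((pos, end, gc))
    return segments


if __name__ == "__main__":
    # Toy sequence: an AT-rich half followed by a GC-rich half.
    toy = "AT" * 50 + "GC" * 50
    for start, end, gc in segment_by_gc(toy, window=20):
        print(f"{start}-{end}: GC = {gc:.2f}")
```

Each resulting segment could then be passed separately to a gene finder, so that the statistical models are fitted to a more homogeneous stretch of sequence.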