University of Florida
Bioinformatics

Brocchieri Lab

Gene Finding with High G+C Content

Analyses of prokaryotic genes and genomes have shown that global G+C content correlates in a distinct way with the G+C content at each of the first, second and third codon positions (Bibb et al. 1984, Muto and Osawa 1987). Coding regions in genomes of high G+C content are generally distinguished by strong asymmetries in the G+C content of codon positions 1, 2 and 3, which can be exploited to detect coding regions by frame analysis (Bibb et al. 1984).

In frame analysis, the frequency of G+C is calculated separately for each of the three frames, using every third position within a moving window of predefined size (e.g., 201 nt). The resulting variations in G+C content along the genome, determined over positions in frame with nucleotide 1, 2 or 3 of the complete genome, are represented by three curves referred to as "S-profiles" (Brocchieri et al. 2005), which allow the positions of coding sequences to be visualized. Based on the distribution of G+C over codon positions expected for a given total G+C content, I have developed a measure of bias in G+C usage in coding sequences and a related measure of coding potential that quantifies the likelihood that any given genome position is coding in each frame (Brocchieri et al. 2005).

Applying this procedure together with information on sequence conservation between two related species of rodent cytomegalovirus (mouse CMV and rat CMV) has shown that the method can detect genes when the G+C content is as low as 50-55%. It led to the identification of many previously unrecognized coding regions in these genomes, as well as previously mis-annotated genes (Brocchieri et al. 2005). The method will be applied to identify new coding regions and to verify previously annotated coding regions in the related human cytomegalovirus (HCMV).

S-profile analysis will also be applied to gene prediction in many high G+C prokaryotic genomes. Preliminary results on the genome of Myxococcus xanthus (G+C = 67.5%) reveal more than 120 previously unannotated genes that are clearly identified by S-profile analysis. A computer procedure that automatically predicts coding regions from S-profile information is under development.
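
The frame-specific G+C calculation described above can be illustrated with a minimal Python sketch. It computes only the raw G+C fraction at every third position within a moving window, for each of the three frames defined relative to nucleotide 1, 2 or 3 of the sequence. It is not the published S-profile implementation and does not include the bias or coding-potential measures; the function name frame_gc_profiles and the step parameter are assumptions made for illustration.

```python
def frame_gc_profiles(seq, window=201, step=3):
    """Return three lists of G+C fractions, one per frame.

    Frame k (0, 1, 2) uses the positions p of the whole sequence with
    p % 3 == k, i.e. positions in frame with nucleotide 1, 2 or 3.
    Each list holds one value per window start (0, step, 2*step, ...).
    Illustrative sketch only, not the published S-profile code.
    """
    if window % 3 != 0:
        raise ValueError("window must be a multiple of 3")
    seq = seq.upper()
    # 1 where the base is G or C, 0 otherwise
    is_gc = [1 if base in "GC" else 0 for base in seq]
    per_frame = window // 3  # positions per frame inside one window
    profiles = ([], [], [])
    for start in range(0, len(seq) - window + 1, step):
        end = start + window
        for frame in range(3):
            # first position inside the window that lies in this frame
            first = start + (frame - start) % 3
            gc_count = sum(is_gc[first:end:3])
            profiles[frame].append(gc_count / per_frame)
    return profiles
```

For example, calling frame_gc_profiles("GAT" * 100) on a toy sequence returns a first profile that is 1.0 in every window and second and third profiles that are 0.0 throughout, since only the first position of each triplet is G or C. In a real high G+C genome, the three curves would be plotted against window position and inspected for the frame-specific asymmetries that mark coding regions.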

Publications supported by this project: