The amount of sequenced genomic DNA is growing explosively, driven by newly developed biotechnology and various genome projects. Several million bases of genomic DNA are sequenced daily and made available to the public. It has therefore become crucial to analyze these data and characterize sequence content in a high-throughput computational way. To date, many gene-finding programs have been developed to annotate newly sequenced genomes, which is an essential step in genome annotation. These programs are based on pattern recognition methods such as artificial neural networks [GRAIL (Xu et al., 1997), GeneParser (Snyder and Stormo, 1995)], discriminant analysis [GeneFinder (Solovyev et al., 1994), MZEF (Zhang, 1997)], and hidden Markov models [Genie (Kulp et al., 1996), GENSCAN (Burge and Karlin, 1997), HMMgene (Krogh, 1997)].
Although these gene-finding programs have reported high prediction accuracy in specific domains, we still lack a universal program that achieves satisfactory accuracy in general cases. Researchers therefore further verify the predictions of these programs by searching for homologues in the database. However, it is already known that about 50% of newly discovered genes have no homologues in the protein sequence database (Uberbacher et al., 1996; Dunham et al., 1999). As such, improving the prediction accuracy of gene-finding programs is more important than the subsequent validation task. Moreover, most researchers strive to develop a new gene-finding program that attains better prediction accuracy than the others. They overlook the fact that a gene-finding method may yield highly accurate predictions in a specific domain, yet no single gene-finding approach is the most appropriate gene predictor for all newly sequenced genomes. Even a weaker gene-finding program can correct part of the predictions produced by a novel gene predictor. For instance, when we annotate a 28,984 bp-long contig of human DNA sequence into exon and intron regions, GENSCAN correctly identifies gene structures in 28,430 bp of it. Of the remaining 554 bp that are not well annotated by GENSCAN, GeneView and HMMgene correctly recognize gene structures in 366 bp and 454 bp, respectively. This reveals that these programs compensate for one another well in gene prediction. Therefore, a careful combination of multiple gene-finding programs is very likely to outperform any individual program.
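The compensation effect described above can be made concrete with a small per-base comparison. The following is a minimal sketch, not the procedure used in this work: it assumes that each program's output has been reduced to one label per base ('E' for exon, 'I' for intron), and the function names `correct_bases` and `rescued_bases` as well as the toy label strings are hypothetical and chosen only for illustration.

```python
def correct_bases(truth: str, pred: str) -> int:
    """Count the positions where a predictor's label matches the true label."""
    if len(truth) != len(pred):
        raise ValueError("label strings must have the same length")
    return sum(t == p for t, p in zip(truth, pred))


def rescued_bases(truth: str, primary: str, secondary: str) -> int:
    """Count the bases that `primary` mislabels but `secondary` labels correctly."""
    if not (len(truth) == len(primary) == len(secondary)):
        raise ValueError("label strings must have the same length")
    return sum(
        t != p and t == s
        for t, p, s in zip(truth, primary, secondary)
    )


# Toy per-base annotations (hypothetical data): 'E' = exon, 'I' = intron.
truth     = "IIEEEEIIIIEEEIII"
primary   = "IIEEEEIIIIIIEIII"  # mislabels two exon bases as intron
secondary = "IEEEEEIIIIEEIIII"  # makes different errors elsewhere

n_correct = correct_bases(truth, primary)
n_missed = len(truth) - n_correct
n_rescued = rescued_bases(truth, primary, secondary)

print(f"primary correct: {n_correct} bp, missed: {n_missed} bp")
print(f"missed bases recovered by secondary: {n_rescued} bp")

# The same bookkeeping applied to the figures quoted in the text:
contig_length = 28_984
genscan_correct = 28_430
print(contig_length - genscan_correct)  # 554 bp not well annotated by GENSCAN
```

In this toy case the two predictors have the same overall accuracy, yet the secondary predictor recovers every base that the primary one misses, which is the kind of complementarity that motivates combining programs.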

