Locating protein coding regions in human DNA using a decision tree algorithm

S Salzberg

doi:10.1089/cmb.1995.2.473

J Comput Biol. 1995 Fall;2(3):473-85. doi: 10.1089/cmb.1995.2.473.

Author

S Salzberg¹

Affiliation

¹ Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA.

PMID: 8521276
DOI: 10.1089/cmb.1995.2.473

Abstract

Genes in eukaryotic DNA cover hundreds or thousands of base pairs, while the regions of those genes that code for proteins may occupy only a small percentage of the sequence. Identifying the coding regions is of vital importance in understanding these genes. Many recent research efforts have studied computational methods for distinguishing between coding and noncoding regions, and several promising results have been reported. We describe here a new approach, using a machine learning system that builds decision trees from the data. This approach combines several coding measures to produce classifiers with consistently higher accuracies than previous methods, on DNA sequences ranging from 54 to 162 base pairs in length. The algorithm is very efficient, and it can easily be adapted to different sequence lengths. Our conclusion is that decision trees are a highly effective tool for identifying protein coding regions.
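
To make the idea in the abstract concrete, here is a minimal sketch of combining several coding measures in a decision tree over fixed-length windows. It is an illustration under stated assumptions, not the paper's method: the abstract names neither the measures nor the tree-building system, so the two measures (`stop_fraction`, `asymmetry`), the simple axis-parallel learner `build_tree`, and the four 54-base training windows are all hypothetical and invented for this example.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def coding_measures(window):
    """Return two toy coding measures for a DNA window read in frame 0.

    Both measures are assumptions made for illustration only.
    """
    codons = [window[i:i + 3] for i in range(0, len(window) - 2, 3)]
    # Measure 1: fraction of in-frame codons that are stop codons.
    stop_fraction = sum(c in STOP_CODONS for c in codons) / len(codons)
    # Measure 2: how unevenly each base is spread over the three codon positions.
    asymmetry = 0.0
    for base in "ACGT":
        counts = [sum(1 for c in codons if c[p] == base) for p in range(3)]
        mean = sum(counts) / 3
        asymmetry += sum((n - mean) ** 2 for n in counts) / len(codons)
    return [stop_fraction, asymmetry]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a small axis-parallel tree; leaves are labels, nodes are tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no split separates the rows
        return majority
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       depth + 1, max_depth))

def classify(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Invented 54-base training windows (not data from the paper).
training = [
    ("ATGGCCGAC" * 6, "coding"),
    ("GCCAAGCTG" * 6, "coding"),
    ("TAATTTAGA" * 6, "noncoding"),
    ("ATTTGATAG" * 6, "noncoding"),
]
rows = [coding_measures(seq) for seq, _ in training]
labels = [label for _, label in training]
tree = build_tree(rows, labels)

print(classify(tree, coding_measures("ATGGCTGAA" * 6)))
print(classify(tree, coding_measures("TGACCCTAA" * 6)))
```

Because `coding_measures` divides by the number of codons in the window, the same sketch runs unchanged on other window lengths that are multiples of three, such as 108 or 162 bases. A realistic classifier would need many labeled windows and a broader set of measures than the two assumed here.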

Publication types

Comparative Study
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Codon / genetics
DNA / classification
DNA / genetics*
Decision Trees*
Exons
Genes
Humans
Proteins / genetics*

Substances

Codon
Proteins
DNA