Using substitution probabilities to improve position-specific scoring matrices

J G Henikoff; S Henikoff

doi:10.1093/bioinformatics/12.2.135

Comput Appl Biosci. 1996 Apr;12(2):135-43. doi: 10.1093/bioinformatics/12.2.135.

Authors

J G Henikoff¹, S Henikoff

Affiliation

¹ Howard Hughes Medical Institute, Basic Sciences Division, Seattle, WA 98104, USA. henikoff@howard.fhcrc.org

PMID: 8744776
DOI: 10.1093/bioinformatics/12.2.135

Abstract

Each column of amino acids in a multiple alignment of protein sequences can be represented as a vector of 20 amino acid counts. For alignment and searching applications, the count vector is an imperfect representation of a position, because the observed sequences are an incomplete sample of the full set of related sequences. One general solution to this problem is to model unobserved sequences by adding artificial 'pseudo-counts' to the observed counts. We introduce a simple method for computing pseudo-counts that combines the diversity observed in each alignment position with amino acid substitution probabilities. In extensive empirical tests, this position-based method out-performed other pseudo-count methods and was a substantial improvement over the traditional average score method used for constructing profiles.
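The abstract does not give the formula, so the following is only a minimal sketch of the general idea, under stated assumptions: the total number of pseudo-counts at a position is taken to grow with the number of distinct residues observed there (the position's diversity), and that total is distributed over residues according to substitution probabilities weighted by the observed counts. The function names, the `scale` parameter, and the three-letter `TOY_SUBST` table are hypothetical and purely illustrative; a real application would use all 20 amino acids and probabilities derived from a substitution matrix.

```python
from collections import Counter

# Hypothetical substitution probabilities for a toy 3-letter alphabet.
# TOY_SUBST[i][a] is the assumed probability of seeing residue a where
# residue i was observed. Each row sums to 1. These numbers are made up.
TOY_SUBST = {
    "A": {"A": 0.8, "L": 0.1, "I": 0.1},
    "L": {"A": 0.1, "L": 0.6, "I": 0.3},
    "I": {"A": 0.1, "L": 0.3, "I": 0.6},
}


def pseudo_counts(column, subst, scale=1.0):
    """Return (observed counts, pseudo-counts) for one alignment column.

    column: string of residues observed at this position
    subst:  nested dict of substitution probabilities (rows sum to 1)
    scale:  assumed multiplier applied to the column's diversity
    """
    counts = Counter(column)
    total = sum(counts.values())
    # Diversity = number of distinct residues seen in the column.
    budget = scale * len(counts)
    pseudo = {}
    for a in subst:
        # Weight each observed residue's substitution row by its frequency.
        expected = sum(counts[i] / total * subst[i][a] for i in counts)
        pseudo[a] = budget * expected
    return counts, pseudo


def column_probabilities(column, subst, scale=1.0):
    """Combine observed counts and pseudo-counts into residue probabilities."""
    counts, pseudo = pseudo_counts(column, subst, scale)
    total = sum(counts.values()) + sum(pseudo.values())
    return {a: (counts[a] + pseudo[a]) / total for a in subst}


conserved = column_probabilities("LLLL", TOY_SUBST)
diverse = column_probabilities("ALIL", TOY_SUBST)
```

In this sketch a fully conserved column such as "LLLL" receives few pseudo-counts, so the observed residue dominates, while a diverse column such as "ALIL" receives more. In both cases residues that were never observed still get a nonzero probability, with more weight going to residues that commonly substitute for the observed ones.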

Publication types

Research Support, U.S. Gov't, P.H.S.

MeSH terms

Amino Acid Sequence
Computers
Databases, Factual
Evaluation Studies as Topic
Odds Ratio
Probability
Proteins / chemistry
Proteins / genetics
Sequence Alignment / methods*
Sequence Alignment / statistics & numerical data

Substances

Proteins

Grants and funding

GM 29009/GM/NIGMS NIH HHS/United States