Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements

A A Schäffer; L Aravind; T L Madden; S Shavirin; J L Spouge; Y I Wolf; E V Koonin; S F Altschul

doi:10.1093/nar/29.14.2994

Nucleic Acids Res. 2001 Jul 15;29(14):2994-3005. doi: 10.1093/nar/29.14.2994.

Authors

A A Schäffer¹, L Aravind, T L Madden, S Shavirin, J L Spouge, Y I Wolf, E V Koonin, S F Altschul

Affiliation

¹ National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA. schaffer@helix.nih.gov

Abstract

PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 ± 0.005 to 0.895 ± 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
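
The abstract describes a ROC measure normalized to lie between 0 (worst) and 1 (best) but does not spell out the formula. As an illustration only, the sketch below computes one common truncated variant, often written ROC_n: for each of the first n false positives in a ranked hit list, count the true positives ranked above it, then divide the sum by n times the total number of true positives. The function name `roc_n`, its parameters, and the handling of lists that end before n false positives are assumptions made for this example, not details taken from the paper.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], total_true: int, n: int) -> float:
    """Truncated ROC score in [0, 1] for a ranked list of search hits.

    ranked_labels: hits ordered best-first; True marks a true positive.
    total_true:    number of true positives in the whole database
                   (may exceed the number actually retrieved).
    n:             number of false positives at which to truncate.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")

    true_seen = 0   # true positives encountered so far
    false_seen = 0  # false positives encountered so far
    area = 0        # sum over false positives of true positives ranked above

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list ends before n false positives appear, the
    # missing false positives are treated as ranked after every retrieved hit.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    # Toy ranking: two true hits, one false hit, one true hit, one false hit.
    hits = [True, True, False, True, False]
    print(roc_n(hits, total_true=3, n=2))
```

A ranking that places every true positive ahead of the first false positive scores 1, and a ranking whose first n hits are all false positives scores 0, which matches the normalization described above.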

Publication types

Review

MeSH terms

Algorithms
Amino Acids / genetics
Animals
Computational Biology / methods
Computational Biology / statistics & numerical data
Databases, Factual*
Humans
Information Storage and Retrieval
Proteins / genetics*
Reproducibility of Results
Sensitivity and Specificity
Sequence Alignment / methods*
Software*

Substances

Amino Acids
Proteins