Evaluation of human-readable annotation in biomolecular sequence databases with biological rule libraries

Bioinformatics. 1999 Jul-Aug;15(7-8):528-35. doi: 10.1093/bioinformatics/15.7.528.

Abstract

Motivation: Computer-based selection of entries from sequence databases by a shared functional description, for example a common cellular localization or a contribution to the same phenotypic function, is a difficult task. Automatic semantic analysis of annotations is hampered not only by incomplete functional assignments. A major problem is that annotations are written in a rich, non-formalized language intended for a human expert, who can extract considerably more information from the text than is immediately apparent by drawing on extensive biological background knowledge and logical reasoning.

Approach: A technique for automated annotation evaluation has been developed that combines lexical analysis with biological rule libraries. The proposed algorithm generates new functional descriptors from the annotation of a given entry, using the semantic units of the annotation as premises for implications executed in accordance with the rule library.
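
The general idea of the approach can be illustrated with a minimal Python sketch. It is not the Meta_A(nnotator) implementation: the rule entries, the word-level tokenizer, and the names RULES, semantic_units, and infer_descriptors are all hypothetical, chosen only to show how semantic units extracted from an annotation might serve as premises for chained implications.

```python
import re

# Hypothetical rule library (not taken from the paper): each rule pairs a set
# of premise terms with the descriptor that is concluded when all are present.
RULES = [
    ({"histone"}, "chromatin"),
    ({"chromatin"}, "nuclear"),
    ({"transmembrane"}, "membrane"),
    ({"signal", "secreted"}, "extracellular"),
]


def semantic_units(annotation):
    """Crude lexical analysis: lowercase the text and collect its word tokens."""
    return set(re.findall(r"[a-z0-9]+", annotation.lower()))


def infer_descriptors(annotation, rules=RULES):
    """Apply the rules repeatedly until no new descriptor can be derived."""
    facts = semantic_units(annotation)
    derived = set()
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are known facts and its
            # conclusion has not been established yet.
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                derived.add(conclusion)
                changed = True
    return derived


if __name__ == "__main__":
    print(sorted(infer_descriptors("Histone H2A, core component of nucleosome")))
```

In this sketch, a derived descriptor is added back to the set of known facts, so one rule's conclusion can act as a premise for another. That is how "histone" leads to "nuclear" by way of "chromatin". A real rule library would need curated biological rules and a more careful lexical analysis than plain word splitting.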

Results: The prototype of a software system, the Meta_A(nnotator) program, is described and the results of its application to sequence attribute assignment and sequence selection problems, such as cellular localization and sequence domain annotation of SWISS-PROT entries, are presented. The current software version assigns useful subcellular localization qualifiers to approximately 88% of all SWISS-PROT entries. As shown by demonstrative examples, the combination of sequence and annotation analysis is a powerful approach for the detection of mutual annotation/sequence inconsistencies.

Availability: Results for the cellular localization assignment can be viewed at the URL http://www.bork.embl-heidelberg.de/CELL_LOC/CELL_LOC.html.

MeSH terms

  • Algorithms
  • Animals
  • Databases, Factual*
  • Genomic Library*
  • Humans
  • Molecular Sequence Data
  • Protein Structure, Tertiary / genetics
  • Proteins / genetics
  • Software
  • Subcellular Fractions / chemistry

Substances

  • Proteins