Sensitive and error-tolerant annotation of protein-coding DNA with BATH

bioRxiv [Preprint]. 2024 Jan 1:2023.12.31.573773. doi: 10.1101/2023.12.31.573773.

Abstract

We present BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs). BATH is built on top of the HMMER3 code base and simplifies the pHMM-based annotation workflow by providing a straightforward input interface and easy-to-interpret output. BATH also introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions and deletions (indels). BATH matches the accuracy of HMMER3 for annotation of sequences containing no errors, and achieves higher accuracy than all other tested tools for annotation of sequences containing nucleotide indels. These results suggest that BATH should be used when high annotation sensitivity is required, particularly when frameshift errors are expected to interrupt protein-coding regions, as is true of long-read sequencing data and pseudogenes.
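To make the frameshift problem concrete, the following minimal Python sketch shows how a single-nucleotide deletion changes the translation of everything downstream of it in the original reading frame, while the original downstream residues reappear in a different frame. It is an illustration only and does not reproduce BATH's algorithms or interface; the function name `translate`, the example sequence, and the deletion position are all hypothetical choices, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, listed with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str, frame: int = 0) -> str:
    """Translate the forward strand of `dna` starting at offset `frame` (0, 1, or 2).

    Trailing bases that do not form a complete codon are ignored.
    """
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical error-free coding sequence: ATG GCC AAG CTG GAA ACC
dna = "ATGGCCAAGCTGGAAACC"

# Simulate a sequencing error: delete the single nucleotide at index 6.
mutated = dna[:6] + dna[7:]

original_peptide = translate(dna)        # intact reading frame
shifted_peptide = translate(mutated)     # same frame, after the deletion
recovered_tail = translate(mutated, 2)   # downstream residues now sit in frame 2

print(original_peptide)  # MAKLET
print(shifted_peptide)   # MASWK
print(recovered_tail)    # GQLET
```

In this toy case, a translation that stays in one frame matches the original protein only up to the deletion, whereas the residues after it (`LET`) are recoverable only by switching frames. That is the general situation a frameshift-aware aligner has to handle; how BATH does so is described in the full preprint, not here.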

Keywords: Frameshift Mutations; Genome Annotation; Profile Hidden Markov Models; Sequence Alignment.

Publication types

  • Preprint