Accelerated variant curation from scientific literature using biomedical text mining

Rishab Mallick; Valerio Arnaboldi; Paul Davis; Stavros Diamantakis; Magdalena Zarowiecki; Kevin Howe

doi:10.17912/micropub.biology.000578

MicroPubl Biol. 2022 Jun 1;2022:10.17912/micropub.biology.000578. doi: 10.17912/micropub.biology.000578. eCollection 2022.

Authors

Rishab Mallick¹, Valerio Arnaboldi², Paul Davis¹, Stavros Diamantakis¹, Magdalena Zarowiecki¹, Kevin Howe¹

Affiliations

¹ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
² Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA.

Abstract

Biological databases collect and standardize data through biocuration. Although major model organism databases have automated some curation steps, a large portion of biocuration is still performed manually. To speed up the extraction of variant genomic positions, we have developed a hybrid approach that combines regular expressions, Named Entity Recognition based on BERT (Bidirectional Encoder Representations from Transformers), and bag-of-words to extract variant genomic locations from C. elegans papers for WormBase. On text extracted from 100 papers, our model achieves a precision of 82.59% for gene-mutation matches, and it even recovers some data that were missed during manual curation. The code is available at https://github.com/WormBase/genomic-info-from-papers.
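
To make the hybrid idea concrete, the following is a minimal sketch of how regular expressions and a bag-of-words filter might be combined to propose gene-mutation pairs from a sentence. It is not the authors' implementation and omits the BERT-based Named Entity Recognition step entirely. The patterns, the keyword list, and the function name `find_gene_mutation_pairs` are all hypothetical choices made for illustration.

```python
import re

# Hypothetical pattern for C. elegans style gene names such as "unc-22" or "daf-16".
GENE_RE = re.compile(r"\b[a-z]{3,4}-\d+\b")

# Hypothetical pattern for single-letter protein substitutions such as "G123A".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b[{AMINO_ACIDS}]\d+[{AMINO_ACIDS}]\b")

# Assumed bag-of-words vocabulary hinting that a sentence describes a variant.
VARIANT_KEYWORDS = {"allele", "mutation", "mutant", "substitution", "variant"}


def find_gene_mutation_pairs(text, min_keyword_hits=1):
    """Return (gene, mutation) pairs that co-occur in a keyword-bearing sentence."""
    pairs = []
    # Naive sentence split; abbreviations such as "C. elegans" would need
    # a proper sentence tokenizer in practice.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if len(words & VARIANT_KEYWORDS) < min_keyword_hits:
            continue  # bag-of-words filter: skip sentences with no variant vocabulary
        genes = GENE_RE.findall(sentence)
        mutations = MUTATION_RE.findall(sentence)
        pairs.extend((gene, mutation) for gene in genes for mutation in mutations)
    return pairs
```

In this sketch the regular expressions propose candidates and the keyword overlap acts as a cheap filter against incidental co-occurrences. A real pipeline would still need to map each matched mutation onto genomic coordinates, which is not attempted here.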

Grants and funding

U24 HG002223/HG/NHGRI NIH HHS/United States