Transformer versus traditional natural language processing: how much data is enough for automated radiology report classification?

Eric Yang; Matthew D Li; Shruti Raghavan; Francis Deng; Min Lang; Marc D Succi; Ambrose J Huang; Jayashree Kalpathy-Cramer

doi:10.1259/bjr.20220769

Br J Radiol. 2023 Sep;96(1149):20220769. doi: 10.1259/bjr.20220769. Epub 2023 May 25.

Authors

Eric Yang¹,², Matthew D Li³, Shruti Raghavan², Francis Deng², Min Lang², Marc D Succi², Ambrose J Huang², Jayashree Kalpathy-Cramer⁴

Affiliations

¹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
² Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.
³ Department of Radiology and Diagnostic Imaging, University of Alberta, Edmonton, Alberta, Canada.
⁴ Department of Ophthalmology, University of Colorado, Aurora, CO, USA.

Abstract

Objectives: Current state-of-the-art natural language processing (NLP) techniques use transformer deep-learning architectures, which depend on large training datasets. We hypothesized that traditional NLP techniques may outperform transformers for smaller radiology report datasets.

Methods: We compared the performance of BioBERT, a deep-learning-based transformer model pre-trained on biomedical text, and three traditional machine-learning models (gradient boosted tree, random forest, and logistic regression) on seven classification tasks given free-text radiology reports. Tasks included detection of appendicitis, diverticulitis, bowel obstruction, and enteritis/colitis on abdomen/pelvis CT reports, ischemic infarct on brain CT/MRI reports, and medial and lateral meniscus tears on knee MRI reports (7,204 total annotated reports). The performance of NLP models on held-out test sets was compared after training using the full training set, and 2.5%, 10%, 25%, 50%, and 75% random subsets of the training data.
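The subset-training protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the `(text, label)` report layout, the function names, the fixed seed, the use of accuracy as the metric, and the majority-class placeholder model are all hypothetical. In practice the `fit` argument would wrap a real model such as logistic regression or a fine-tuned BioBERT.

```python
import random
from typing import Callable, Sequence

# Training-set fractions named in the Methods paragraph, plus the full set.
FRACTIONS = (0.025, 0.10, 0.25, 0.50, 0.75, 1.0)

# Hypothetical layout: (free-text report, binary label).
Report = tuple[str, int]
Predictor = Callable[[str], int]


def random_subset(train: Sequence[Report], fraction: float, seed: int = 0) -> list[Report]:
    """Draw a random subset of the training reports without replacement."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n = max(1, round(len(train) * fraction))
    return random.Random(seed).sample(list(train), n)


def learning_curve(
    train: Sequence[Report],
    test: Sequence[Report],
    fit: Callable[[list[Report]], Predictor],
    fractions: Sequence[float] = FRACTIONS,
    seed: int = 0,
) -> list[tuple[int, float]]:
    """Train on each subset and score on the same held-out test set.

    Returns (training sample count, accuracy) pairs. Accuracy is an
    assumption here; the abstract does not name the metric used.
    """
    curve = []
    for fraction in fractions:
        subset = random_subset(train, fraction, seed)
        predict = fit(subset)
        correct = sum(predict(text) == label for text, label in test)
        curve.append((len(subset), correct / len(test)))
    return curve


def fit_majority(subset: list[Report]) -> Predictor:
    """Placeholder model: always predicts the subset's majority class."""
    positives = sum(label for _, label in subset)
    majority = 1 if positives * 2 >= len(subset) else 0
    return lambda text: majority
```

Keeping the test set fixed while only the training subset changes means that differences along the curve reflect training sample size rather than a change in evaluation data.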

Results: In all tested classification tasks, BioBERT performed poorly at smaller training sample sizes compared to non-deep-learning NLP models. Specifically, BioBERT required training on approximately 1,000 reports to perform similarly to or better than non-deep-learning models. At around 1,250 to 1,500 training samples, the testing performance of all models began to plateau, with additional training data yielding minimal performance gain.

Conclusions: With larger sample sizes, transformer NLP models achieved superior performance in radiology report binary classification tasks. However, with smaller training sample sizes (<1,000 reports) and more imbalanced training data, traditional NLP techniques performed better.

Advances in knowledge: Our benchmarks can help guide clinical NLP researchers in selecting machine-learning models according to their dataset characteristics.

MeSH terms

Humans
Machine Learning
Magnetic Resonance Imaging
Natural Language Processing*
Radiology*
Tomography, X-Ray Computed / methods

Grants and funding

P41 EB015896/EB/NIBIB NIH HHS/United States