Discriminating between empirical studies and nonempirical works using automated text classification

Alexis Langlois; Jian-Yun Nie; James Thomas; Quan Nha Hong; Pierre Pluye

doi:10.1002/jrsm.1317

Res Synth Methods. 2018 Dec;9(4):587-601. doi: 10.1002/jrsm.1317. Epub 2018 Aug 29.

Authors

Alexis Langlois¹, Jian-Yun Nie¹, James Thomas², Quan Nha Hong³, Pierre Pluye³

Affiliations

¹ Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal, Canada.
² EPPI-Centre, University College London Institute of Education, London, UK.
³ Family Medicine, McGill University, Montréal, Canada.

PMID: 30103261
DOI: 10.1002/jrsm.1317

Abstract

Objective: Identify the most performant automated text classification method (eg, algorithm) for differentiating empirical studies from nonempirical works in order to facilitate systematic mixed studies reviews.

Methods: The algorithms were trained and validated with 8050 database records, which had previously been manually categorized as empirical or nonempirical. A Boolean mixed filter developed for filtering MEDLINE records (title, abstract, keywords, and full texts) was used as a baseline. The set of features (eg, characteristics from the data) included observable terms and concepts extracted from a metathesaurus. The efficiency of the approaches was measured using sensitivity, precision, specificity, and accuracy.
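The four evaluation measures named in the Methods can be illustrated with a minimal sketch. It assumes that "empirical" is treated as the positive class, which the abstract does not state explicitly. The names `ConfusionCounts`, `count_outcomes`, and `screening_metrics` are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # empirical records predicted empirical
    fp: int  # nonempirical records predicted empirical
    tn: int  # nonempirical records predicted nonempirical
    fn: int  # empirical records predicted nonempirical


def count_outcomes(gold, predicted, positive="empirical"):
    """Tally the confusion counts from two equal-length label sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = tn = fn = 0
    for g, p in zip(gold, predicted):
        if p == positive:
            if g == positive:
                tp += 1
            else:
                fp += 1
        else:
            if g == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionCounts(tp=tp, fp=fp, tn=tn, fn=fn)


def _ratio(numerator, denominator):
    # Return NaN rather than raising when a measure is undefined.
    return numerator / denominator if denominator else float("nan")


def screening_metrics(c):
    """Compute sensitivity, precision, specificity, and accuracy."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # share of empirical records found
        "precision": _ratio(c.tp, c.tp + c.fp),    # share of positive predictions that are right
        "specificity": _ratio(c.tn, c.tn + c.fp),  # share of nonempirical records rejected
        "accuracy": _ratio(c.tp + c.tn, total),    # share of all records labeled correctly
    }
```

In a screening setting, sensitivity is usually the measure watched most closely, because an empirical study that is wrongly discarded cannot be recovered later in the review, whereas a false positive only costs extra reading time.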

Results: The decision trees algorithm demonstrated the highest performance, surpassing the accuracy of the Boolean mixed filter by 30%. The use of full texts did not result in significant gains compared with title, abstract, keywords, and records. Results also showed that mixing concepts with observable terms can improve the classification.
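The idea of mixing concepts with observable terms can be sketched as one feature vector that holds both kinds of counts. This is an illustration only, not the authors' pipeline: the small `TOY_CONCEPTS` lexicon is a hypothetical stand-in for a metathesaurus lookup, and the concept labels, function names, and tokenization rule are all assumptions.

```python
import re
from collections import Counter

# Hypothetical phrase-to-concept lexicon standing in for a metathesaurus.
# Several surface phrases can map to the same concept label.
TOY_CONCEPTS = {
    "randomized controlled trial": "concept:STUDY_DESIGN",
    "cohort study": "concept:STUDY_DESIGN",
    "semi-structured interviews": "concept:DATA_COLLECTION",
    "editorial": "concept:COMMENTARY",
}


def term_features(text):
    """Count observable terms (lowercased alphabetic tokens)."""
    return Counter("term:" + tok for tok in re.findall(r"[a-z]+", text.lower()))


def concept_features(text, lexicon):
    """Count concept labels whose phrases occur in the text."""
    lowered = text.lower()
    features = Counter()
    for phrase, concept in lexicon.items():
        occurrences = lowered.count(phrase)
        if occurrences:
            features[concept] += occurrences
    return features


def mixed_features(text, lexicon=TOY_CONCEPTS):
    """Merge term counts and concept counts into a single feature vector."""
    return term_features(text) + concept_features(text, lexicon)
```

Prefixing the keys with `term:` and `concept:` keeps the two feature families from colliding, so a classifier can weigh a shared concept such as a study design separately from the individual words that express it.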

Significance: Screening of records, identified in bibliographic databases, for relevant studies to include in systematic reviews can be accelerated with automated text classification.

Keywords: automated text classification; decision tree; health care; research method; support vector machine; systematic review.

MeSH terms

Algorithms
Bayes Theorem
Data Mining / methods
Databases, Bibliographic*
Humans
Information Storage and Retrieval / methods*
Information Storage and Retrieval / standards
Models, Statistical
Pattern Recognition, Automated
Reference Standards
Research Design*
Search Engine
Sensitivity and Specificity
Subject Headings
Support Vector Machine
Systematic Reviews as Topic

Grants and funding

MR/J005037/1/MRC_/Medical Research Council/United Kingdom