Use of Natural Language Processing to Improve Identification of Patients With Peripheral Artery Disease

E Hope Weissler; Jikai Zhang; Steven Lippmann; Shelley Rusincovitch; Ricardo Henao; W Schuyler Jones

doi:10.1161/CIRCINTERVENTIONS.120.009447

Circ Cardiovasc Interv. 2020 Oct;13(10):e009447. doi: 10.1161/CIRCINTERVENTIONS.120.009447. Epub 2020 Oct 12.

Authors

E Hope Weissler¹, Jikai Zhang², Steven Lippmann³, Shelley Rusincovitch⁴, Ricardo Henao²,⁴, W Schuyler Jones³,⁵

Affiliations

¹ Division of Vascular and Endovascular Surgery (E.H.W.), Duke University School of Medicine, Durham, NC.
² Department of Biostatistics and Bioinformatics (J.Z., R.H.), Duke University School of Medicine, Durham, NC.
³ Department of Population Health Sciences (S.L., W.S.J.), Duke University School of Medicine, Durham, NC.
⁴ Duke Forge (S.R., R.H.), Duke University School of Medicine, Durham, NC.
⁵ Division of Cardiology (W.S.J.), Duke University School of Medicine, Durham, NC.

PMID: 33040585
PMCID: PMC7577538
DOI: 10.1161/CIRCINTERVENTIONS.120.009447

Abstract

Background: Peripheral artery disease (PAD) is underrecognized, undertreated, and understudied; addressing each of these gaps requires efficient and accurate identification of patients with PAD. Currently, PAD patient identification relies on diagnosis/procedure codes or on lists of patients diagnosed or treated by specific providers in specific locations and ways. The goal of this research was to leverage natural language processing to identify patients with PAD in an electronic health record system more accurately than a structured data-based approach.

Methods: The clinical notes from a cohort of 6861 patients in our health system whose PAD status had previously been adjudicated were used to train, test, and validate a natural language processing model using 10-fold cross-validation. The performance of this model was described using the area under the receiver operating characteristic and average precision curves; its performance was quantitatively compared with an administrative data-based least absolute shrinkage and selection operator (LASSO) approach using the DeLong test.
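The two summary metrics named in the Methods can be illustrated with a minimal, standard-library sketch. This is not the authors' code: the function names (`auroc`, `average_precision`, `k_fold_indices`) are hypothetical, the model itself is omitted, and the fold assignment is only one plausible way to set up 10-fold cross-validation.

```python
import random

def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Assumes labels are 0/1 and that both classes are present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive/negative pairs where the positive scores higher; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values taken at the rank of each true positive.

    Assumes at least one positive label; tied scores keep their input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint held-out folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

In a full workflow, a model would be refit on the nine remaining folds each time and scored on the held-out fold, giving one value of each metric per fold that can then be summarized across folds.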

Results: The median (SD) of the area under the receiver operating characteristic curve for the natural language processing model was 0.888 (0.009) versus 0.801 (0.017) for the LASSO-based approach alone (DeLong P<0.0001). The median (SD) of the area under the precision curve was 0.909 (0.008) versus 0.816 (0.012) for the structured data-based approach. When sensitivity was set at 90%, the precision was 65% for LASSO and 74% for the machine learning approach, while the specificity was 41% for LASSO and 62% for the machine learning approach.
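Reporting precision and specificity "at 90% sensitivity" means choosing a score threshold that captures at least 90% of true PAD cases and then reading off the other metrics at that threshold. The sketch below shows one way to do this, assuming 0/1 labels and higher scores meaning more likely PAD; the function name `operating_point` and the toy data in the usage note are illustrative and are not taken from the study.

```python
def operating_point(labels, scores, target_sensitivity=0.90):
    """Find the strictest threshold that reaches the target sensitivity.

    Returns the threshold with the sensitivity, precision, and specificity
    obtained when every score >= threshold is called positive.
    Assumes 0/1 labels with both classes present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Walk candidate thresholds from strictest to most lenient.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        if tp / n_pos >= target_sensitivity:
            return {
                "threshold": thr,
                "sensitivity": tp / n_pos,
                "precision": tp / (tp + fp),
                "specificity": (n_neg - fp) / n_neg,
            }
    raise ValueError("no threshold reaches the target sensitivity")
```

For example, with ten positives scored 0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.20 and five negatives scored 0.78, 0.50, 0.40, 0.30, 0.10, the function selects a threshold of 0.55, where nine of ten positives and one of five negatives are flagged.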

Conclusions: Using a natural language processing approach in addition to partial cohort preprocessing with a LASSO-based model, we were able to meaningfully improve our ability to identify patients with PAD compared with an approach using structured data alone. This model has potential applications both to interventions targeted at improving patient care and to efficient, large-scale PAD research. Graphic Abstract: A graphic abstract is available for this article.

Keywords: cohort studies; electronic health records; machine learning; natural language processing; peripheral artery disease.

Publication types

Comparative Study
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Validation Study

MeSH terms

Aged
Aged, 80 and over
Amputation, Surgical
Ankle Brachial Index
Data Mining*
Diagnosis, Computer-Assisted*
Electronic Health Records
Endovascular Procedures
Female
Humans
Male
Middle Aged
Natural Language Processing*
Peripheral Arterial Disease / diagnosis*
Peripheral Arterial Disease / diagnostic imaging
Peripheral Arterial Disease / therapy
Predictive Value of Tests
Reproducibility of Results
Vascular Surgical Procedures
