Background: Peripheral artery disease (PAD) is underrecognized, undertreated, and understudied: each of these endeavors requires efficient and accurate identification of patients with PAD. Currently, PAD patient identification relies on diagnosis/procedure codes or lists of patients diagnosed or treated by specific providers in specific locations and ways. The goal of this research was to leverage natural language processing to more accurately identify patients with PAD in an electronic health record system compared with a structured data-based approach.
Methods: The clinical notes from a cohort of 6861 patients in our health system whose PAD status had previously been adjudicated were used to train, test, and validate a natural language processing model using 10-fold cross-validation. The performance of this model was described using the area under the receiver operating characteristic and average precision curves; its performance was quantitatively compared with an administrative data-based least absolute shrinkage and selection operator (LASSO) approach using the DeLong test.
Results: The median (SD) of the area under the receiver operating characteristic curve for the natural language processing model was 0.888 (0.009) versus 0.801 (0.017) for the LASSO-based approach alone (DeLong P<0.0001). The median (SD) of the area under the precision curve was 0.909 (0.008) versus 0.816 (0.012) for the structured data-based approach. When sensitivity was set at 90%, the precision for LASSO was 65% and the machine learning approach was 74%, while the specificity for LASSO was 41% and for the machine learning approach was 62%.
Conclusions: Using a natural language processing approach in addition to partial cohort preprocessing with a LASSO-based model, we were able to meaningfully improve our ability to identify patients with PAD compared with an approach using structured data alone. This model has potential applications to both interventions targeted at improving patient care as well as efficient, large-scale PAD research. Graphic Abstract: A graphic abstract is available for this article.
Keywords: cohort studies; electronic health records; machine learning; natural language processing; peripheral artery disease.