Developing Clinical Prediction Models Using Primary Care Electronic Health Record Data: The Impact of Data Preparation Choices on Model Performance

Hendrikus J A van Os; Jos P Kanning; Marieke J H Wermer; Niels H Chavannes; Mattijs E Numans; Ynte M Ruigrok; Erik W van Zwet; Hein Putter; Ewout W Steyerberg; Rolf H H Groenwold

doi:10.3389/fepid.2022.871630

Front Epidemiol. 2022 Jun 2:2:871630. doi: 10.3389/fepid.2022.871630. eCollection 2022.

Authors

Hendrikus J A van Os¹,²,³, Jos P Kanning⁴, Marieke J H Wermer¹, Niels H Chavannes²,³, Mattijs E Numans³, Ynte M Ruigrok⁴, Erik W van Zwet⁵, Hein Putter⁵, Ewout W Steyerberg⁵, Rolf H H Groenwold⁵,⁶

Affiliations

¹ Department of Neurology, Leiden University Medical Hospital, Leiden, Netherlands.
² National eHealth Living Lab, Leiden University Medical Hospital, Leiden, Netherlands.
³ Department of Public Health & Primary Care, Leiden University Medical Hospital, Leiden, Netherlands.
⁴ Department of Neurology, University Medical Center Utrecht, Utrecht, Netherlands.
⁵ Department of Biomedical Data Sciences, Leiden University Medical Hospital, Leiden, Netherlands.
⁶ Department of Clinical Epidemiology, Leiden University Medical Hospital, Leiden, Netherlands.

Abstract

Objective: To quantify prediction model performance in relation to data preparation choices when using electronic health records (EHR).

Study design and setting: Cox proportional hazards models were developed to predict first-ever main adverse cardiovascular events using Dutch primary care EHR data. In the reference model, the run-in period was 1 year, cardiovascular events were defined using both EHR diagnosis codes and medication codes, and missing values were multiply imputed. We compared data preparation choices regarding (i) length of the run-in period (2- or 3-year run-in); (ii) outcome definition (EHR diagnosis codes only or medication codes only); and (iii) methods for addressing missing values (mean imputation or complete case analysis). Each variation was applied to the derivation set, and its impact was tested in a validation set.
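To make the contrast between two of the missing-data strategies concrete, the following is a minimal sketch on hypothetical toy data. The records, the variable `systolic_bp`, and the helper names are illustrative assumptions and do not come from the study. Multiple imputation, the reference approach, is not shown. The toy data are deliberately constructed so that measurements are missing mostly in patients without an event, which is one possible (assumed) way in which complete case analysis can end up with a higher event rate than the full cohort.

```python
from statistics import mean

# Hypothetical toy records: (systolic_bp or None if not recorded, event indicator).
# Assumption for illustration: values are missing mainly in patients without an event.
records = [
    (120.0, 0), (None, 0), (150.0, 1), (None, 0), (160.0, 1), (130.0, 0),
]

def complete_cases(rows):
    """Complete case analysis: drop every record with a missing predictor."""
    return [r for r in rows if r[0] is not None]

def mean_impute(rows):
    """Mean imputation: replace missing values by the mean of the observed values."""
    fill = mean(x for x, _ in rows if x is not None)
    return [(fill if x is None else x, y) for x, y in rows]

def event_rate(rows):
    """Proportion of records with an event."""
    return sum(y for _, y in rows) / len(rows)

cc = complete_cases(records)
mi = mean_impute(records)

print(len(cc), event_rate(cc))            # 4 0.5
print(len(mi), round(event_rate(mi), 3))  # 6 0.333
```

In this toy example, complete case analysis keeps 4 of 6 records and the event rate rises from one third to one half, whereas mean imputation keeps all records but assigns the same value (140.0) to every missing measurement.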

Results: We included 89,491 patients in whom 6,736 first-ever main adverse cardiovascular events occurred during a median follow-up of 8 years. Outcome definition based only on diagnosis codes led to a systematic underestimation of risk (calibration curve intercept: 0.84; 95% CI: 0.83 to 0.84), while complete case analysis led to overestimation (calibration curve intercept: -0.52; 95% CI: -0.53 to -0.51). Differences in the length of the run-in period showed no relevant impact on calibration and discrimination.
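The sign of a calibration intercept can be illustrated with a simplified sketch. The study used Cox models; the code below instead assumes a binary outcome and estimates the intercept `a` in logit(P(y = 1)) = a + logit(p), with the predicted risk `p` entered as a fixed offset. The function name, the toy data, and the Newton-Raphson solver are illustrative assumptions, not the authors' implementation. A positive intercept means observed risk exceeds predicted risk (underestimation); a negative intercept means the opposite (overestimation).

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def calibration_intercept(y, p, n_iter=50, tol=1e-10):
    """Estimate a in logit(P(y=1)) = a + logit(p) by Newton-Raphson.

    y: observed binary outcomes (0/1); p: predicted risks strictly between 0 and 1.
    """
    offsets = [logit(pi) for pi in p]
    a = 0.0
    for _ in range(n_iter):
        fitted = [expit(a + o) for o in offsets]
        score = sum(yi - fi for yi, fi in zip(y, fitted))  # gradient of the log-likelihood
        info = sum(fi * (1.0 - fi) for fi in fitted)       # negative second derivative
        step = score / info
        a += step
        if abs(step) < tol:
            break
    return a

# Hypothetical toy data: predicted risk 10%, observed event rate 20% (risk underestimated).
under = calibration_intercept([1] * 20 + [0] * 80, [0.1] * 100)

# Hypothetical toy data: predicted risk 20%, observed event rate 10% (risk overestimated).
over = calibration_intercept([1] * 10 + [0] * 90, [0.2] * 100)

print(round(under, 3))  # 0.811
print(round(over, 3))   # -0.811
```

With a constant predicted risk the intercept has a closed form, logit(observed rate) minus logit(predicted risk), which here equals ln(2.25) ≈ 0.811 in absolute value and provides a check on the iterative solution.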

Conclusion: Data preparation choices regarding outcome definition or methods to address missing values can have a substantial impact on the calibration of predictions, hampering reliable clinical decision support. This study further illustrates the urgency of transparent reporting of modeling choices in an EHR data setting.

Keywords: clinical prediction model; data preparation; electronic health records (EHRs); model performance; model transportability; prediction model.