Machine learning computational model to predict lung cancer using electronic medical records

Cancer Epidemiol. 2024 Oct:92:102631. doi: 10.1016/j.canep.2024.102631. Epub 2024 Jul 24.

Abstract

Background: Lung cancer (LC) screening using low-dose computed tomography (CT) is recommended according to standard risk criteria or personalized risk calculators. Machine learning (ML) models that predict disease risk are an emerging approach in medicine for identifying hidden, patient-specific associations.

Materials and methods: Using the tree-based pipeline optimization tool (TPOT), we developed an ML-based model, an ensemble of the Random Forest and XGBoost models, based on known risk factors for LC, as part of a larger trial for ML prediction using electronic medical records and chest CT. We used data from patients with LC and controls (1:2), all aged ≥ 35 years. We developed a model for all LC patients as well as separate models for patients with and without a smoking background. We included age, gender, body mass index (BMI), smoking history, socioeconomic status (SES), history of chronic obstructive pulmonary disease (COPD)/emphysema/chronic bronchitis (CB), interstitial lung disease (ILD)/pulmonary fibrosis (PF), and family history of LC.
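The abstract does not state how the Random Forest and XGBoost components are combined. One common option is soft voting, in which the per-patient probabilities from each model are averaged and then thresholded. The following is a minimal sketch of that idea only, not the authors' pipeline; the function name `soft_vote`, the 0.5 threshold, and all probability values are hypothetical.

```python
from statistics import mean


def soft_vote(prob_lists, threshold=0.5):
    """Average per-patient probabilities across models, then threshold.

    prob_lists: one list of predicted LC probabilities per model,
    all in the same patient order. Returns (averaged, labels).
    """
    n_patients = len(prob_lists[0])
    if any(len(probs) != n_patients for probs in prob_lists):
        raise ValueError("every model must score the same patients")
    # zip(*prob_lists) groups the scores that belong to one patient.
    averaged = [mean(per_patient) for per_patient in zip(*prob_lists)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical outputs of two already-trained models for four patients.
rf_probs = [0.80, 0.30, 0.55, 0.10]   # assumed Random Forest scores
xgb_probs = [0.70, 0.50, 0.35, 0.20]  # assumed XGBoost scores

averaged, labels = soft_vote([rf_probs, xgb_probs])
print(averaged)
print(labels)
```

Averaging probabilities lets a confident model outweigh an uncertain one, which a simple majority vote over hard labels would not do.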

Results: Of the 4076 patients, 1428 (35 %) were in the LC group and 2648 (65 %) were in the control group. For the entire study population, our model achieved an accuracy of 71.2 %, with a sensitivity of 69 % and a positive predictive value (PPV) of 74 %. Higher accuracy was achieved for the two subgroups: 74.8 % (sensitivity 72 %, PPV 76 %) for the smoking cohort and 73.0 % (sensitivity 76 %, PPV 72 %) for the never-smoking cohort. For the entire population and the smoking cohort, COPD/emphysema/CB was the most important contributor, followed by BMI and age, while in the never-smoking cohort, BMI, age and SES were the most important contributors.
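The group proportions above can be checked directly from the reported counts, and the three reported metrics follow the standard confusion-matrix definitions. The sketch below shows both. The confusion-matrix counts in the second part are hypothetical, chosen only to illustrate the formulas; the abstract does not report the underlying counts, and the function names are illustrative.

```python
def proportions(n_cases, n_controls):
    """Share of cases and of controls in the whole study population."""
    total = n_cases + n_controls
    return n_cases / total, n_controls / total


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and PPV from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positives among all cases
        "ppv": tp / (tp + fp),          # true positives among positive calls
    }


# Counts reported in the abstract.
case_share, control_share = proportions(1428, 2648)
print(f"LC group: {case_share:.1%}, controls: {control_share:.1%}")
# prints "LC group: 35.0%, controls: 65.0%"
print(f"controls per case: {2648 / 1428:.2f}")
# prints "controls per case: 1.85"

# Hypothetical counts (NOT from the study), for illustration only.
example = classification_metrics(tp=70, fp=25, tn=75, fn=30)
print(example)
```

The realized ratio of about 1.85 controls per case is close to, but not exactly, the nominal 1:2 design.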

Conclusion: Known risk factors for LC can be used in ML models to predict LC with modest accuracy. Further studies are needed to confirm these results in new patients and to improve model performance.

Keywords: Artificial intelligence; Lung cancer; Machine learning; Prediction; Smoking.

MeSH terms

  • Adult
  • Aged
  • Case-Control Studies
  • Early Detection of Cancer / methods
  • Electronic Health Records* / statistics & numerical data
  • Female
  • Humans
  • Lung Neoplasms* / diagnosis
  • Lung Neoplasms* / epidemiology
  • Machine Learning*
  • Male
  • Middle Aged
  • Risk Assessment / methods
  • Risk Factors
  • Smoking / epidemiology
  • Tomography, X-Ray Computed* / methods