Semi-supervised learning to improve generalizability of risk prediction models

Shengqiang Chi; Xinhang Li; Yu Tian; Jun Li; Xiangxing Kong; Kefeng Ding; Chunhua Weng; Jingsong Li

doi:10.1016/j.jbi.2019.103117

J Biomed Inform. 2019 Apr:92:103117. doi: 10.1016/j.jbi.2019.103117. Epub 2019 Feb 7.

Authors

Shengqiang Chi¹, Xinhang Li², Yu Tian¹, Jun Li³, Xiangxing Kong³, Kefeng Ding³, Chunhua Weng⁴, Jingsong Li⁵

Affiliations

¹ Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China.
² Department of Biomedical Informatics, Columbia University, NY, USA.
³ Department of Surgical Oncology, The Second Affiliated Hospital of Zhejiang University Medical School, Hangzhou, China.
⁴ Department of Biomedical Informatics, Columbia University, NY, USA. Electronic address: cw2384@cumc.columbia.edu.
⁵ Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China. Electronic address: ljs@zju.edu.cn.

PMID: 30738948

Abstract

The utility of a prediction model depends on its generalizability to patients drawn from different but related populations. We explored whether a semi-supervised learning model could improve the generalizability of colorectal cancer (CRC) risk prediction relative to supervised learning methods. Data on 113,141 patients diagnosed with nonmetastatic CRC from 2004 to 2012 were obtained from the Surveillance, Epidemiology, and End Results (SEER) registry for model development, and data on 1,149 patients diagnosed between 2004 and 2011 at the Second Affiliated Hospital, Zhejiang University School of Medicine, were collected for generalizability testing. A clinical prediction model for CRC survival risk was developed using a semi-supervised logistic regression method and validated for discrimination, calibration, generalizability, interpretability and clinical usefulness. Its performance was rigorously compared with that of supervised learning models. The area under the curve of the logistic membership model revealed substantial heterogeneity between the development and validation cohorts, which is typical of generalizability studies of prediction models. Discrimination was good for all models. Calibration on the validation cohort was poor for the supervised learning models but good for the semi-supervised logistic regression model, indicating good generalizability. Clinical usefulness analysis showed that semi-supervised logistic regression can lead to better clinical outcomes than supervised learning methods. These results improve our understanding of the generalizability of different models and provide a reference for predictive model development for clinical decision-making.
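
The heterogeneity check mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a "logistic membership model" is a logistic regression trained to tell development-cohort patients from validation-cohort patients, with its AUC read as a measure of how different the two cohorts are; the abstract does not give these details. All function names are hypothetical, and the cohorts are synthetic Gaussian covariates, not SEER or hospital data.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit an unregularized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def membership_auc(dev, val):
    """Label development rows 0 and validation rows 1, fit the
    membership model, and return its in-sample AUC."""
    X = dev + val
    y = [0] * len(dev) + [1] * len(val)
    w, b = fit_logistic(X, y)
    scores = [b + sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]
    return auc(scores, y)


def synthetic_cohort(n, shift, rng):
    """Synthetic stand-in for a cohort: two Gaussian covariates with a mean shift."""
    return [[rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)] for _ in range(n)]
```

Under this reading, an AUC near 0.5 suggests the two cohorts have similar case mix, while an AUC well above 0.5 suggests the model can easily separate them, that is, substantial heterogeneity. The AUC here is computed in-sample and is therefore somewhat optimistic; a cross-validated estimate would be a more careful choice.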

Keywords: Clinical usefulness; Colorectal cancer (CRC); External validation; Generalizability; Prediction model; Semi-supervised learning (SSL).

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Adolescent
Adult
Aged
Aged, 80 and over
Child
Colorectal Neoplasms / diagnosis*
Colorectal Neoplasms / mortality*
Diagnosis, Computer-Assisted
Female
Humans
Male
Middle Aged
Models, Statistical*
Prognosis
Risk
Supervised Machine Learning*
Survival Analysis
Young Adult