Comparison of methods for handling missing data on immunohistochemical markers in survival analysis of breast cancer

A M G Ali; S-J Dawson; F M Blows; E Provenzano; I O Ellis; L Baglietto; D Huntsman; C Caldas; P D Pharoah

doi:10.1038/sj.bjc.6606078

Br J Cancer. 2011 Feb 15;104(4):693-9. doi: 10.1038/sj.bjc.6606078. Epub 2011 Jan 25.

Authors

A M G Ali¹, S-J Dawson, F M Blows, E Provenzano, I O Ellis, L Baglietto, D Huntsman, C Caldas, P D Pharoah

Affiliation

¹ Strangeways Research Laboratory, Department of Public Health and Primary Care, University of Cambridge, Wort's Causeway, Cambridge CB1 8RN, UK. alaa@srl.cam.ac.uk

Abstract

Background: Tissue micro-arrays (TMAs) are increasingly used to generate data on the molecular phenotype of tumours in clinical epidemiology studies, such as studies of disease prognosis. However, TMA data are particularly prone to missingness. A variety of methods for dealing with missing data are available, but the validity of each approach depends on the structure of the missing data, and there are few empirical studies dealing with missing data from molecular pathology. The purpose of this study was to investigate the results of four commonly used approaches to handling missing data in a large, multi-centre study of the molecular pathological determinants of prognosis in breast cancer.

Patients and methods: We pooled data from over 11,000 cases of invasive breast cancer from five studies that collected information on seven prognostic indicators together with survival time data. We compared the results of a multi-variate Cox regression using four approaches to handling missing data: complete case analysis (CCA), mean substitution (MS), multiple imputation without inclusion of the outcome (MI-) and multiple imputation with inclusion of the outcome (MI+). We also performed an analysis in which missing data were simulated under different assumptions and the results of the four methods were compared.
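The two simplest of the four approaches, CCA and MS, can be illustrated with a minimal sketch. The marker matrix and the function names `complete_case` and `mean_substitute` are hypothetical and are not taken from the study; missing values are assumed to be coded as NaN.

```python
import numpy as np

def complete_case(X):
    """Complete case analysis: keep only rows with no missing (NaN) values."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep]

def mean_substitute(X):
    """Mean substitution: replace each NaN with the observed mean of its column."""
    col_means = np.nanmean(X, axis=0)
    filled = X.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled

# Hypothetical data: rows are cases, columns are two marker scores.
X = np.array([[1.0, 0.0],
              [np.nan, 1.0],
              [3.0, np.nan],
              [2.0, 1.0]])

cc = complete_case(X)    # drops the two cases that have a missing marker
ms = mean_substitute(X)  # keeps all four cases, filling gaps with column means
```

CCA discards every case with any missing marker, which reduces the sample size, whereas MS retains all cases but fills each gap with a single value.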

Results: Over half the cases had missing data on at least one of the seven variables and 11% had missing data on four or more. The multi-variate hazard ratio estimates based on multiple imputation models were very similar to those derived after using MS, with similar standard errors. Hazard ratio estimates based on the CCA were only slightly different, but they were less precise, with larger standard errors. However, in data simulated to be missing completely at random (MCAR) or missing at random (MAR), estimates for MI+ were least biased and most accurate, whereas estimates for CCA were most biased and least accurate.
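The distinction between MCAR and MAR can be made concrete with a small simulation sketch. The abstract does not describe how missingness was simulated, so everything here is an assumption for illustration: the variable names, the fully observed covariate `grade`, and the missingness probabilities are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical fully observed covariate (values 1, 2, 3) and a marker score.
grade = rng.integers(1, 4, size=n)
marker = rng.normal(size=n)

# MCAR: every case has the same 30% chance of a missing marker.
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of a missing marker depends only on the observed grade.
p_missing_by_grade = np.array([0.1, 0.3, 0.5])
mar_mask = rng.random(n) < p_missing_by_grade[grade - 1]

marker_mcar = np.where(mcar_mask, np.nan, marker)
marker_mar = np.where(mar_mask, np.nan, marker)

# Observed missingness rate within each grade under the MAR mechanism.
mar_rates = [mar_mask[grade == g].mean() for g in (1, 2, 3)]
```

Under MCAR the missingness rate is roughly the same in every grade, while under this MAR mechanism it rises with grade even though it does not depend on the marker value itself.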

Conclusion: In this study, empirical results from analyses using CCA, MS, MI- and MI+ were similar, although results from CCA were less precise. The results from simulations suggest that, in general, MI+ is likely to be the best approach. Given the ease of implementing MI in standard statistical software, the results of MI+ and CCA should be compared in any multi-variate analysis where missing data are a problem.
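When MI results are set beside those from CCA, the estimates from the imputed datasets first have to be combined into one estimate and standard error. The abstract does not state how this was done; the sketch below assumes the conventional Rubin's rules, and the function name `pool_rubin` and the input numbers are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates (e.g. log hazard ratios) by Rubin's rules.

    estimates: point estimate from each of the m imputed datasets
    variances: squared standard error from each of the m imputed datasets
    Returns the pooled estimate and its total variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()          # pooled point estimate
    within = variances.mean()          # average within-imputation variance
    between = estimates.var(ddof=1)    # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total

# Hypothetical log hazard ratios and variances from m = 3 imputed datasets.
pooled, total_var = pool_rubin([0.5, 0.6, 0.7], [0.01, 0.01, 0.01])
pooled_se = np.sqrt(total_var)
```

The between-imputation term is what lets the pooled standard error reflect uncertainty about the missing values, which a single filled-in dataset such as MS does not capture.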

Publication types

Comparative Study
Meta-Analysis
Research Support, Non-U.S. Gov't

MeSH terms

Bias
Biomarkers, Tumor / analysis
Biomarkers, Tumor / metabolism*
Breast Neoplasms / diagnosis
Breast Neoplasms / epidemiology
Breast Neoplasms / metabolism*
Breast Neoplasms / mortality*
Carcinoma / diagnosis
Carcinoma / epidemiology
Carcinoma / metabolism*
Carcinoma / mortality*
Data Interpretation, Statistical*
Female
Humans
Immunohistochemistry / methods
Immunohistochemistry / statistics & numerical data
Middle Aged
Multicenter Studies as Topic
Prognosis
Reproducibility of Results
Research Design
Survival Analysis
Tissue Array Analysis / statistics & numerical data

Substances

Biomarkers, Tumor
