A novel MissForest-based missing values imputation approach with recursive feature elimination in medical applications

BMC Med Res Methodol. 2024 Nov 8;24(1):269. doi: 10.1186/s12874-024-02392-2.

Abstract

Background: Missing values in datasets present significant challenges for data analysis, particularly in the medical field where data accuracy is crucial for patient diagnosis and treatment. Although MissForest (MF) has demonstrated efficacy in imputation research and recursive feature elimination (RFE) has proven effective in feature selection, the potential for enhancing MF through RFE integration remains unexplored.

Methods: This study introduces a novel imputation method, "recursive feature elimination-MissForest" (RFE-MF), designed to enhance imputation quality by reducing the impact of irrelevant features. A comparative analysis is conducted between RFE-MF and four classical imputation methods: mean/mode, k-nearest neighbors (kNN), multiple imputation by chained equations (MICE), and MF. The comparison is carried out across ten medical datasets containing both numerical and mixed data types. Different missing data rates, ranging from 10 to 50%, are evaluated under the missing completely at random (MCAR) mechanism. The performance of each method is assessed using two evaluation metrics: normalized root mean squared error (NRMSE) and predictive fidelity criterion (PFC). Additionally, paired samples t-tests are employed to analyze the statistical significance of differences among the outcomes.

Results: The findings indicate that RFE-MF demonstrates superior performance across the majority of datasets when compared to four classical imputation methods (mean/mode, kNN, MICE, and MF). Notably, RFE-MF consistently outperforms the original MF, irrespective of variable type (numerical or categorical). Mean/mode imputation exhibits consistent performance across various scenarios. Conversely, the efficacy of kNN imputation fluctuates in relation to varying missing data rates.

Conclusion: This study demonstrates that RFE-MF holds promise as an effective imputation method for medical datasets, providing a novel approach to addressing missing data challenges in medical applications.

Keywords: Feature selection; Medical datasets; MissForest; Missing value imputation; Recursive feature elimination.

MeSH terms

  • Algorithms*
  • Data Accuracy
  • Data Interpretation, Statistical
  • Humans
  • Models, Statistical