Machine learning models with innovative outlier detection techniques for predicting heavy metal contamination in soils

J Hazard Mater. 2024 Nov 19:481:136536. doi: 10.1016/j.jhazmat.2024.136536. Online ahead of print.

Abstract

Machine learning (ML) models for predicting heavy metals can give inconsistent outputs owing to dataset outliers, which affect model reliability and accuracy. A comprehensive technique that combines ML and advanced statistical methods was applied to assess the effects of data outliers on ML models. Ten ML models with three outlier detection methods were used to predict Cr, Ni, Cd, and Pb in Narayanganj soils. XGBoost combined with density-based spatial clustering of applications with noise (DBSCAN) improved model performance as measured by R². The R² values for Cr, Ni, Cd, and Pb increased by 11.11 %, 6.33 %, 14.47 %, and 5.68 %, respectively, indicating that outliers affected the models' heavy metal (HM) predictions. Based on feature importance, soil factors influenced Cr (80 %), Ni (72.61 %), Cd (53.35 %), and Pb (63.47 %) concentrations. Contamination factor prediction showed considerable contamination for Cr, Ni, and Cd. Local indicators of spatial association (LISA) identified Cd (55.4 %), Cr (49.3 %), and Pb (47.3 %) as significant pollutants (p < 0.05). Moran's I index values for Cr, Ni, Cd, and Pb were 0.65, 0.58, 0.60, and 0.66, respectively, indicating strong positive spatial autocorrelation and clusters with similar contamination. Overall, this work assessed the influence of data outliers on ML models for soil HM contamination prediction and identified critical areas that require prompt conservation measures.
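
To illustrate how a positive Moran's I such as those reported above reflects clustering of similar values, the following is a minimal sketch of the global Moran's I statistic in plain Python. The abstract does not state which spatial weights were used, so the chain-adjacency weights, the helper names (`morans_i`, `chain_weights`), and the sample values are all hypothetical and are not the study's data or code.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n spatial weight matrix.

    I = (n / S0) * sum_ij w_ij * (x_i - mean) * (x_j - mean) / sum_i (x_i - mean)^2
    where S0 is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    den = sum(d * d for d in dev)
    if s0 == 0 or den == 0:
        raise ValueError("weights must not all be zero and values must vary")
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    return (n / s0) * (num / den)


def chain_weights(n):
    """Hypothetical weights: sites lie on a line, each adjacent to its neighbours."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1.0
        w[i + 1][i] = 1.0
    return w


# Illustrative (made-up) concentrations at four sites along a transect.
w = chain_weights(4)
clustered = morans_i([1.0, 2.0, 3.0, 4.0], w)    # neighbours are similar
alternating = morans_i([1.0, 4.0, 1.0, 4.0], w)  # neighbours are dissimilar
print(round(clustered, 3), round(alternating, 3))
```

In this toy case the smoothly varying transect yields a positive index and the alternating one a negative index, which is the sense in which values near 0.6 indicate that sites with similar contamination tend to be neighbours.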

Keywords: DBSCAN; Heavy metals; LISA; Machine learning; Outliers; XGBoost model.