Machine learning-based strategies for improving healthcare data quality: an evaluation of accuracy, completeness, and reusability.
Further Information
Healthcare data quality is a critical factor in clinical decision-making, diagnostic accuracy, and the overall efficacy of healthcare systems. This study addresses key challenges such as missing values and anomalies in healthcare datasets, which can result in misdiagnoses and inefficient resource use. The objective is to develop and evaluate a machine learning-based strategy to improve healthcare data quality, with a focus on three core dimensions: accuracy, completeness, and reusability. A publicly available diabetes dataset comprising 768 records and 9 variables was used. The methodology involved a comprehensive data preprocessing workflow, including data acquisition, cleaning, and exploratory analysis using established Python tools. Missing values were addressed using K-nearest neighbors imputation, while anomaly detection was performed using ensemble techniques. Principal Component Analysis (PCA) and correlation analysis were applied to identify key predictors of diabetes, such as Glucose, BMI, and Age. The results showed significant improvements in data completeness (from 90.57% to nearly 100%), better accuracy by mitigating anomalies, and enhanced reusability for downstream machine learning tasks. In predictive modeling, Random Forest outperformed LightGBM, achieving an accuracy of 75.3% and an AUC of 0.83. The process was fully documented, and reproducibility tools were integrated to ensure the methodology could be replicated and extended. These findings demonstrate the potential of machine learning to support robust data quality improvement frameworks in healthcare, ultimately contributing to better clinical outcomes and predictive capabilities.
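The completeness measurement and the K-nearest neighbors imputation step described above can be illustrated with a minimal sketch. It is not the author's code: the function names, the use of None as the missing-value marker, the choice of k, and the toy rows are all assumptions made for illustration, and the abstract does not say how completeness was counted. If it was counted cell by cell, a table of 768 records and 9 variables has 6,912 cells, so 90.57% completeness would correspond to roughly 652 missing cells.

```python
import math


def completeness(rows):
    """Share of cells that are not missing (None marks a missing cell)."""
    total = sum(len(row) for row in rows)
    present = sum(value is not None for row in rows for value in row)
    return present / total


def nan_aware_distance(a, b):
    """Euclidean distance over the features present in both rows,
    rescaled to compensate for the features that had to be skipped."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=2):
    """Fill each missing cell with the mean of that column over the
    k nearest rows in which the column is observed."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row that has column j observed.
            donors = [
                (nan_aware_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for d, v in donors[:k] if d != math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy rows invented for illustration only (hypothetical columns:
# glucose, BMI, age). They are NOT taken from the diabetes dataset.
rows = [
    [100.0, 25.0, 30.0],
    [110.0, 27.0, 35.0],
    [None, 26.0, 32.0],
    [150.0, None, 60.0],
    [160.0, 35.0, 62.0],
]

before = completeness(rows)
filled = knn_impute(rows, k=2)
after = completeness(filled)
print(f"completeness before: {before:.2%}, after: {after:.2%}")
```

The sketch leaves features unscaled, so columns with larger numeric ranges dominate the distance; in practice the features would usually be standardized first. scikit-learn's KNNImputer offers a library implementation of the same idea, and whether the study used it is not stated in the abstract.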
(Copyright © 2025 Jarmakovica.)
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.