Title:
Machine learning-based strategies for improving healthcare data quality: an evaluation of accuracy, completeness, and reusability.
Authors:
Jarmakovica A; Faculty of Computer Science, Information Technology and Energy, Riga Technical University, Riga, Latvia.
Source:
Frontiers in artificial intelligence [Front Artif Intell] 2025 Jul 21; Vol. 8, pp. 1621514. Date of Electronic Publication: 2025 Jul 21 (Print Publication: 2025).
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Frontiers Media SA Country of Publication: Switzerland NLM ID: 101770551 Publication Model: eCollection Cited Medium: Internet ISSN: 2624-8212 (Electronic) Linking ISSN: 26248212 NLM ISO Abbreviation: Front Artif Intell Subsets: PubMed not MEDLINE
Imprint Name(s):
Original Publication: Lausanne, Switzerland : Frontiers Media SA, [2018]-
Contributed Indexing:
Keywords: accuracy; completeness; data quality; healthcare data analysis; machine learning; reusability
Entry Date(s):
Date Created: 20250805 Latest Revision: 20250807
Update Code:
20250807
PubMed Central ID:
PMC12319021
DOI:
10.3389/frai.2025.1621514
PMID:
40761812
Database:
MEDLINE

Further Information

Healthcare data quality is a critical factor in clinical decision-making, diagnostic accuracy, and the overall efficacy of healthcare systems. This study addresses key challenges such as missing values and anomalies in healthcare datasets, which can result in misdiagnoses and inefficient resource use. The objective is to develop and evaluate a machine learning-based strategy to improve healthcare data quality, with a focus on three core dimensions: accuracy, completeness, and reusability. A publicly available diabetes dataset comprising 768 records and 9 variables was used. The methodology involved a comprehensive data preprocessing workflow, including data acquisition, cleaning, and exploratory analysis using established Python tools. Missing values were addressed using K-nearest neighbors imputation, while anomaly detection was performed using ensemble techniques. Principal Component Analysis (PCA) and correlation analysis were applied to identify key predictors of diabetes, such as Glucose, BMI, and Age. The results showed significant improvements in data completeness (from 90.57% to nearly 100%), better accuracy by mitigating anomalies, and enhanced reusability for downstream machine learning tasks. In predictive modeling, Random Forest outperformed LightGBM, achieving an accuracy of 75.3% and an AUC of 0.83. The process was fully documented, and reproducibility tools were integrated to ensure the methodology could be replicated and extended. These findings demonstrate the potential of machine learning to support robust data quality improvement frameworks in healthcare, ultimately contributing to better clinical outcomes and predictive capabilities.
(Copyright © 2025 Jarmakovica.)
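
The abstract names K-nearest neighbors imputation and a completeness percentage but does not give the procedure. The following is a minimal sketch of both ideas, not the author's implementation: the toy rows, the column meanings (Glucose, BMI, Age), the choice of k=2, and the helper names completeness and knn_impute are all assumptions made for illustration. In practice a library routine such as scikit-learn's KNNImputer would normally be used instead, and features would be scaled before distances are computed.

```python
import math


def completeness(rows):
    """Percentage of cells that are not missing (None)."""
    cells = [value for row in rows for value in row]
    return 100.0 * sum(value is not None for value in cells) / len(cells)


def knn_impute(rows, k=2):
    """Fill each None with the column mean over the k nearest donor rows.

    Distance is Euclidean over the columns both rows have observed,
    averaged over the number of shared columns. Distances are always
    computed on the original (unfilled) rows. Features are NOT scaled
    here; that is a simplification of this sketch.
    """
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A donor must be a different row with column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [donor for _, donor in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical toy records: [Glucose, BMI, Age]; None marks a missing value.
rows = [
    [148.0, 33.6, 50.0],
    [85.0, 26.6, 31.0],
    [None, 23.3, 32.0],
    [89.0, 28.1, 21.0],
    [137.0, None, 33.0],
]

filled = knn_impute(rows, k=2)
print(f"completeness before: {completeness(rows):.2f}%")
print(f"completeness after:  {completeness(filled):.2f}%")
```

Completeness is defined here simply as the share of non-missing cells; the study may use a different definition, so the figures printed by this sketch are not comparable to the reported 90.57%.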

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.