Comparing Multiple Imputation Methods to Address Missing Patient Demographics in Immunization Information Systems: Retrospective Cohort Study.
Further Information
Background: Immunization Information Systems (IIS) and surveillance data are essential for public health interventions and programming; however, missing data are often a challenge, potentially introducing bias and impacting the accuracy of vaccine coverage assessments, particularly in addressing disparities.
Objective: This study aimed to evaluate the performance of 3 multiple imputation methods, Stata's (StataCorp LLC) multiple imputation using chained equations (MICE), scikit-learn's Iterative-Imputer, and Python's miceforest package, in managing missing race and ethnicity data in large-scale surveillance datasets. We compared the methods on how well they preserved demographic distributions and on their computational efficiency, and we performed G-tests on contingency tables to obtain likelihood ratio statistics assessing the association between race and ethnicity and flu vaccination status.
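As a rough illustration of the G-test mentioned above, the following minimal sketch computes the likelihood ratio statistic G = 2 Σ O ln(O/E) for a contingency table using only the Python standard library. The function name `g_test` and the example counts are hypothetical and are not taken from the study; a P value would come from a chi-square distribution with the returned degrees of freedom.

```python
import math


def g_test(table):
    """Return (G, degrees of freedom) for a 2D table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            # Cells with zero observed count contribute nothing to G.
            if observed > 0:
                g += 2.0 * observed * math.log(observed / expected)

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return g, dof


# Hypothetical counts: rows are two demographic groups,
# columns are vaccinated / not vaccinated.
counts = [[30, 70], [45, 55]]
g_stat, dof = g_test(counts)
print(f"G = {g_stat:.3f}, df = {dof}")
```

With SciPy available, `scipy.stats.chi2_contingency(table, lambda_="log-likelihood")` computes the same statistic along with a P value; note that it applies a continuity correction to 2x2 tables by default.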
Methods: In this retrospective cohort study, we analyzed 2021-2022 flu vaccination and demographic data from the West Virginia Immunization Information System (N=2,302,036), in which race was missing for 15% of records and ethnicity for 34%. MICE, Iterative-Imputer, and miceforest were each used to impute the missing variables, generating 15 imputed datasets per method. Computational efficiency, demographic distribution preservation, and spatial clustering patterns were assessed using G-statistics.
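The idea of generating several imputed datasets per method can be sketched as follows with scikit-learn's `IterativeImputer`. This is only an illustration under stated assumptions, not the study's pipeline: the data are synthetic numeric columns, the helper `multiply_impute` is hypothetical, and real race and ethnicity fields are categorical and would need encoding before and decoding after imputation.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def multiply_impute(X, n_datasets=15, seed=0):
    """Return a list of n_datasets completed copies of X (hypothetical helper)."""
    datasets = []
    for m in range(n_datasets):
        # sample_posterior=True draws imputations from the predictive
        # distribution, so each seed yields a different completed dataset.
        imputer = IterativeImputer(
            sample_posterior=True, max_iter=10, random_state=seed + m
        )
        datasets.append(imputer.fit_transform(X))
    return datasets


# Synthetic example: the third column depends on the first two.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Remove roughly 15% of the third column at random.
mask = rng.random(200) < 0.15
X_missing = X.copy()
X_missing[mask, 2] = np.nan

imputed = multiply_impute(X_missing, n_datasets=15)
print(len(imputed), imputed[0].shape)
```

Each completed dataset would then be analyzed separately and the results pooled, which is what distinguishes multiple imputation from a single fill-in.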
Results: After imputation, an additional 780,339 observations were obtained compared with complete case analysis. All imputation methods exhibited significant spatial clustering for race imputation (G-statistics: MICE=26,452.7, Iterative-Imputer=128,280.3, Miceforest=26,891.5; P<.001), while ethnicity imputation showed variable clustering patterns (G-statistics: MICE=1142.2, Iterative-Imputer=1.7, Miceforest=2185.0; P: MICE<.001, Iterative-Imputer=1.7, Miceforest<.001). MICE and miceforest best preserved the proportional distribution of demographics. Computational efficiency varied, with MICE requiring 14 hours, Iterative Imputer 2 minutes, and miceforest 10 minutes for 15 imputations. Postimputation estimates indicated a 0.87%-18% reduction in stratified flu vaccination coverage rates. Overall estimated flu vaccination rates decreased from 26% to 19% after imputations.
Conclusions: Both MICE and miceforest offer flexible and reliable approaches for imputing missing demographic data while mitigating bias compared with Iterative-Imputer. Our results also highlight that the choice of imputation method can profoundly affect research findings. Although MICE and miceforest had better effect sizes and reliability, MICE was far more computationally expensive and time-consuming, limiting its use in large surveillance datasets. Miceforest can use cloud-based computing, which further enhances efficiency by offloading resource-intensive tasks, enabling parallel execution, and minimizing processing delays. The significant decrease in vaccination coverage estimates illustrates how incomplete or missing data can obscure real disparities. Our findings support the regular application of imputation methods in immunization surveillance to improve health equity evaluations and shape targeted public health interventions and programming.
(© Sara Brown, Ousswa Kudia, Kaye Kleine, Bryndan Kidd, Robert Wines, Nathanael Meckes. Originally published in JMIR Public Health and Surveillance (https://publichealth.jmir.org).)