Result: Semantic classification of Indonesian consumer health questions.

Title:
Semantic classification of Indonesian consumer health questions.
Authors:
Hanami RN; Faculty of Computer Science, Universitas Indonesia, Kampus UI, Depok, 16424, West Java, Indonesia.
Mahendra R; Faculty of Computer Science, Universitas Indonesia, Kampus UI, Depok, 16424, West Java, Indonesia.
Wicaksono AF; Faculty of Computer Science, Universitas Indonesia, Kampus UI, Depok, 16424, West Java, Indonesia. alfan@cs.ui.ac.id.
Source:
Journal of biomedical semantics [J Biomed Semantics] 2025 Jul 28; Vol. 16 (1), pp. 13. Date of Electronic Publication: 2025 Jul 28.
Publication Type:
Journal Article; Research Support, Non-U.S. Gov't
Language:
English
Journal Info:
Publisher: Biomed Central; Country of Publication: England; NLM ID: 101531992; Publication Model: Electronic; Cited Medium: Internet; ISSN: 2041-1480 (Electronic); NLM ISO Abbreviation: J Biomed Semantics; Subsets: MEDLINE
Imprint Name(s):
Original Publication: [London] : Biomed Central
Grant Information:
NKB-004/UN2.F11.D/HKP.05.00/2023 Faculty of Computer Science, Universitas Indonesia
Contributed Indexing:
Keywords: Consumer health question-answering system; Consumer health questions; Semantic annotation scheme; Semantic type classification; Text mining
Entry Date(s):
Date Created: 20250728; Date Completed: 20250729; Latest Revision: 20250818
Update Code:
20250821
PubMed Central ID:
PMC12302743
DOI:
10.1186/s13326-025-00334-5
PMID:
40721829
Database:
MEDLINE

Further Information

Purpose: Online consumer health forums serve as a way for the public to connect with medical professionals. While these medical forums offer a valuable service, online Question Answering (QA) forums can struggle to deliver timely answers due to the limited number of available healthcare professionals. One way to solve this problem is by developing an automatic QA system that can provide patients with quicker answers. One key component of such a system could be a module for classifying the semantic type of a question. This would allow the system to understand the patient's intent and route them towards the relevant information.
Methods: This paper proposes a novel two-step approach to address the challenge of semantic type classification in Indonesian consumer health questions. We acknowledge the scarcity of Indonesian health domain data, a hurdle for machine learning models. To address this gap, we first introduce a novel corpus of annotated Indonesian consumer health questions. Second, we utilize this newly created corpus to build and evaluate a data-driven predictive model for classifying question semantic types. To enhance the trustworthiness and interpretability of the model's predictions, we employ an explainable model framework, LIME. This framework facilitates a deeper understanding of the role played by word-based features in the model's decision-making process. Additionally, it empowers us to conduct a comprehensive bias analysis, allowing for the detection of "semantic bias", where words with no inherent association with a specific semantic type disproportionately influence the model's predictions.
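The idea of scoring how much each word drives a prediction can be sketched in a few lines. The example below is not LIME itself, which fits a local surrogate model on many random perturbations; it uses a simpler leave-one-word-out occlusion as a stand-in. The training sentences, the TREATMENT label, and all function names are invented for illustration and are not taken from the corpus or the paper.

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented for illustration (not from the paper's corpus).
# DIAGNOSIS is named in the abstract; TREATMENT is a hypothetical second label.
TRAIN = [
    ("apa gejala kanker paru", "DIAGNOSIS"),
    ("apakah saya terkena depresi", "DIAGNOSIS"),
    ("obat apa untuk sakit kepala", "TREATMENT"),
    ("bagaimana cara mengobati batuk", "TREATMENT"),
]

def train_nb(examples):
    """Fit a multinomial Naive Bayes model (counts only; smoothing is applied later)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict_proba(model, tokens):
    """Return a dict mapping each label to its posterior probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        score = math.log(n / total)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    # Normalise with the log-sum-exp trick for numerical stability.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

def word_importance(model, text, label):
    """Drop each word in turn and record how much P(label) falls.

    Scores are keyed by token, so a repeated word keeps only its last score.
    """
    tokens = text.split()
    base = predict_proba(model, tokens)[label]
    scores = {}
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        scores[tok] = base - predict_proba(model, reduced)[label]
    return scores

model = train_nb(TRAIN)
importance = word_importance(model, "apakah ini gejala kanker", "DIAGNOSIS")
for word, delta in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {delta:+.3f}")
```

A word such as "kanker" that earns a positive score here only because it co-occurs with one label in the toy data is the kind of case the abstract calls "semantic bias".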
Results: The annotation process revealed moderate agreement between expert annotators. In addition, not all words with high LIME probability could be considered true characteristics of a question type. This suggests a potential bias in the data used and the machine learning models themselves. Notably, XGBoost, Naïve Bayes, and MLP models exhibited a tendency to predict questions containing the words "kanker" (cancer) and "depresi" (depression) as belonging to the DIAGNOSIS category. In terms of prediction performance, Perceptron and XGBoost emerged as the top-performing models, achieving the highest weighted average F1 scores across all input scenarios and weighting factors. Naïve Bayes performed best after balancing the data with Borderline SMOTE, indicating its promise for handling imbalanced datasets.
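For readers unfamiliar with the metric, the weighted average F1 score reported above is the per-class F1 averaged with weights proportional to each class's number of true instances. The sketch below computes it from scratch; the label lists are made up, TREATMENT is a hypothetical label, and the abstract does not state which tooling the authors used.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)  # number of true instances per class
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is at least n >= 1 here.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += n * f1
    return total / len(y_true)

# Made-up labels for illustration only.
y_true = ["DIAGNOSIS", "DIAGNOSIS", "DIAGNOSIS", "TREATMENT"]
y_pred = ["DIAGNOSIS", "DIAGNOSIS", "TREATMENT", "TREATMENT"]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))  # 0.7667
```

Because the weights follow class support, a model that does well on the majority class can score highly even when minority classes are poorly predicted, which is one reason resampling methods such as Borderline SMOTE are evaluated alongside it.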
Conclusion: We constructed a corpus of query semantics in the domain of Indonesian consumer health, containing 964 questions annotated with their corresponding semantic types. This corpus served as the foundation for building a predictive model. We further investigated the impact of disease-biased words on model performance. These words exhibited high LIME scores, yet lacked association with a specific semantic type. We trained models using datasets with and without these biased words and found no significant difference in model performance between the two scenarios, suggesting that the models might possess an ability to mitigate the influence of such bias during the learning process.
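The with/without comparison described above amounts to training on two versions of the same questions, one with the suspect words filtered out. A minimal sketch of that filtering step follows; the function name and sample questions are hypothetical, and the word list contains only the two examples named in the abstract, not the full list used in the study.

```python
def remove_bias_words(question, bias_words):
    """Return the question with every token found in bias_words dropped."""
    kept = [tok for tok in question.split() if tok.lower() not in bias_words]
    return " ".join(kept)

# Only the two words named in the abstract; the study's full list is not given here.
BIAS_WORDS = {"kanker", "depresi"}

# Invented sample questions for illustration.
questions = ["apa gejala kanker paru", "obat apa untuk depresi"]
filtered = [remove_bias_words(q, BIAS_WORDS) for q in questions]
print(filtered)  # ['apa gejala paru', 'obat apa untuk']
```

The same model would then be trained once on `questions` and once on `filtered`, and the two sets of scores compared.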
(© 2025. The Author(s).)

Declarations. Ethics approval and consent to participate: Not applicable. Competing interests: The authors declare the following potential competing interests. Rahmad Mahendra is currently a PhD student at the School of Computing Technologies at RMIT University, working on a research project at the ARC Training Centre in Cognitive Computing for Medical Technologies. He is also affiliated with the School of Computing and Information Systems, the University of Melbourne.