Result: Assessing Large Language Models in Building a Structured Dataset From AskDocs Subreddit Data: Methodological Study.

Title:
Assessing Large Language Models in Building a Structured Dataset From AskDocs Subreddit Data: Methodological Study.
Authors:
Snell Q; Brigham Young University, 3361 TMCB, Provo, UT, 84602, United States, 1 8014225098., Westhoff C; Brigham Young University, 3361 TMCB, Provo, UT, 84602, United States, 1 8014225098., Westhoff J; University of Nevada, Reno, Reno, NV, United States., Low E; Brigham Young University, 3361 TMCB, Provo, UT, 84602, United States, 1 8014225098., Hanson CL; Brigham Young University, 3361 TMCB, Provo, UT, 84602, United States, 1 8014225098., Tass ESN; Brigham Young University, 3361 TMCB, Provo, UT, 84602, United States, 1 8014225098.
Source:
Journal of medical Internet research [J Med Internet Res] 2025 Oct 22; Vol. 27, pp. e74094. Date of Electronic Publication: 2025 Oct 22.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: JMIR Publications Country of Publication: Canada NLM ID: 100959882 Publication Model: Electronic Cited Medium: Internet ISSN: 1438-8871 (Electronic) Linking ISSN: 14388871 NLM ISO Abbreviation: J Med Internet Res Subsets: MEDLINE
Imprint Name(s):
Publication: <2011- > : Toronto : JMIR Publications
Original Publication: [Pittsburgh, PA? : s.n., 1999-
Contributed Indexing:
Keywords: Reddit; artificial intelligence; data extraction; large language models; unstructured text analysis
Entry Date(s):
Date Created: 20251022 Date Completed: 20251022 Latest Revision: 20251027
Update Code:
20251027
PubMed Central ID:
PMC12543290
DOI:
10.2196/74094
PMID:
41124662
Database:
MEDLINE

More Information

Background: In an era marked by a growing reliance on digital platforms for health care consultation, the subreddit r/AskDocs has emerged as a pivotal forum. However, the vast, unstructured nature of forum data presents a formidable challenge; the extraction and meaningful analysis of such data require advanced tools that can navigate the complexities of language and context inherent in user-generated content. The emergence of large language models (LLMs) offers new tools for the extraction of health-related content from unstructured text found in social media platforms such as Reddit.
Objective: This methodological study aimed to evaluate the use of LLMs to systematically transform the rich, unstructured textual data from the AskDocs subreddit into a structured dataset, an approach that aligns more closely with human cognitive processes than traditional data extraction methods.
Methods: Human annotators and LLMs were used to extract data from 2800 randomly sampled r/AskDocs subreddit posts. For human annotation, at least 2 medical students extracted demographic information, type of inquiry (diagnosis, symptom, or treatment), proxy relationship, chronic condition, health care consultation status, and primary focus topic. For LLM data extraction, specially engineered prompts were created using JavaScript Object Notation and few-shot prompting. Prompts were used to query several state-of-the-art LLMs (eg, Llama 3, Gemma, and GPT). Cohen κ was calculated across all human annotators, with this dataset serving as the gold standard for comparison with LLM data extraction. A high degree of human annotator reliability was observed for the coding of demographic information. Lower reliability was seen in coding the health-related content of the posts.
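For readers unfamiliar with the agreement statistic named in the Methods, the following is a minimal sketch of Cohen κ for two annotators, computed as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The function name `cohen_kappa` and the example labels are hypothetical and are not taken from the study; the record does not describe how the authors implemented or aggregated the statistic.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' marginal label rates,
    # summed over labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Assumption: kappa is undefined (0/0) when both annotators use one
        # identical label throughout; this sketch reports 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical inquiry-type labels for six posts (illustrative only).
annotator_1 = ["diagnosis", "symptom", "treatment", "diagnosis", "symptom", "diagnosis"]
annotator_2 = ["diagnosis", "symptom", "diagnosis", "diagnosis", "treatment", "diagnosis"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.429
```

In this toy example the annotators agree on 4 of 6 posts (p_o ≈ 0.667), but because "diagnosis" dominates both label sets, chance agreement is already about 0.417, which pulls κ down to roughly 0.43.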
Results: The highest performance scores compared with the gold standard were achieved by Llama 3 70B with 7 few-shot prompt examples (average accuracy=87.4) and GPT-4 with 2 few-shot prompt examples (average accuracy=87.4). Llama 3 70B excelled in coding health-related content, while GPT-4 performed better in coding demographic content from unstructured posts.
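As an illustration of how an "average accuracy" against a gold standard could be computed, the sketch below scores each extracted field by exact match and then takes the unweighted mean across fields. This is an assumption: the record does not state how the reported averages were calculated or whether they are percentages. The function `field_accuracy`, the field names, and the sample records are all hypothetical.

```python
def field_accuracy(gold, predicted, fields):
    """Return per-field exact-match accuracy (%) and the unweighted mean."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty record lists of equal length")
    per_field = {}
    for field in fields:
        # Count posts where the model's value equals the gold-standard value.
        matches = sum(g.get(field) == p.get(field) for g, p in zip(gold, predicted))
        per_field[field] = 100.0 * matches / len(gold)
    average = sum(per_field.values()) / len(per_field)
    return per_field, average


# Hypothetical gold-standard and model-extracted records for four posts.
gold = [
    {"age": 34, "sex": "F", "inquiry_type": "diagnosis"},
    {"age": 19, "sex": "M", "inquiry_type": "symptom"},
    {"age": None, "sex": "F", "inquiry_type": "treatment"},
    {"age": 52, "sex": "M", "inquiry_type": "diagnosis"},
]
predicted = [
    {"age": 34, "sex": "F", "inquiry_type": "diagnosis"},
    {"age": 19, "sex": "M", "inquiry_type": "diagnosis"},
    {"age": None, "sex": "F", "inquiry_type": "treatment"},
    {"age": 25, "sex": "M", "inquiry_type": "diagnosis"},
]

per_field, average = field_accuracy(gold, predicted, ["age", "sex", "inquiry_type"])
print(per_field)
print(round(average, 1))
```

Reporting the per-field scores alongside the mean makes it possible to see the kind of split described in the Results, where one model may lead on demographic fields and another on health-related fields while their averages coincide.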
Conclusions: LLMs performed comparably with human annotators in extracting demographic and health-related information from the AskDocs subreddit unstructured posts. This study validates the use of LLMs for analyzing digital health care communications and highlights their potential as reliable tools for understanding online behaviors and interactions, shifting toward more sophisticated methodologies in digital research and practice.
(© Quinn Snell, Chase Westhoff, John Westhoff, Ethan Low, Carl L Hanson, E Shannon Neeley Tass. Originally published in the Journal of Medical Internet Research (https://www.jmir.org).)