Data for: A Framework for LLM-Facilitated Infodemiological Research: Democratizing the Analysis of COVID-19 Public Health Discourse Using Freely Accessible AI Tools

Published: 1 October 2025 | Version 1 | DOI: 10.17632/pnd3dhh4vw.1
Contributors:
Maher Asaad Baker

Description

This dataset contains the research data supporting the findings of the associated paper, which introduces a novel framework for conducting infodemiological research using freely available Large Language Models (LLMs). The data exemplifies the application of the framework's five-phase methodology (Research Design, Data Collection, LLM Analysis, Validation, and Visualization) to two key use cases from the COVID-19 pandemic.

The dataset is structured into the following primary components:

1. Vaccine Hesitancy Rhetoric Analysis Data: This subset includes:
● Anonymized Twitter Post IDs and Metadata: A list of Tweet IDs and corresponding dates collected via the Twitter API v2 for both pro-vaccine and vaccine-hesitant discourse during the initial vaccine rollout period (Dec 2020 - June 2021).
● Structured LLM Prompts: The exact prompt templates used for the iterative LLM-facilitated rhetorical analysis.
● LLM Analysis Outputs: Coded data from the LLM, identifying rhetorical frames (e.g., "Appeal to Personal Sovereignty," "Distrust of Pharmaceutical Motives"), representative quotes, and classified emotions.

2. Mental Health Discourse Evolution Data: This subset includes:
● Anonymized Reddit Post IDs and Metadata: A list of Post IDs from the r/COVID19_support subreddit for three key pandemic phases (Q2 2020, Q2 2021, Q2 2022).
● Structured LLM Prompts: The prompt templates used for simultaneous sentiment and thematic analysis of mental health discussions.
● LLM Analysis Outputs: Coded data from the LLM, including sentiment classifications (Positive, Negative, Neutral) and identified primary mental health concerns (e.g., "Social Isolation," "Pandemic Fatigue," "Grief") for each time period.

3. Validation Data: This includes the researcher-coded samples used for the Inter-Rater Reliability (IRR) checks, allowing for the verification of the LLM's analytical consistency.

This dataset provides a practical, real-world benchmark for researchers aiming to apply the proposed LLM-facilitated framework to public health discourse. It demonstrates the entire pipeline from raw data collection to validated, analyzed results, ensuring the reproducibility and transparency of the research.
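
As an illustration of how the Validation Data could be used, the following is a minimal sketch of an IRR check that compares researcher-assigned codes with LLM-assigned codes. It assumes Cohen's kappa as the agreement statistic; the description above does not state which IRR measure the paper uses. The function name `cohen_kappa` and the two label lists are hypothetical and are not taken from the dataset files.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment codes, for illustration only (not from the dataset).
researcher = ["Negative", "Negative", "Neutral", "Positive", "Negative", "Neutral"]
llm = ["Negative", "Neutral", "Neutral", "Positive", "Negative", "Negative"]

kappa = cohen_kappa(researcher, llm)
print(f"Cohen's kappa: {kappa:.3f}")
```

The same function applies to any of the categorical codes described above (rhetorical frames, emotions, or primary mental health concerns), provided the researcher and LLM labels are aligned item by item.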

Files

Categories

Linguistics, Computer Science, Public Health, Health Services Research, COVID-19

Licence