Data for: A Framework for LLM-Facilitated Infodemiological Research: Democratizing the Analysis of COVID-19 Public Health Discourse Using Freely Accessible AI Tools
Description
This dataset contains the research data supporting the findings of the associated paper, which introduces a novel framework for conducting infodemiological research using freely available Large Language Models (LLMs). The data exemplifies the application of the framework's five-phase methodology (Research Design, Data Collection, LLM Analysis, Validation, and Visualization) to two key use cases from the COVID-19 pandemic.

The dataset is structured into the following primary components:

1. Vaccine Hesitancy Rhetoric Analysis Data. This subset includes:
● Anonymized Twitter Post IDs and Metadata: A list of Tweet IDs and corresponding dates collected via the Twitter API v2 for both pro-vaccine and vaccine-hesitant discourse during the initial vaccine rollout period (Dec 2020 - June 2021).
● Structured LLM Prompts: The exact prompt templates used for the iterative LLM-facilitated rhetorical analysis.
● LLM Analysis Outputs: Coded data from the LLM, identifying rhetorical frames (e.g., "Appeal to Personal Sovereignty," "Distrust of Pharmaceutical Motives"), representative quotes, and classified emotions.

2. Mental Health Discourse Evolution Data. This subset includes:
● Anonymized Reddit Post IDs and Metadata: A list of Post IDs from the r/COVID19_support subreddit for three key pandemic phases (Q2 2020, Q2 2021, Q2 2022).
● Structured LLM Prompts: The prompt templates used for simultaneous sentiment and thematic analysis of mental health discussions.
● LLM Analysis Outputs: Coded data from the LLM, including sentiment classifications (Positive, Negative, Neutral) and identified primary mental health concerns (e.g., "Social Isolation," "Pandemic Fatigue," "Grief") for each time period.

3. Validation Data. This includes the researcher-coded samples used for the Inter-Rater Reliability (IRR) checks, allowing for the verification of the LLM's analytical consistency.
This dataset provides a practical, real-world benchmark for researchers aiming to apply the proposed LLM-facilitated framework to public health discourse. It documents the full pipeline, from data collection through validated, analyzed results, and thereby supports the reproducibility and transparency of the research.
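
As an illustration of how the Validation Data might be used, the sketch below computes Cohen's kappa between researcher-coded labels and LLM-coded labels for the same posts. This is a minimal sketch under stated assumptions: the description does not name the IRR statistic that was used, so Cohen's kappa is an assumed choice, and the function name `cohens_kappa` and the sample labels are hypothetical and are not drawn from the dataset files.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Assumption: kappa is used here only as an example IRR statistic;
    the dataset description does not specify which statistic was applied.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies,
    # summed over labels.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six posts (illustrative only, not from the dataset).
researcher_labels = ["Negative", "Negative", "Neutral", "Positive", "Negative", "Neutral"]
llm_labels = ["Negative", "Neutral", "Neutral", "Positive", "Negative", "Neutral"]

kappa = cohens_kappa(researcher_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.3f}")
```

The same function could be applied to the sentiment classifications or to the rhetorical-frame codes, provided the researcher-coded and LLM-coded labels are aligned on the same post IDs.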