Large Language Model-Driven Narrative Generation Study Data: ChatGPT-Generated Narratives, Real Tweets, and Source Code

Published: 23 November 2023 | Version 2 | DOI: 10.17632/nyxndvwfsh.2
Contributors:
, Ross Gore,

Description

In the interest of advancing the use of Large Language Models (LLMs) in engineering, science, medicine, and other fields, we provide the data sets and code associated with the Structured Narrative Prompt for LLMs Study. Data for this study were generated using an Agent-Based Model (ABM) and the LLM ChatGPT, and drawn from a set of tweets previously collected from Twitter. To facilitate reproducibility, transparency, and reuse of our work, this repository includes:

(1) Simulation-related code and data for generating simulated agents' life events
    (a) output from the Java ABM simulation, including the ABM-generated narratives and associated life-event information
(2) ChatGPT-related code and data
    (a) the Python script that generates structured prompts for ChatGPT from the ABM-generated life events
    (b) the set of generated structured prompts (inputs) for ChatGPT, used to generate the LLM narratives
    (c) the Python script that submits the structured prompts to ChatGPT via the API
    (d) the set of ChatGPT-generated narratives
    (e) the Python script that combines the ChatGPT (output) narratives with the ABM simulation narratives, in preparation for PANAS sentiment analysis
(3) Analysis-related code and data
    (a) the PANAS sentiment analysis R scripts
    (b) the statistical significance test R scripts (Chi-squared test and Fisher's exact test), used to find significant differences in sentiment scoring among ABM-generated narratives, LLM-generated narratives, and the real tweets
    (c) the PANAS lexicon used for the sentiment analysis
    (d) the set of utilized tweets with PII removed
    (e) the approved IRB documentation for collecting those tweets

Folder Names/Breakdown for Data File section:
1. LLM-related Scripts and Data: LLM_Phase_Scripts_and_Data.zip
2. Analysis-related Scripts and Data: Analysis_Phase_Scripts_and_Data.zip

Files

Steps to reproduce

Java ABM Simulation Phase: None. The ABM output CSV files (abm_output.csv and new_deaths.csv), located in LLM_Phase_Scripts_and_Data.zip, are used by chatgpt_prompt_generator.py and gen_analysis_file.py, which are in the same archive.

ChatGPT Narrative Generation and Analysis Preparation Phase (LLM_Phase_Scripts_and_Data.zip):
(1) Run "python chatgpt_prompt_generator.py" with a Python 3 interpreter (the same applies to all Python scripts). This populates "chat_inputs" with LLM prompts.
(2) In submit_prompts_chat.py, add your OpenAI API key on line 7. Run "python submit_prompts_chat.py". This iteratively selects a prompt from "chat_inputs", submits it to ChatGPT (GPT-3.5), puts the response text in "chat_outputs", and moves the input prompt file to "submitted_chat_inputs". If a time-out error is received from OpenAI, run "python submit_prompts_chat.py" again; prompt submission resumes without repeating or skipping inputs.
(3) Run "python gen_analysis_file.py" to create the CSV file "expanded_analysis_df.csv", which contains the ABM narratives and comparable LLM narratives for sentiment analysis in the next phase.

PANAS Sentiment Analysis and Statistical Significance Analysis Phase (Analysis_Phase_Scripts_and_Data.zip):
Sentiment Analysis:
(1) Run "Rscript construct_panas_lexicon.R" to generate the PANAS lexicon file "panas_lexicon.csv". R packages may need to be downloaded and installed for this phase.
(2) Run "Rscript prep_data.R" to construct "expanded_analysis_df.csv". The LLM-phase output file "expanded_analysis_df.csv" needs to be in this directory.
(3) Run "Rscript study_score.R" to generate "study_all_users_classed_and_score_tweets_EXPANDED.RDS".
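Step (2) of the ChatGPT phase can be safely re-run because a prompt file is moved out of "chat_inputs" only once its response has been stored. The following is a minimal sketch of that pattern, not the contents of submit_prompts_chat.py: the function name submit_pending_prompts, the submit callable (a stand-in for the real API call), and the ".txt" file extension are assumptions made for illustration. Only the three folder names come from the description above.

```python
import shutil
from pathlib import Path
from typing import Callable


def submit_pending_prompts(base_dir: Path, submit: Callable[[str], str]) -> int:
    """Submit every prompt still waiting in chat_inputs; return how many succeeded.

    `submit` is a hypothetical placeholder for the call that sends one prompt
    to the LLM and returns the response text.
    """
    inputs = base_dir / "chat_inputs"
    outputs = base_dir / "chat_outputs"
    submitted = base_dir / "submitted_chat_inputs"
    outputs.mkdir(parents=True, exist_ok=True)
    submitted.mkdir(parents=True, exist_ok=True)

    done = 0
    # The ".txt" extension is an assumption; adjust the pattern to the real files.
    for prompt_file in sorted(inputs.glob("*.txt")):
        # If submit() raises (for example on a time-out), this prompt file is
        # still in chat_inputs, so the next run picks it up again.
        response = submit(prompt_file.read_text(encoding="utf-8"))
        (outputs / prompt_file.name).write_text(response, encoding="utf-8")
        # Move the prompt only after its response is saved, so a re-run
        # neither repeats finished prompts nor skips unfinished ones.
        shutil.move(str(prompt_file), str(submitted / prompt_file.name))
        done += 1
    return done
```

Calling the function a second time after an interruption processes only the prompts that remain in "chat_inputs", which matches the resume behavior described in step (2).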
Statistical Analysis:
(1) Run "Rscript study_test_distributions_of_variables_from_multiple_simulation_runs.R" to generate the statistical analysis output files in "../output/".
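For readers unfamiliar with the two significance tests named in this phase, the sketch below shows what each computes on a single 2x2 contingency table (for example, counts of narratives in two sentiment classes from two sources). It is a pure-Python illustration under stated assumptions, not a port of the study's R scripts: the function names are hypothetical, the tables used in the study may be larger than 2x2, and the chi-squared version here omits the Yates continuity correction that R's chisq.test applies to 2x2 tables by default.

```python
import math
from typing import Tuple


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-squared statistic and p-value for the table [[a, b], [c, d]].

    No continuity correction is applied. All row and column totals must be
    non-zero, otherwise the statistic is undefined.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = 0.0
    for x in range(lowest, highest + 1):
        p = prob(x)
        # Small tolerance so ties with the observed table are included.
        if p <= observed * (1 + 1e-7):
            total += p
    return total
```

With the made-up counts (3, 1, 1, 3), chi_squared_2x2 returns a statistic of 2.0 and fisher_exact_2x2 returns 34/70, about 0.486. Fisher's exact test is usually preferred when expected cell counts are small, which is one common reason for running both tests.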

Institutions

Old Dominion University

Categories

Machine Learning, Agent-Based Modeling, Narrative Analysis, Chi-Square Testing, Data Analytics, Sentiment Analysis, Artificial General Intelligence, ChatGPT, Chatbot, Prompt Engineering

Funding

Old Dominion University

300916-010

Licence