ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods

Published: 15 August 2025 | Version 1 | DOI: 10.17632/g2sdzmssgh.1
Contributors:

Description

This dataset and code package supports reproducible evaluation of structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. The package includes:

* Tagged datasets (.csv): human-tagged gold labels for evaluation
* Untagged datasets (.csv): raw data with each prompt matched to its corresponding LLM-generated narrative, suitable for inference, semi-automatic labeling, or transfer learning
* Python and R code for preprocessing, model training, evaluation, and visualization
* Configuration files and environment specifications to enable end-to-end reproducibility

The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation and machine learning validation. This release provides complete transparency for reproducing the reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.

Value of the Data:

* Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
* Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
* Offers untagged datasets for new annotation or domain adaptation.
* Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
* Facilitates extension into other domains (e.g., multilingual LLM messaging validation).

Data Description:

* /data/tagged/*.csv – Human-labeled datasets with the schema defined in data_dictionary.csv.
* /data/untagged/*.csv – Clean datasets without labels, for inference or annotation.
* /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
* /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.

A minimal sketch showing how the tagged files can be loaded and used to train one of the named classifiers appears at the end of this section.

File Formats:

* Data: CSV (UTF-8, RFC 4180)
* Code: .py, .R, .Rproj

Ethics & Licensing:

* All data are de-identified and contain no PII.
* Released under CC BY 4.0 (data) and the MIT License (code).

Limitations:

* Labels reflect annotator interpretations and may encode bias.
* Models were trained on English text; generalization to other languages requires adaptation.

Funding Note:

* Funding sources provided time in support of the human taggers who annotated the datasets.
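For orientation, here is a minimal Python sketch of how the tagged CSVs might feed one of the named classifiers: an XGBoost baseline over TF-IDF features. This is not the authors' pipeline; the file name events.csv and the message/label column names are illustrative assumptions, so consult data_dictionary.csv for the actual schema. Note also that the published BERT and Keras models use richer text encodings than TF-IDF.

# Minimal sketch, not the authors' pipeline: XGBoost baseline on the tagged data.
# "events.csv", "message", and "label" are assumed names -- see data_dictionary.csv.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

df = pd.read_csv("data/tagged/events.csv", encoding="utf-8")

le = LabelEncoder()                      # map string labels to integer classes
y = le.fit_transform(df["label"])
X = TfidfVectorizer(max_features=20000, ngram_range=(1, 2)).fit_transform(df["message"])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

clf = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), target_names=[str(c) for c in le.classes_]))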

Steps to reproduce

1) R Project
* Open the .Rproj file from Code - R Project.zip in RStudio.
* The project is fully configured to run locally; no setup is required beyond installing the required R packages.
* Executing the scripts from within the R project will:
  - Load the human-tagged and AI-tagged datasets.
  - Generate confusion matrices comparing model performance across the two tagging sources (a Python sketch of this comparison follows these steps).

2) Python Scripts
* Extract Code - Python Scripts.zip to your working directory.
* Run the scripts individually from the command line or an IDE.
* These scripts will:
  - Train and evaluate the ML models (BERT, Keras, XGBoost, and ensemble methods).
  - Produce performance metrics and output files.
  - Create timing figures from the recorded training-time data.

3) Data Folders
* These folders contain:
  - Original, untagged, AI-generated messages produced with GPT-4
  - Human-tagged message set
    * Original data set
    * Validation data set
  - ML-tagged message set (BERT, Keras-based architectures, eXtreme Gradient Boosting, Random Forest, and Support Vector Machines)
    * Original data set
    * Validation data set
  - ML timing data
  - ML classification data from 10-fold cross-validation
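As a hedged companion to the R project's confusion-matrix step, the sketch below performs the same human-vs-model comparison in Python. The file names, the shared "id" join key, and the "label" columns are assumptions for illustration, not the package's documented schema.

# Python sketch of the human-vs-ML confusion-matrix comparison the R project performs.
# File names, the "id" join key, and the "label" columns are assumed, not documented.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

human = pd.read_csv("data/tagged/human_tagged.csv")   # hypothetical path
ml = pd.read_csv("data/tagged/ml_tagged_bert.csv")    # hypothetical path

# Align the two tagging sources on a shared message identifier.
merged = human.merge(ml, on="id", suffixes=("_human", "_ml"))

labels = sorted(merged["label_human"].unique())
cm = confusion_matrix(merged["label_human"], merged["label_ml"], labels=labels)

ConfusionMatrixDisplay(cm, display_labels=labels).plot(xticks_rotation=45)
plt.title("Human tags vs. model tags")
plt.tight_layout()
plt.show()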

Institutions

Old Dominion University

Categories

Artificial Intelligence, Machine Learning, Ensemble, Validation Study, Model Validation, Deep Learning, Bidirectional Encoder Representations From Transformers, Extreme Gradient Boosting, Large Language Model, Explainable LLM

Funding

Old Dominion University (award 300916-010)

Department of Education Modeling and Simulation Program (award P116S210003)

Commonwealth Cyber Initiative (CCI) Fellowship (award H-2Q24-016)

Office of Naval Research (award N00014-19-1-2624)

United States Air Force Office of Scientific Research (award 22RT0286)

Licence

CC BY 4.0 (data); MIT License (code)