ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods

Published: 15 August 2025 | Version 1 | DOI: 10.17632/g2sdzmssgh.1
Contributors:

Description

This dataset and code package supports reproducible evaluation of structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. The package includes:

* Tagged datasets (.csv): human-tagged gold labels for evaluation
* Untagged datasets (.csv): raw data with each prompt matched to its corresponding LLM-generated narrative, suitable for inference, semi-automatic labeling, or transfer learning
* Python and R code for preprocessing, model training, evaluation, and visualization
* Configuration files and environment specifications to enable end-to-end reproducibility

The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation and machine learning validation. This release provides complete transparency for reproducing the reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.

Value of the Data:

* Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
* Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
* Offers untagged datasets for new annotation or domain adaptation.
* Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
* Facilitates extension into other domains (e.g., multilingual LLM messaging validation).

Data Description:

* /data/tagged/*.csv – Human-labeled datasets with the schema defined in data_dictionary.csv.
* /data/untagged/*.csv – Clean datasets without labels, for inference or annotation.
* /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
* /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.

A minimal sketch showing how the tagged files can be loaded and used to train one of the named classifiers appears at the end of this section.

File Formats:

* Data: CSV (UTF-8, RFC 4180)
* Code: .py, .R, .Rproj

Ethics & Licensing:

* All data are de-identified and contain no PII.
* Released under CC BY 4.0 (data) and the MIT License (code).

Limitations:

* Labels reflect annotator interpretations and may encode bias.
* Models were trained on English text; generalization to other languages requires adaptation.

Funding Note:

* Funding sources provided time in support of the human taggers who annotated the datasets.
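For orientation, here is a minimal Python sketch of how the tagged CSVs might feed one of the named classifiers: an XGBoost baseline over TF-IDF features. This is not the authors' pipeline; the file name events.csv and the message/label column names are illustrative assumptions, so consult data_dictionary.csv for the actual schema. Note also that the published BERT and Keras models use richer text encodings than TF-IDF.

# Minimal sketch, not the authors' pipeline: XGBoost baseline on the tagged data.
# "events.csv", "message", and "label" are assumed names -- see data_dictionary.csv.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

df = pd.read_csv("data/tagged/events.csv", encoding="utf-8")

le = LabelEncoder()                      # map string labels to integer classes
y = le.fit_transform(df["label"])
X = TfidfVectorizer(max_features=20000, ngram_range=(1, 2)).fit_transform(df["message"])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

clf = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), target_names=[str(c) for c in le.classes_]))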

Steps to reproduce

1) R Project
* Open the .Rproj file from Code - R Project.zip in RStudio.
* The project is fully configured to run locally; no setup is required beyond installing the required R packages.
* Executing the scripts from within the R project will:
  - Load the human-tagged and AI-tagged datasets.
  - Generate confusion matrices comparing model performance across the two tagging sources (a Python sketch of this comparison follows these steps).

2) Python Scripts
* Extract Code - Python Scripts.zip to your working directory.
* Run the scripts individually from the command line or an IDE.
* These scripts will:
  - Train and evaluate the ML models (BERT, Keras, XGBoost, and ensemble methods).
  - Produce performance metrics and output files.
  - Create timing figures from the recorded training-time data.

3) Data Folders
* These folders contain:
  - Original, untagged, AI-generated messages produced with GPT-4
  - Human-tagged message set
    * Original data set
    * Validation data set
  - ML-tagged message set (BERT, Keras-based architectures, eXtreme Gradient Boosting, Random Forest, and Support Vector Machines)
    * Original data set
    * Validation data set
  - ML timing data
  - ML classification data from 10-fold cross-validation
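As a hedged companion to the R project's confusion-matrix step, the sketch below performs the same human-vs-model comparison in Python. The file names, the shared "id" join key, and the "label" columns are assumptions for illustration, not the package's documented schema.

# Python sketch of the human-vs-ML confusion-matrix comparison the R project performs.
# File names, the "id" join key, and the "label" columns are assumed, not documented.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

human = pd.read_csv("data/tagged/human_tagged.csv")   # hypothetical path
ml = pd.read_csv("data/tagged/ml_tagged_bert.csv")    # hypothetical path

# Align the two tagging sources on a shared message identifier.
merged = human.merge(ml, on="id", suffixes=("_human", "_ml"))

labels = sorted(merged["label_human"].unique())
cm = confusion_matrix(merged["label_human"], merged["label_ml"], labels=labels)

ConfusionMatrixDisplay(cm, display_labels=labels).plot(xticks_rotation=45)
plt.title("Human tags vs. model tags")
plt.tight_layout()
plt.show()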

Institutions

Old Dominion University

Categories

Artificial Intelligence, Machine Learning, Ensemble, Validation Study, Model Validation, Deep Learning, Bidirectional Encoder Representations From Transformers, Extreme Gradient Boosting, Large Language Model, Explainable LLM

Funding

Old Dominion University (award 300916-010)

Department of Education Modeling and Simulation Program (award P116S210003)

Commonwealth Cyber Initiative (CCI) Fellowship (award H-2Q24-016)

Office of Naval Research (award N00014-19-1-2624)

United States Air Force Office of Scientific Research (award 22RT0286)

Licence

CC BY 4.0 (data); MIT License (code)