Nigerian Academic Writing Corpus: Pre-AI Benchmark for AI-Generated Text Detection

Published: 14 April 2026 | Version 1 | DOI: 10.17632/mfyx7pxwws.1
Contributor:
Olanrewaju Akisanya

Description

A curated corpus of pre-AI-era (2005–2022) Nigerian academic writing paired with AI-generated equivalents, intended for training and evaluating AI-generated text detectors calibrated to Nigerian English. The dataset contains two components: (1) 4,239 authentic academic texts extracted from the Covenant University and University of Ibadan open-access repositories, spanning theses, dissertations, journal articles, and student projects across multiple disciplines; and (2) 6,000 AI-generated academic essays produced by six major large language models (GPT-4o-mini, GPT-4o, Claude Sonnet, Gemini 2.5 Flash, Grok 3 Mini, and DeepSeek Chat) using topics drawn from the human corpus. All human texts predate the launch of ChatGPT (November 2022), so the human baseline is free of AI-generated text. The paired structure enables controlled experiments in binary classification (human vs. AI), multi-class source attribution (identifying which LLM generated a text), and cross-dialectal evaluation of AI detection systems on African English.
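As an illustration of the two classification tasks, the sketch below derives binary and source-attribution labels and checks a date against the ChatGPT launch. The `source` strings and the integer label scheme are assumptions made for this example, not the dataset's actual file layout; the README remains the authority on field names.

```python
from datetime import date

# Public release date of ChatGPT, used as the contamination cut-off.
CHATGPT_LAUNCH = date(2022, 11, 30)

# Generator names as listed in the description. Using them as label
# strings is an assumption for this sketch.
GENERATORS = [
    "GPT-4o-mini",
    "GPT-4o",
    "Claude Sonnet",
    "Gemini 2.5 Flash",
    "Grok 3 Mini",
    "DeepSeek Chat",
]


def attribution_label(source: str) -> int:
    """Multi-class label: 0 for human, 1..6 for the generating model."""
    if source == "human":
        return 0
    # list.index raises ValueError for a source that is not listed.
    return GENERATORS.index(source) + 1


def binary_label(source: str) -> int:
    """Binary label: 0 for human, 1 for any AI generator."""
    return 0 if attribution_label(source) == 0 else 1


def predates_chatgpt(written: date) -> bool:
    """True if a text was written strictly before the ChatGPT launch."""
    return written < CHATGPT_LAUNCH
```

For example, `binary_label("Claude Sonnet")` gives 1 while `attribution_label("Claude Sonnet")` gives 3, so the same record can serve both tasks.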

Steps to reproduce

Human texts were harvested from Covenant University EPrints (OAI-PMH) and University of Ibadan DSpace repositories. Text was extracted from the PDFs using PyMuPDF, with Tesseract OCR as a fallback. AI texts were generated via commercial LLM APIs (OpenAI, Anthropic, Google, xAI, DeepSeek) using a standardised prompt requesting 800–1,500-word academic essays on topics from the human corpus, with a temperature of 0.8 and a maximum of 2,000 tokens. The full methodology is in the accompanying README.md.
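The extraction step could look roughly like the following minimal sketch. It is not the author's actual pipeline: the function names, the 200-character threshold for deciding that a page needs OCR, and the 300 dpi setting are all assumptions for illustration. The OCR branch relies on PyMuPDF's Tesseract integration, which needs a local Tesseract installation with its language data.

```python
def looks_empty(text: str, min_chars: int = 200) -> bool:
    """Heuristic (assumed): a page with very little embedded text is
    probably a scanned image and should be sent to OCR."""
    return len(text.strip()) < min_chars


def extract_pdf_text(path: str, min_chars: int = 200, ocr_language: str = "eng") -> str:
    """Return the text of every page, using OCR only where needed."""
    # PyMuPDF is imported lazily so looks_empty() has no dependencies.
    import fitz

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            # First try the text layer embedded in the PDF.
            text = page.get_text()
            if looks_empty(text, min_chars):
                # Fallback: rasterise the full page and run Tesseract on it.
                textpage = page.get_textpage_ocr(
                    language=ocr_language, dpi=300, full=True
                )
                text = page.get_text(textpage=textpage)
            pages.append(text.strip())
    # Pages are separated by a blank line.
    return "\n\n".join(pages)
```

Deciding per page rather than per document keeps OCR cost down for PDFs that mix born-digital pages with scanned inserts, at the price of a threshold that may need tuning.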

Categories

Natural Language Processing, Generative Artificial Intelligence

Licence