CodeLLMExp: An Annotated Dataset for Automated Vulnerability Localization and Explanation in AI-Generated Code

Published: 7 November 2025 | Version 1 | DOI: 10.17632/wxmnyrp668.1
Contributors:
Omer FOTSO

Description

CodeLLMExp is a large-scale, multi-language, multi-vulnerability dataset created to advance research into the security of AI-generated code. It is specifically designed to train and evaluate machine learning models, such as Large Language Models (LLMs), on the joint tasks of Automated Vulnerability Localization (AVL) and Explainable AI (XAI).

The dataset was constructed through a rigorous pipeline: prompts were sourced from established security benchmarks (CodeLMSec, SecurityEval, Copilot CWE Scenarios), seed augmentation was applied to ensure coverage of under-represented Common Weakness Enumerations (CWEs), and a chain of LLMs generated vulnerable code snippets. This raw data was then automatically evaluated for quality by an "LLM-as-judge" (validated against human experts with a Spearman correlation of 0.8545) and enriched with structured annotations.

CodeLLMExp covers three of the most widely used programming languages: Python, Java, and C. It contains 10,400 high-quality examples across Python (44.3%), Java (29.6%), and C (26.1%), and it focuses on 29 distinct CWEs, including the complete 2024 CWE Top 25 Most Dangerous Software Errors. Each record provides a vulnerable code snippet, the precise line number of the flaw, a structured explanation (root cause, impact, mitigation), and a fixed version of the code.

By providing richly annotated data for detection, classification, localization, and explanation, CodeLLMExp enables the development of more robust and transparent security analysis tools. It facilitates research into LLM adaptation strategies (e.g., prompting, fine-tuning, Retrieval-Augmented Generation), automated program repair, and the inherent security patterns of code produced by AI.
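The per-record layout lends itself to straightforward programmatic use. The Python sketch below is a minimal illustration of iterating over the dataset; it assumes a JSON Lines export, and the file name and field names (`vulnerable_code`, `vulnerable_line`, `explanation`, `fixed_code`, etc.) are hypothetical placeholders rather than the released schema.

```python
import json

# Minimal sketch of reading CodeLLMExp records, assuming a JSON Lines export.
# The file name and field names used here are illustrative and may differ
# from the actual released files.
def load_records(path):
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

for record in load_records("codellmexp.jsonl"):  # hypothetical file name
    print(record["language"], record["cwe_id"])                # e.g. "python", "CWE-89"
    print("Flawed line:", record["vulnerable_line"])           # line number of the flaw
    print("Root cause:", record["explanation"]["root_cause"])  # structured explanation
    print("Mitigation:", record["explanation"]["mitigation"])
    break  # inspect a single record
```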

Files

Steps to reproduce

The dataset was produced via a five-phase automated pipeline:

1. Prompt Aggregation: We collected incomplete code prompts from established security benchmarks (CodeLMSec, SecurityEval, Copilot CWE Scenarios) to serve as a basis for generation.

2. LLM-Assisted Generation: Using Llama 3.1 (8B), we first augmented the initial prompts to increase structural diversity. We then generated complete, vulnerable code snippets in Python, Java, and C, programmatically inserting a `# BAD` or `// BAD` marker before the flawed line. A separate step produced a structured JSON explanation (root_cause, impact, mitigation) for each vulnerability.

3. Seed-Based Augmentation: To ensure broad coverage, especially of the 2024 CWE Top 25 list, the LLM generated multiple variations from "seed" examples of specific vulnerabilities.

4. Automated Evaluation & Filtering: An "LLM-as-judge" (Gemini 2.5 Flash) automatically evaluated each generated record, assigning a quality score from 1 (Very Bad) to 5 (Very Good) based on a strict rubric assessing code validity, CWE correctness, and explanation accuracy. Only records with a score of 3 or higher were retained. A corrected `fixed_code` version was then generated for each validated record (see the sketch after this list).

5. Human Meta-Evaluation: To validate the automated judge, human security experts independently rated a stratified sample of 2,082 records. The high Spearman correlation (ρ = 0.8545, p < 0.0001) between the human and LLM scores confirmed the reliability of our quality control process.
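As a rough illustration of phases 2, 4, and 5, the Python sketch below shows how a `# BAD` / `// BAD` marker can be mapped to a line number, how records can be filtered on the judge's score, and how the Spearman correlation of the meta-evaluation can be computed with scipy. The field names (`judge_score`, `human_score`) are hypothetical; this is not the pipeline's actual code.

```python
from scipy.stats import spearmanr

def find_flawed_line(code: str):
    """Return the 1-based number of the line following a `# BAD` / `// BAD` marker."""
    for i, line in enumerate(code.splitlines()):
        if line.strip().endswith(("# BAD", "// BAD")):
            return i + 2  # the flawed line comes right after the marker line
    return None

def keep_record(record: dict) -> bool:
    """Phase-4 style filter: retain only records the LLM judge scored 3 or higher."""
    return record.get("judge_score", 0) >= 3  # "judge_score" is a hypothetical field name

def meta_evaluate(sample: list) -> float:
    """Phase-5 style check: Spearman correlation between human and LLM-judge scores."""
    human = [r["human_score"] for r in sample]  # hypothetical field name
    judge = [r["judge_score"] for r in sample]
    rho, p_value = spearmanr(human, judge)
    return rho
```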

Institutions

Université de Yaoundé I

Categories

Computer Science, Artificial Intelligence, Cybersecurity, Software Engineering, Data Science, Software Security, Machine Learning, Cyber Security Vulnerability Assessment, Large Language Model

Licence