User Story Ambiguity Dataset: A Comprehensive Research Resource

Published: 11 July 2025 | Version 1 | DOI: 10.17632/wz9spjy4v5.1
Contributor:
Cornelius Okechukwu

Description

This dataset represents the largest empirical collection of user story ambiguities, comprising 12,847 authentic user stories from eight companies in the finance, healthcare, e-commerce, telecommunications, and manufacturing domains. It addresses a gap in requirements engineering research by providing systematically annotated real-world data for investigating ambiguity patterns in agile development environments.

Ambiguity rates vary widely between organisations, from 15.3% to 67.8% across companies, reflecting differences in agile maturity and domain complexity. Seven distinct ambiguity types were identified. Semantic ambiguities are the most prevalent (34.2%), followed by scope ambiguities (28.7%) and actor ambiguities (19.4%), which indicates where requirements confusion most often originates in practice.

The dataset is structured across five interconnected sheets and includes attributes covering team characteristics, project outcomes, and temporal progression. The temporal analysis shows a 23.4% average improvement in story quality over 12-month periods, which provides empirical evidence of organisational learning effects in requirements practices. Ambiguity rates correlate negatively with team experience (r=-0.73) and positively with domain complexity (r=0.52), and the annotations reached an inter-rater reliability of α=0.77.

The collection serves several research purposes, from training machine learning models for automated ambiguity detection to validating requirements engineering frameworks across different organisational contexts. It enables researchers in the empirical software engineering community to conduct comparative studies, develop evidence-based tools, and advance the understanding of requirements quality in agile environments.
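The per-company ambiguity rates and the correlation figures quoted above are simple statistics that can be recomputed from annotated records. The following is a minimal sketch of one way to do that, not the contributor's actual analysis. It assumes each story has been reduced to a record with a company identifier and a boolean ambiguity flag. The field names `company` and `ambiguous`, the helper `make_records`, the `experience_years` values, and all the toy records are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from math import sqrt


def make_records(company, n_total, n_ambiguous):
    """Build toy story records (hypothetical fields, NOT real dataset rows)."""
    return [{"company": company, "ambiguous": i < n_ambiguous}
            for i in range(n_total)]


def ambiguity_rate_by_company(records):
    """Return the fraction of stories flagged ambiguous, per company."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for record in records:
        totals[record["company"]] += 1
        if record["ambiguous"]:
            flagged[record["company"]] += 1
    return {company: flagged[company] / totals[company] for company in totals}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Toy input: three hypothetical companies with four stories each.
stories = (make_records("A", 4, 1)
           + make_records("B", 4, 2)
           + make_records("C", 4, 3))

# Hypothetical mean team experience in years, per company.
experience_years = {"A": 8.0, "B": 5.0, "C": 3.0}

rates = ambiguity_rate_by_company(stories)
companies = sorted(rates)
r = pearson_r([experience_years[c] for c in companies],
              [rates[c] for c in companies])
print(rates)
print(round(r, 2))
```

With real data, the records would be read from the relevant sheet, and the column names would need to be matched to the dataset's own schema before any result is compared with the figures reported in the description.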

Institutions

Univerzita Tomase Bati ve Zline Fakulta Aplikovane Informatiky

Categories

Software Engineering, Requirement Engineering, Natural Language Processing, Machine Learning, Empirical Study of Software Engineering, Ambiguity

Licence