Engineered dataset for Cross-Project Requirement Traceability in Natural Language Artefacts
Description
This dataset was compiled to support cross-project requirement traceability using contrastive learning techniques on natural language artefacts. It contains 15,872 requirements across 37 projects and 7,624 validated cross-project links, organised into multiple Excel sheets that provide different data views.

The data were replicated from the following sources: (1) open source repositories (25 projects); (2) an industrial dataset of 12 proprietary projects from 3 industry partners, with 20-35 requirements per project; and (3) benchmark datasets: (a) PURE, 79 smaller research projects (5-15 requirements each), and (b) PROMISE NFR, 15 projects focused on non-functional requirements (40 requirements each), with comprehensive NFR coverage across the categories performance, security, usability, reliability, scalability, and maintainability. The dataset supports the evaluation of traceability links across different projects, contributing to both the software engineering and natural language processing domains by enabling a more robust approach to cross-project traceability that can support knowledge transfer and reuse across software projects.

Features of the dataset: (1) multiple data sheets in the Excel file for detailed analysis of requirements; (2) cross-project relationships with confidence scoring and validation status; (3) temporal data with creation dates and project timelines; (4) multi-dimensional classification (functional/non-functional, priority, complexity); and (5) stakeholder attribution and a tagging system.
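As an illustration of how the confidence scores and validation status on cross-project links might be used, the following is a minimal sketch in Python. The record layout is an assumption: the field names (`source_project`, `target_project`, `confidence`, `validated`), the example rows, and the 0.8 threshold are all hypothetical and are not taken from the Excel file, so they would need to be mapped to the actual sheet and column names before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One traceability link; all field names here are hypothetical."""
    source_project: str
    source_req: str
    target_project: str
    target_req: str
    confidence: float   # assumed to lie in [0, 1]
    validated: bool     # assumed boolean validation status


def select_links(links, min_confidence=0.8):
    """Keep validated links that cross a project boundary and meet the threshold."""
    return [
        link for link in links
        if link.validated
        and link.source_project != link.target_project
        and link.confidence >= min_confidence
    ]


# Made-up rows for illustration only; they do not come from the dataset.
example_links = [
    Link("P01", "REQ-3", "P07", "REQ-12", 0.91, True),
    Link("P01", "REQ-4", "P01", "REQ-9", 0.95, True),    # same project, dropped
    Link("P02", "REQ-1", "P05", "REQ-2", 0.55, True),    # below threshold, dropped
    Link("P03", "REQ-8", "P04", "REQ-6", 0.88, False),   # not validated, dropped
]

selected = select_links(example_links)
for link in selected:
    print(link.source_project, link.source_req, "->", link.target_project, link.target_req)
```

In practice the rows would be read from the relevant sheet of the Excel file rather than written inline, and the threshold would be chosen to suit the analysis.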