HausaSRS: A Hausa-English Parallel Corpus of Software Requirements Specifications
Description
This study hypothesizes that software requirements written in English can be systematically transformed into a high-quality Hausa Requirements Engineering (RE) dataset through a controlled pipeline that combines document harvesting, domain filtering, glossary-guided translation, weak annotation, and expert validation. A related hypothesis is that, in a low-resource setting, combining automated corpus construction with human-in-the-loop review can produce data of sufficient quality for downstream NLP tasks such as translation, functional/non-functional (FR/NFR) requirement classification, and token-level entity extraction.

The data consist of a parallel, annotated corpus derived from approximately 350 Software Requirements Specification (SRS) documents collected across the health, education, and finance domains. Source documents in PDF and DOCX formats were converted to text using pdfplumber and python-docx, cleaned with rule-based preprocessing to remove non-semantic artifacts such as page numbers, repeated spacing, URLs, and boilerplate labels, and filtered to retain requirement-like content using requirements-engineering keywords and modal patterns. English segments were retained, technical terms were anchored through a custom SRS glossary and named-entity recognition, and the retained text was translated into Hausa. Hausa outputs were normalized and weakly annotated with BIO tags and FR/NFR labels. Synthetic IEEE-style Hausa requirement templates were also introduced to strengthen corpus structure. After cleaning, deduplication, and removal of malformed rows, the dataset was stored as a cleaned silver corpus and partitioned into train, validation, and test subsets.

Results show that it is feasible to construct a domain-specific Hausa RE resource from heterogeneous SRS documents using a semi-automated workflow.
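The cleaning and modal-pattern filtering stage described above can be sketched as follows. The specific regular expressions and keyword list are illustrative assumptions; the authors' exact rules are not published here.

```python
import re

# Hypothetical sketch of the rule-based filter; the modal-verb list and
# noise patterns are assumptions, not the authors' published rules.
MODAL_PATTERN = re.compile(r"\b(shall|must|should|will)\b", re.IGNORECASE)
PAGE_NUMBER = re.compile(r"^\s*Page\s+\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)
URL = re.compile(r"https?://\S+")

def clean_line(line: str) -> str:
    """Drop page-number boilerplate, strip URLs, collapse repeated spacing."""
    if PAGE_NUMBER.match(line):
        return ""
    line = URL.sub("", line)
    return re.sub(r"\s+", " ", line).strip()

def is_requirement_like(sentence: str) -> bool:
    """Keep sentences containing an IEEE-style modal verb."""
    return bool(MODAL_PATTERN.search(sentence))

raw = [
    "Page 3 of 120",
    "The  system   shall  encrypt patient records at rest.",
    "See https://example.org for details.",
    "The project began in 2019.",
]
kept = [s for s in (clean_line(l) for l in raw) if s and is_requirement_like(s)]
# kept -> ["The system shall encrypt patient records at rest."]
```

In practice such filters over-select (e.g. headings containing "shall") and under-select (requirements phrased without modals), which is one reason the pipeline adds expert validation downstream.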
The results also show that a glossary-aware translation and annotation strategy can preserve important software engineering concepts, such as actors, system entities, constraints, and quality attributes, in Hausa. Notably, automated annotation alone is not sufficient for reliable low-resource RE data: expert correction by a Hausa-speaking NLP specialist was necessary to refine mistranslations, resolve ambiguous labels, and correct token boundaries. This confirms the importance of expert validation in producing a gold-standard corpus from an initially silver dataset.

The dataset is a structured representation of software requirements knowledge in Hausa, aligned with common RE tasks. The parallel English–Hausa component supports machine translation and cross-lingual modeling; the FR/NFR labels support requirement classification; and the BIO tags support sequence labeling and information extraction. Researchers can use the data for translation benchmarking, low-resource RE classification, domain-adaptive pretraining, or Hausa-specific entity extraction. More broadly, the dataset demonstrates a reproducible pathway for creating Requirements Engineering datasets in under-resourced languages.
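To make the BIO scheme concrete, here is a hypothetical illustration of how its token-level tags decode into entity spans. The example sentence and tags are invented for readability and shown in English; in the released corpus the tags are carried on the Hausa side.

```python
# Invented example: an English requirement with actor/action/object BIO tags.
tokens = ["The", "nurse", "shall", "update", "the", "patient", "record"]
tags = ["O", "B-ACTOR", "O", "B-ACTION", "O", "B-OBJECT", "I-OBJECT"]

def extract_entities(tokens, tags):
    """Group B-/I- tag runs into (label, surface text) entity pairs."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(label, " ".join(words)) for label, words in entities]

print(extract_entities(tokens, tags))
# [('ACTOR', 'nurse'), ('ACTION', 'update'), ('OBJECT', 'patient record')]
```

Decoding like this is what makes the corpus directly usable for information extraction: a model trained on the BIO tags yields structured actor/action/object triples from free-text requirements.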
Files
Steps to reproduce
The data were created through a structured, semi-automated workflow designed to produce high-quality requirements engineering samples for a low-resource language setting. Initially, approximately 350 Software Requirements Specification (SRS) documents were collected from academic and industry sources across the healthcare, education, and finance domains. Documents were obtained in PDF and DOCX formats and converted to plain text using automated extraction tools. A preprocessing stage was applied to remove noise, normalize formatting, and filter out non-requirement content. Relevant requirement statements were identified using a combination of rule-based filtering and linguistic heuristics aligned with IEEE-style specifications, particularly the use of modal verbs such as "shall" and "must."

The extracted English sentences were then translated into Hausa using a controlled translation approach supported by a domain-specific glossary to ensure consistent mapping of technical terms. Following translation, the text was normalized to standard Unicode form to correctly represent Hausa-specific characters. The data were then annotated at two levels: sentence-level classification into functional and non-functional requirements, and token-level labeling using a BIO scheme to identify key entities such as actor, action, and object. Annotation was performed manually by a domain-aware annotator using spreadsheet-based tools to ensure clarity and traceability.

To ensure quality and reproducibility, a human-in-the-loop validation phase was conducted in which a Hausa-speaking expert reviewed all entries, corrected translations, and verified labels and entity boundaries. Only validated samples were retained in the final dataset. The overall workflow combines automated preprocessing with expert validation, making it reproducible using standard NLP tools, spreadsheet software, and clearly defined annotation guidelines.
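Two of the steps above can be sketched in code: Unicode normalization for Hausa-specific characters, and a deterministic train/validation/test assignment. The hash-based split and the 80/10/10 ratios are illustrative assumptions; the description does not state how the partition was actually drawn.

```python
import hashlib
import unicodedata

def normalize_hausa(text: str) -> str:
    """Normalize to NFC so base letter + combining mark sequences compose
    into single code points; Hausa hooked letters are already atomic."""
    return unicodedata.normalize("NFC", text)

def assign_split(sentence: str, ratios=(0.8, 0.1, 0.1)) -> str:
    """Assumed deterministic split: hash the sentence so the same text
    always lands in the same partition across reruns."""
    bucket = int(hashlib.md5(sentence.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < ratios[0] * 100:
        return "train"
    if bucket < (ratios[0] + ratios[1]) * 100:
        return "validation"
    return "test"

# "a" + combining acute accent composes to the single code point "\u00e1".
assert normalize_hausa("a\u0301") == "\u00e1"
```

A content-hash split like this also keeps exact duplicates out of different partitions, which matters after the deduplication step described above.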
Institutions
- Nile University of Nigeria, FCT, Abuja