Spell Error Corpus for Minang Language
Description
The Spell Error Corpus for Minang Language (SPEML) is a dataset designed to support spelling error detection and correction for Minangkabau, a low-resource language of Indonesia. The base corpus was derived from a publicly available Indonesian–Minangkabau parallel corpus in the minangNLP GitHub repository. From the aligned sentence pairs, only the Minangkabau column was extracted, yielding 16,371 sentences. The texts were preprocessed through normalization, cleaning, and deduplication, resulting in 16,334 unique Minangkabau sentences. SPEML contains 164,662 misspelled word forms organized into seven error categories. Spelling errors were generated using rule-based procedures covering character insertion (1–3 characters), character deletion (1–3 characters), character substitution (1–3 characters), character transposition (single adjacent swap), punctuation errors, real-word errors, and loanword errors, followed by expert validation for linguistic plausibility. The dataset can be reused to train, test, and benchmark NLP models, including spell checkers and language models, for Minangkabau spelling error detection and correction, as well as to evaluate methods by error type and edit length.
Files
Steps to reproduce
The construction of SPEML followed a structured and reproducible four-stage workflow.

1. Source Text Preparation
16,371 Minangkabau sentences were extracted from the "minangNLP" repository (all_data.xlsx), which provides verified academic reference material.

2. Preprocessing
The raw texts were cleaned and normalized to ensure data integrity and quality. This process primarily involved filtering out null or missing values (NaN) and deduplicating sentences. A total of 37 entries (null records and redundant duplicates) were excluded at this stage, resulting in a finalized base corpus of 16,334 unique sentences.

3. Rule-Based Spelling Error Generation
Controlled modifications were applied at the word level using a Python-based framework. To maintain plausibility, generation followed these rules:
- Character Insertion Errors: one randomly chosen character within a token is duplicated 1, 2, or 3 times.
- Deletion and Substitution Errors (branching on token length L):
  - 2 ≤ L < 8: 1 character modification.
  - 8 ≤ L < 12: up to 2 character modifications (including vowel deletion).
  - L ≥ 12: up to 3 character modifications.
- Transposition Errors: the sentence is tokenized into a list of words, one token is selected at random, and two adjacent characters within that token are swapped.
- Punctuation Errors: standard marks (e.g., . , ? !) are replaced with other randomly chosen marks.
- Real-word Errors: the closest valid word in the vocabulary is found using Levenshtein distance and substituted for the original token. The modified tokens are then reassembled into sentences to produce a corpus with realistic errors.
- Loanword Errors: for each selected token, the system checks whether the token appears in the loanword dictionary. If not, the token is skipped and left unchanged; if so, it is replaced according to the dictionary entries (using the predefined loanword variants/misspellings). The modified token is re-inserted in its original position, and the resulting sentences are collected as the Loanword Error Corpus.

4. Expert Validation
To ensure linguistic realism, two Minangkabau language specialists reviewed representative samples from all categories. Each entry was assessed against three criteria: (1) it represents real-world spelling variations, (2) it maintains structural word integrity, and (3) it contains no linguistically impossible character combinations. Based on this feedback, the generation rules were iteratively refined before the final dataset was produced.
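The character-level rules of stage 3 can be sketched in a few lines of Python. This is a minimal illustration, not the SPEML generation code: the function names (`insert_error`, `max_edits`, `delete_error`, `transpose_error`) and the guard that keeps tokens at least three characters long are assumptions for the sketch.

```python
import random

def insert_error(word, n=1):
    """Insertion error: duplicate one randomly chosen character n times (n = 1-3)."""
    i = random.randrange(len(word))
    return word[:i + 1] + word[i] * n + word[i + 1:]

def max_edits(word):
    """Length-dependent branching used for deletion/substitution errors."""
    L = len(word)
    if L >= 12:
        return 3
    if L >= 8:
        return 2
    if L >= 2:
        return 1
    return 0

def delete_error(word):
    """Deletion error: remove up to max_edits(word) random characters."""
    for _ in range(max_edits(word)):
        if len(word) > 2:  # assumed guard so short tokens stay recognizable
            i = random.randrange(len(word))
            word = word[:i] + word[i + 1:]
    return word

def transpose_error(word):
    """Transposition error: swap one pair of adjacent characters."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]
```

Substitution errors follow the same `max_edits` branching, replacing rather than removing the selected characters.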
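The real-word and loanword steps can likewise be sketched. The Levenshtein implementation below is the standard dynamic-programming edit distance; `real_word_error`, `loanword_error`, and the example vocabulary/dictionary contents are hypothetical stand-ins for the actual SPEML resources.

```python
import random

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def real_word_error(token, vocab):
    """Real-word error: replace token with the closest *different* valid word."""
    candidates = [w for w in vocab if w != token]
    return min(candidates, key=lambda w: levenshtein(token, w))

def loanword_error(token, loanword_dict):
    """Loanword error: if the token is in the loanword dictionary, substitute
    one of its predefined misspelled variants; otherwise leave it unchanged."""
    variants = loanword_dict.get(token)
    return random.choice(variants) if variants else token
```

In both cases the modified token is re-inserted at its original position before the sentence is written back to the corpus.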
Institutions
- Universitas Ahmad Dahlan, Yogyakarta