HVULao_NLP: A Word-Segmented Lao Corpus with Optional POS Tags
Description
The HVULao_NLP dataset is the first publicly released word-segmented corpus for the Lao language, designed to support natural language processing (NLP) research in low-resource settings. It consists of 10,000 training sentences and a 1,000-sentence gold-standard test set, each provided in raw, segmented, and optionally POS-tagged formats. Sentences were collected from trusted Lao governmental, news, and institutional websites, and annotated through a semi-automatic pipeline combining large language models with manual review by native Lao annotators and a linguist. All files are encoded in UTF-8 (NFC). This dataset enables benchmarking of Lao word segmentation and POS tagging, and supports broader applications in multilingual NLP, cross-lingual transfer, and linguistic analysis.

Files included
- Datatrain10k/ – Training set (10,000 sentences; raw, segmented, POS-tagged)
- Datatest1k/ – Test set (1,000 sentences; raw, segmented, POS-tagged)
- lao_finetuned_10k/ – Pretrained model for Lao word segmentation (optional)
- seglao.py – Script for applying the segmentation model
- requirements.txt – Dependencies (optional, see GitHub for updates)

👉 For detailed instructions and updated tools, please refer to the GitHub repository: https://github.com/HaHVU/HVULao_NLP
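Because the files are stated to be UTF-8 in NFC form, it can help to read them with an explicit encoding and normalize any additional text (for example, your own Lao input) the same way before comparing it with the corpus. The following is a minimal sketch that uses only the Python standard library. The helper names `read_nfc_lines` and `is_nfc` are illustrative and not part of the dataset's tooling, and the one-sentence-per-line layout is an assumption about the raw files.

```python
import unicodedata
from pathlib import Path


def read_nfc_lines(path):
    """Read a UTF-8 text file and return its non-empty lines, NFC-normalized.

    Assumes one sentence per line (an assumption, not confirmed by the
    dataset description).
    """
    text = Path(path).read_text(encoding="utf-8")
    lines = []
    for line in text.splitlines():
        line = unicodedata.normalize("NFC", line.strip())
        if line:
            lines.append(line)
    return lines


def is_nfc(text):
    """Return True if the text is already in NFC form (Python 3.8+)."""
    return unicodedata.is_normalized("NFC", text)
```

Calling `read_nfc_lines` on a raw file under `Datatrain10k/` would return a list of sentences; the exact file names inside that folder are not given here, so check the repository before hard-coding a path.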
Steps to reproduce
1. Use the raw text files as input for training or evaluation.
2. Use the word-segmented files for segmentation experiments.
3. Use the JSON files for POS tagging or other sequence labeling tasks.
4. The fine-tuned model can be loaded with Hugging Face’s transformers library for segmentation.
5. The POS tagging tool can be run with Python and CRF++ following the included instructions.
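For segmentation experiments against the gold-standard test set, a common way to score a system is word-level precision, recall, and F1 computed over character spans. The sketch below illustrates that metric; it is not the authors' evaluation script. It assumes each segmented sentence is a single string with words joined by a separator (`sep`, a space by default), which may differ from the actual file format, so adjust `sep` to match the segmented files.

```python
def word_spans(segmented, sep=" "):
    """Convert a segmented sentence into a set of (start, end) character spans.

    Offsets are counted over the text with separators removed, so gold and
    predicted segmentations of the same raw sentence are directly comparable.
    """
    spans = set()
    pos = 0
    for word in segmented.split(sep):
        if not word:
            continue  # skip empty tokens produced by repeated separators
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def segmentation_prf(gold_sentences, pred_sentences, sep=" "):
    """Micro-averaged word-level precision, recall, and F1."""
    if len(gold_sentences) != len(pred_sentences):
        raise ValueError("gold and predicted sentence counts differ")
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g = word_spans(gold, sep)
        p = word_spans(pred, sep)
        correct += len(g & p)  # words whose start and end both match
        n_gold += len(g)
        n_pred += len(p)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

A word counts as correct only when both of its boundaries match the gold segmentation, so an under-segmented prediction loses recall and an over-segmented one loses precision.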
Funders
- Hung Vuong University, Phu Tho, Viet Nam
  Grant ID: 25/2024/KHCN (HV25.2024)