HVULao_NLP: A Word-Segmented Lao Corpus with Optional POS Tags

Published: 2 September 2025 | Version 2 | DOI: 10.17632/5zwym7kwn8.2
Contributors:

Description

The HVULao_NLP dataset is the first publicly released word-segmented corpus for the Lao language, designed to support natural language processing (NLP) research in low-resource settings. It consists of 10,000 training sentences and a 1,000-sentence gold-standard test set, each provided in raw, segmented, and optionally POS-tagged formats. Sentences were collected from trusted Lao governmental, news, and institutional websites, and annotated through a semi-automatic pipeline combining large language models with manual review by native Lao annotators and a linguist. All files are encoded in UTF-8 (NFC). This dataset enables benchmarking of Lao word segmentation and POS tagging, and supports broader applications in multilingual NLP, cross-lingual transfer, and linguistic analysis.

Files included

  • Datatrain10k/: Training set (10,000 sentences; raw, segmented, POS-tagged)
  • Datatest1k/: Test set (1,000 sentences; raw, segmented, POS-tagged)
  • lao_finetuned_10k/: Pretrained model for Lao word segmentation (optional)
  • seglao.py: Script for applying the segmentation model
  • requirements.txt: Dependencies (optional, see GitHub for updates)

For detailed instructions and updated tools, please refer to the GitHub repository: https://github.com/HaHVU/HVULao_NLP
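As an illustration of the segmentation benchmarking mentioned above, the following is a minimal sketch of word-level precision, recall, and F1 computed over character spans. It is not the authors' evaluation script. The function names `to_spans` and `segmentation_prf` are hypothetical, and the sketch assumes each sentence has already been read into a list of words; how the segmented files delimit words is not stated here, so file parsing is left out.

```python
import unicodedata


def to_spans(words):
    """Map a list of words to the set of (start, end) character spans they cover."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def segmentation_prf(gold_sents, pred_sents):
    """Word-level precision, recall, and F1 over parallel lists of segmented sentences.

    A predicted word counts as correct only if both of its boundaries match a gold word.
    """
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        # Normalize to NFC so both sides use the same code-point sequence.
        gold = [unicodedata.normalize("NFC", w) for w in gold]
        pred = [unicodedata.normalize("NFC", w) for w in pred]
        if "".join(gold) != "".join(pred):
            raise ValueError("gold and predicted sentences differ in their characters")
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, with placeholder strings rather than real Lao text, `segmentation_prf([["ab", "c", "de"]], [["ab", "cde"]])` matches one of two predicted words against three gold words, giving a precision of 0.5 and a recall of 1/3.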

Steps to reproduce

1. Use the raw text files as input for training or evaluation.
2. Use the word-segmented files for segmentation experiments.
3. Use the JSON files for POS tagging or other sequence labeling tasks.
4. The fine-tuned model can be loaded with Hugging Face’s transformers library for segmentation.
5. The POS tagging tool can be run with Python and CRF++ following the included instructions.
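For the segmentation experiments in step 2, one common framing is character-level sequence labeling, where each character is tagged B (begins a word) or I (inside a word). The sketch below shows that conversion and its inverse. It is an assumption about how the segmented data might be used, not a description of how seglao.py or the fine-tuned model works. The function names are hypothetical, the input is assumed to be a list of words per sentence, and labels are assigned per Unicode code point, so Lao vowel signs and tone marks receive their own I labels.

```python
import unicodedata


def words_to_bi_labels(words):
    """Turn a segmented sentence (list of words) into parallel character and B/I label lists."""
    chars, labels = [], []
    for word in words:
        word = unicodedata.normalize("NFC", word)
        for i, ch in enumerate(word):
            chars.append(ch)
            # The first code point of each word opens a new word; the rest continue it.
            labels.append("B" if i == 0 else "I")
    return chars, labels


def bi_labels_to_words(chars, labels):
    """Rebuild the word list from characters and their B/I labels."""
    words = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words
```

Round-tripping a sentence through `words_to_bi_labels` and then `bi_labels_to_words` should return the original word list, which makes a quick sanity check before training a labeler on the converted data.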

Categories

Computer Science, Artificial Intelligence, Computational Linguistics, Natural Language Processing, Low-Resource LLM

Funders

  • Hung Vuong University, Phu Tho, Viet Nam
    Grant ID: 25/2024/KHCN (HV25.2024)

Licence