HVULao_NLP: A Word-Segmented and POS-Tagged Lao Corpus

Published: 1 September 2025 | Version 1 | DOI: 10.17632/5zwym7kwn8.1
Contributors:

Description

The HVULao_NLP project is dedicated to sharing datasets and tools for Lao Natural Language Processing (NLP), developed and maintained by the research team at Hung Vuong University (HVU), Phu Tho, Vietnam. The project is supported by Hung Vuong University with the aim of advancing research and applications in low-resource language processing, particularly for the Lao language.

---

📁 Datasets

This release provides a semi-automatically constructed corpus of Lao sentences that have been **word-segmented** and **part-of-speech (POS) tagged**. It is designed to support a wide range of NLP applications, including language modeling, sequence labeling, linguistic research, and the development of Lao language tools.

- **Datatest1k/** – Test set (1,000 Lao sentences)
  - `testorgin1000.txt`: Original raw sentences (UTF-8, one sentence per line).
  - `testsegsent_1000.txt`: Word-segmented version aligned 1-to-1 with the raw file (tokens separated by spaces).
  - `testtag1k.json`: Word-segmented and POS-tagged sentences, generated using large language models (LLMs) and manually reviewed by native linguists.
- **Datatrain10k/** – Training set (10,000 Lao sentences)
  - `10ktrainorin.txt`: Original raw sentences (UTF-8, one sentence per line).
  - `10ksegmented.txt`: Word-segmented version aligned 1-to-1 with the raw file.
  - `10ktraintag.json`: Word-segmented and POS-tagged sentences, generated with the same method as the test set.
- **lao_finetuned_10k/** – A fine-tuned transformer-based model for Lao word segmentation, compatible with Hugging Face’s `transformers` library.

All data files are encoded in **UTF-8 (NFC)** and prepared for direct use in NLP pipelines; a minimal loading sketch is given after this description.

---

📁 The Lao sentence segmentation tool

A command-line tool for Lao word segmentation built with a fine-tuned Hugging Face `transformers` model and PyTorch.

**Features**
- Accurate Lao word segmentation using a pre-trained model
- Simple command-line usage
- GPU support (if available)

**Example usage**
```bash
python3 segment_lao.py -i ./data/lao_raw.txt -o ./output/lao_segmented.txt
```

---

📁 The Lao sentence POS tagging tool

A POS tagging tool for segmented Lao text, implemented with Python and CRF++.

**Example usage**
```bash
python3 Pos_tagging.py ./Test/lao_sentences_segmented.txt Test1
```

---

📚 Usage

The HVULao_NLP dataset and tools are intended for:
- Training and evaluating sequence labeling models (e.g., CRF, BiLSTM, mBERT)
- Developing Lao NLP tools (e.g., POS taggers, tokenizers)
- Conducting linguistic and computational research on Lao
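As a quick orientation, the sketch below loads the test split in Python: the raw and segmented text files are read line by line, and the tagged JSON is parsed. The file names come from this release; the JSON schema is not documented in this description, so treat the handling of `tagged` as an assumption and adjust it to the actual structure.

```python
import json
from pathlib import Path

DATA_DIR = Path("Datatest1k")  # or Path("Datatrain10k") for the training split

# Raw and segmented files are UTF-8, one sentence per line, aligned 1-to-1.
raw_sentences = (DATA_DIR / "testorgin1000.txt").read_text(encoding="utf-8").splitlines()
seg_sentences = (DATA_DIR / "testsegsent_1000.txt").read_text(encoding="utf-8").splitlines()
assert len(raw_sentences) == len(seg_sentences), "raw and segmented files should align 1-to-1"

# Tokens in the segmented file are separated by spaces.
tokenized = [line.split() for line in seg_sentences]

# POS-tagged data. The exact JSON schema is not documented here, so inspect the
# first entry and adapt downstream code accordingly.
with open(DATA_DIR / "testtag1k.json", encoding="utf-8") as f:
    tagged = json.load(f)

print(raw_sentences[0])
print(tokenized[0])
print(tagged[0] if isinstance(tagged, list) else next(iter(tagged.items())))
```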

Files

Steps to reproduce

1. Use the raw text files as input for training or evaluation.
2. Use the word-segmented files for segmentation experiments.
3. Use the JSON files for POS tagging or other sequence labeling tasks.
4. The fine-tuned model can be loaded with Hugging Face’s `transformers` library for segmentation (a minimal loading sketch follows this list).
5. The POS tagging tool can be run with Python and CRF++ following the included instructions.
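For step 4, the sketch below shows one way to load the released model with the `transformers` library. It assumes `lao_finetuned_10k/` is a token-classification checkpoint whose labels encode word-boundary decisions; the label handling is illustrative only, and the bundled `segment_lao.py` script should be treated as the reference way to run segmentation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_DIR = "lao_finetuned_10k"  # path to the released fine-tuned model

# Assumption: the checkpoint is a token-classification model whose labels encode
# word-boundary decisions. Check the model's config.json (id2label) for the actual label set.
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForTokenClassification.from_pretrained(MODEL_DIR)
model.eval()

text = "ສະບາຍດີ"  # raw, unsegmented Lao text
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
labels = [model.config.id2label[i] for i in pred_ids]

# Inspect predicted labels per sub-token; turning them into space-separated
# words depends on the tagging scheme used during fine-tuning.
for token, label in zip(tokens, labels):
    print(token, label)
```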

Categories

Computer Science, Artificial Intelligence, Computational Linguistics, Natural Language Processing, Low-Resource LLM

Funders

  • Hung Vuong University, Phu Tho, Viet Nam
    Grant ID: 25/2024/KHCN (HV25.2024)

Licence