Parallel Corpus Dataset of Indonesian and Bengkulu Malay Language

Published: 15 March 2024 | Version 1 | DOI: 10.17632/tnk42hhjjk.1
Bella Miranda


This is a parallel corpus in which each Indonesian sentence is paired with a corresponding Bengkulu Malay sentence. It was created to support machine learning for translation between the two languages. The sentences cover a range of contexts and topics, including everyday language, language variation, and expressions and idioms typical of both languages. The dataset was compiled independently and contains 5261 sentences in total. The Indonesian sentences were collected from online sources, and the compilation also uses the Malay language dictionary from the Bengkulu library, located in the city of Bengkulu, Bengkulu Province, Indonesia.


Steps to reproduce

The dataset is organized in two folders.

Original data component: This folder contains three files: corpus.csv, a file of Indonesian text, and corpus.bkl. The corpus.csv file is the parallel corpus itself, holding the texts in both Indonesian and Bengkulu Malay to provide parallel data for cross-language research and linguistic analysis. The Indonesian file contains all raw texts in Indonesian, one sentence per line. The corpus.bkl file contains all raw texts in Bengkulu Malay, also one sentence per line.

Training data component: This folder contains five files for the Indonesian text and five files for corpus.bkl. Each file holds 1000 sentences and is used as test material in the model training process.
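Because the two raw text files store one sentence per line, sentence pairs can be recovered by reading both files in step. The following Python sketch is illustrative only and is not part of the dataset: the function names read_parallel and chunk are hypothetical, UTF-8 encoding is an assumption, and the file paths must be supplied by the user.

```python
from itertools import islice
from pathlib import Path


def read_parallel(src_path, tgt_path, encoding="utf-8"):
    """Read two line-aligned text files and return (source, target) pairs.

    Assumes one sentence per line in each file and UTF-8 text (an assumption,
    not stated in the dataset description).
    """
    src = Path(src_path).read_text(encoding=encoding).splitlines()
    tgt = Path(tgt_path).read_text(encoding=encoding).splitlines()
    # A line-count mismatch means the files are not aligned sentence by sentence.
    if len(src) != len(tgt):
        raise ValueError(f"line counts differ: {len(src)} vs {len(tgt)}")
    return [(s.strip(), t.strip()) for s, t in zip(src, tgt)]


def chunk(pairs, size=1000):
    """Yield consecutive lists of at most `size` pairs."""
    it = iter(pairs)
    while batch := list(islice(it, size)):
        yield batch
```

Calling read_parallel with the path of the Indonesian file and the path of corpus.bkl would return the aligned pairs, and chunk(pairs, 1000) would group them into blocks of 1000 sentences, mirroring the file size described for the training data folder. Whether the published training files were produced this way is not stated.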


Universitas Ahmad Dahlan


Natural Language Processing, Machine Translation, Corpus Analysis, Language