INMAD (Indonesia - Madurese) Sentences Dataset

Published: 25 June 2025 | Version 1 | DOI: 10.17632/dt7d8dmfgs.1
Contributor:
Fairuz Iqbal Maulana

Description

The INMAD (Indonesia-Madura) Sentence Dataset is a collection of parallel sentences in Indonesian and Madurese, intended to support the development of natural language processing (NLP) technology for low-resource languages. The dataset combines three sources from the IndoNLP website: Korpus Nusantara (1,100 sentences), NusaX MT (994 sentences), and Nusa Paragraf (9,449 sentences). Together these yield 11,543 parallel sentences, consolidated into a single CSV file named Dataset Raw. These 11,543 sentences were manually translated into Madurese at the 'engghi-enten' speech level by expert translators to ensure translation quality. To increase the variety and quantity of sentences, the Indonesian side was then augmented by back-translation with the MarianMT model in Python: each Indonesian sentence was translated into English and then back into Indonesian. The 11,543 back-translated Indonesian sentences reuse the 'engghi-enten' Madurese translations of their original sentences, which doubles the dataset to 23,086 Indonesian-Madurese lines. The dataset is distributed in CSV format with 23,086 lines, columns labelled 'Indonesia' and 'Madura', and UTF-8 encoding. The aim is a high-quality parallel dataset for linguistic and language technology applications, including machine translation training, language preservation, and digital dictionary development.
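The augmentation step described above can be sketched as follows. This is a minimal illustration, not the contributor's actual script: the function names are hypothetical, the checkpoint names in the comments are assumptions (the description only says "MarianMT"), and placing the augmented rows after the original rows is also an assumption. The translators are passed in as plain functions so the pairing logic can be checked without downloading a model.

```python
from typing import Callable, List, Tuple

# A translator maps a list of sentences to a list of translated sentences.
Translator = Callable[[List[str]], List[str]]


def make_marian_translator(model_name: str, batch_size: int = 16) -> Translator:
    """Build a batch translator from a MarianMT checkpoint.

    Requires the `transformers` library with a PyTorch backend.
    """
    from transformers import MarianMTModel, MarianTokenizer  # lazy import

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(sentences: List[str]) -> List[str]:
        outputs: List[str] = []
        # Translate in small chunks to keep memory use bounded.
        for start in range(0, len(sentences), batch_size):
            chunk = sentences[start:start + batch_size]
            batch = tokenizer(chunk, return_tensors="pt",
                              padding=True, truncation=True)
            generated = model.generate(**batch)
            outputs.extend(
                tokenizer.batch_decode(generated, skip_special_tokens=True))
        return outputs

    return translate


def augment_with_back_translation(
    pairs: List[Tuple[str, str]],
    id_to_en: Translator,
    en_to_id: Translator,
) -> List[Tuple[str, str]]:
    """Return the original pairs followed by back-translated pairs.

    Each back-translated Indonesian sentence keeps the Madurese
    translation of the sentence it was derived from.
    """
    indonesian = [ind for ind, _ in pairs]
    back_translated = en_to_id(id_to_en(indonesian))
    augmented = [(new_ind, mad)
                 for new_ind, (_, mad) in zip(back_translated, pairs)]
    return list(pairs) + augmented


# Possible wiring (checkpoint names are assumptions, not from the dataset page):
# id_to_en = make_marian_translator("Helsinki-NLP/opus-mt-id-en")
# en_to_id = make_marian_translator("Helsinki-NLP/opus-mt-en-id")
# full = augment_with_back_translation(pairs, id_to_en, en_to_id)
```

Applied to the 11,543 source pairs (1,100 + 994 + 9,449), this kind of procedure yields 2 × 11,543 = 23,086 lines, matching the count reported in the description.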

Steps to reproduce

This corpus consists of more than 23,000 Indonesian-Madurese sentence pairs, stored in a .csv file that uses tabs as column separators. All Madurese translations were produced manually at the engghi-enten level by expert translators to ensure the quality and naturalness of the translations.
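Reading the file can be sketched with Python's standard csv module. This is a minimal example under stated assumptions: the file name inmad.csv is hypothetical, the header row is assumed to carry the 'Indonesia' and 'Madura' column labels mentioned in the description, and default CSV quoting rules are assumed.

```python
import csv
from typing import List, Tuple


def load_pairs(path: str) -> List[Tuple[str, str]]:
    """Load (Indonesian, Madurese) sentence pairs from a tab-separated file."""
    with open(path, encoding="utf-8", newline="") as handle:
        # The dataset uses tabs, not commas, despite the .csv extension.
        reader = csv.DictReader(handle, delimiter="\t")
        return [(row["Indonesia"], row["Madura"]) for row in reader]


# Hypothetical usage; "inmad.csv" stands in for the downloaded file name.
# pairs = load_pairs("inmad.csv")
# print(len(pairs))
```

If sentences in the file contain unescaped double quotes, passing quoting=csv.QUOTE_NONE to csv.DictReader may be needed; whether that applies here is not stated on the dataset page.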

Institutions

  • Bina Nusantara University

Categories

Computer Science, Natural Language Processing, Corpus Linguistics, Bahasa Indonesia

Licence