MadureseSet: Madurese-Indonesian Dataset

Name: MadureseSet: Madurese-Indonesian Dataset
Creator: Noor Ifada
Published: 2023-08-14T02:17:14.791Z
Keywords: Computer Science, Natural Language Processing

Ifada, Noor; Rachman, Fika Hastarita; M, Wildan; Wahyuni, Sri; Pawitra, Adrian

doi:10.17632/nvc3rsf53b.5

Published: 14 August 2023 | Version 5 | DOI: 10.17632/nvc3rsf53b.5

Description

MadureseSet is a digitized version of the physical document Kamus Lengkap Bahasa Madura-Indonesia (The Complete Dictionary of Madurese-Indonesian). It stores a list of Madurese lemmata, i.e., 17,809 basic lemmata and 53,722 substitution lemmata, together with their Indonesian translations. The details of each lemma may include its pronunciation, part of speech, synonym and homonym relations, speech level, dialect, and loanword status.

The dataset creation framework consists of three stages. First, the data extraction stage processes the physical document to produce corrected data in a text file. Second, the data structural review stage examines the text file in terms of its paragraph, homonym, synonym, linguistic, poem, short poem, proverb, and metaphor structures to create the data structure that best represents the information in the dictionary. Finally, the database construction stage builds the physical data model and populates the MadureseSet database.

MadureseSet is validated by a Madurese language expert who is also the author of the physical document that is the source of this dataset. The dataset can therefore serve as a primary source for Natural Language Processing (NLP) research, especially for the Madurese language.

Please cite the following paper to acknowledge use of the dataset in publications:

Ifada, N., Rachman, F.H., Syauqy, M.W.M.A., Wahyuni, S. and Pawitra, A., 2023. MadureseSet: Madurese-Indonesian Dataset. Data in Brief, 48, p.109035. DOI: https://doi.org/10.1016/j.dib.2023.109035

Files

Steps to reproduce

To generate the dataset on a local system, install MySQL Workbench (https://www.mysql.com/products/workbench/) as the MySQL visual administration tool.

Instructions to import the dataset:

1. Go to the "Server" menu.
2. Select the "Data Import" option.
3. Select the "Import from Self-Contained File" radio button.
4. Click the "three dots" button to browse to and open the "madureseSet.mysql" file.
5. Click the "Start Import" button.

Note: You may use any other appropriate MySQL administrative tool of your preference.
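For readers who prefer a scripted import over the Workbench dialog, the following is a minimal Python sketch that pipes the dump file into the standard `mysql` command-line client. It is not part of the dataset's documented procedure. The helper names `build_import_command` and `import_dump`, the `root` user, and the optional target schema are all assumptions made for illustration; whether the dump creates its own schema is not stated here, so a target database may or may not be required.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_import_command(user: str,
                         host: str = "localhost",
                         database: Optional[str] = None) -> List[str]:
    """Build the argument list for the `mysql` command-line client.

    `--password` without a value makes the client prompt for the password
    instead of exposing it on the command line.
    """
    cmd = ["mysql", f"--host={host}", f"--user={user}", "--password"]
    if database:
        # Assumption: only needed if the dump does not select a schema itself.
        cmd.append(database)
    return cmd


def import_dump(dump_path: str,
                user: str,
                host: str = "localhost",
                database: Optional[str] = None) -> None:
    """Feed a self-contained SQL dump to the `mysql` client on stdin."""
    path = Path(dump_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dump file not found: {path}")
    with path.open("rb") as dump:
        # check=True raises CalledProcessError if the import fails.
        subprocess.run(build_import_command(user, host, database),
                       stdin=dump, check=True)
```

A call such as `import_dump("madureseSet.mysql", user="root")` would then be roughly equivalent to steps 1 to 5 above, provided a MySQL server is running and the `mysql` client is on the system path.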

Institutions

Universitas Trunojoyo Madura Fakultas Teknik

Funders

Ministry of Education and Culture, Indonesia
Grant ID: 254/E5/PG.02.00.PT/2022 and 2466/UN46.4.1/PT.01.03/2022
