Balinese Text-to-Speech Dataset as Digital Cultural Heritage
Description
This dataset is a collection of 1187 audio recordings from native Balinese speakers. The recordings cover several speech levels of Balinese, including Alus Singgih, Alus Mider, Andap, Mider, and Alus Sor, and also include phrases and letters of the alphabet to provide wider linguistic variation. The dataset is designed to support the development of voice-based applications, including Text-to-Speech (TTS) systems and automatic speech recognition (speech-to-text conversion). It also supports research in natural language processing (NLP), especially for regional languages that still have little digital representation. We expect the dataset to enrich voice-based technology and to strengthen the presence of Balinese in the digital era. With this data, researchers and developers can build systems that support the preservation of regional languages as part of Indonesia's cultural heritage.
Steps to reproduce
To collect data for this study, we followed a fixed protocol to ensure the quality and consistency of the Balinese Text-to-Speech (TTS) dataset. First, we chose Denpasar, Bali, as the study area because it is the cultural and linguistic center of Bali, with native speakers fluent in the various politeness levels of Balinese. Audio was recorded live with a high-quality condenser microphone and a digital recording application, at a sampling rate of 44.1 kHz and a bit depth of 24 bits, and saved in WAV format to preserve sound quality. The recordings were made by a native speaker, I Gede Ngurah Arya Wira Putra, who has a Badung accent, and cover the various levels of Balinese. After recording, we processed the audio in Audacity to remove noise. Next, the recordings were segmented and labeled by language type and phrase for easier organization and analysis. The data were then stored in a structured repository so that they are easy to access and manage, ready for use in TTS applications or further research. A similar dataset can be produced by repeating the same procedure.
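As a hedged illustration of how the recording parameters above could be checked after download or re-recording, the following minimal Python sketch scans a folder of WAV files and reports any file that does not match the stated 44.1 kHz sampling rate and 24-bit depth. It uses only the standard-library `wave` module. The function names `check_wav` and `check_dataset`, and any folder path passed to them, are assumptions made for this example and are not part of the dataset. Files that the `wave` module cannot parse (for example, some non-PCM or extensible-format WAV headers on older Python versions) are reported rather than treated as passing.

```python
import wave
from pathlib import Path

# Parameters stated in the recording procedure above.
EXPECTED_RATE_HZ = 44_100
EXPECTED_SAMPLE_WIDTH_BYTES = 3  # 24 bits = 3 bytes per sample


def check_wav(path):
    """Return a list of problems found in one WAV file (empty if none)."""
    problems = []
    try:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            width = wav.getsampwidth()
            if rate != EXPECTED_RATE_HZ:
                problems.append(f"sampling rate is {rate} Hz")
            if width != EXPECTED_SAMPLE_WIDTH_BYTES:
                problems.append(f"bit depth is {width * 8} bits")
            if wav.getnframes() == 0:
                problems.append("file contains no audio frames")
    except (wave.Error, EOFError) as exc:
        # The file is missing a valid header or uses an unsupported encoding.
        problems.append(f"could not be read as PCM WAV: {exc}")
    return problems


def check_dataset(root):
    """Map each non-conforming .wav file under `root` to its problems."""
    report = {}
    for path in sorted(Path(root).rglob("*.wav")):
        problems = check_wav(path)
        if problems:
            report[str(path)] = problems
    return report
```

Calling `check_dataset` on the folder that holds the recordings returns an empty dictionary when every file matches the stated parameters; otherwise each key is a file path and each value lists what differs. The check covers only the container parameters and does not assess noise level or labeling.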