RIYE Audio Dataset: A Multidialectal Speech Corpus for Low-Resource Language Processing

Published: 29 May 2026 | Version 1 | DOI: 10.17632/kt996wpns5.1
Contributors:

Description

This dataset is a curated collection of high-fidelity, field-recorded audio samples developed under the Digiculture RIYE Project's ethnographic survey framework. Created to help bridge the digital divide for under-resourced languages, the corpus captures diverse regional speech patterns, tonal variations, and distinct phonetic markers that are essential for localised speech research. It is designed to support a range of machine learning, signal processing, and computational linguistics tasks. Because the audio is organised for straightforward feature engineering, it can serve as a baseline for researchers developing neural networks, automatic speech recognition (ASR) engines, and lightweight audio classification systems optimised for edge deployment.

Files

Steps to reproduce

To reproduce the results or use this dataset for speech analysis, the process begins with extracting and organising the raw audio files. Once the dataset archive is downloaded, it should be extracted into a dedicated local directory. The dataset follows a folder-as-class layout: the ground-truth label for every audio sample is the name of the folder in which it resides. This structure simplifies the preprocessing pipeline, because dialect categories can be mapped programmatically from the directory tree without an external metadata database.

The next phase automates data ingestion through a script that traverses these directories. The script should walk through the root folder, treating each subfolder name as a unique class or dialect category. As it iterates through each folder, it collects the audio files and pairs each one with its directory name as the target label. This approach is compatible with modern machine learning frameworks, whose data loaders can often be configured to "scan and label" automatically, which keeps the dialect labels consistent across different training environments.

Once the files are mapped, the audio data should be standardised to ensure uniform input for a model. This typically involves loading the waveforms at a consistent sampling rate and applying padding or truncation so that every sample has the same temporal duration. Because the labels are derived from the folder names, it is easy to perform a stratified split so that the training, validation, and testing sets each contain a proportional number of samples from each dialect folder. This workflow keeps the transition from raw field recordings to a structured, training-ready dataset efficient and less error-prone.
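The workflow described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the project's own tooling. The function names (`scan_dataset`, `fix_length`, `stratified_split`), the list of audio extensions, and the 80/10/10 split are all assumptions made for the example, since the record does not specify the file format, the sampling rate, or the split ratios.

```python
import random
from collections import defaultdict
from pathlib import Path

# Assumption: the dataset record does not state the audio format.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}


def scan_dataset(root):
    """Return (path, label) pairs; each label is the name of the file's folder."""
    pairs = []
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        for audio_path in sorted(class_dir.iterdir()):
            if audio_path.is_file() and audio_path.suffix.lower() in AUDIO_EXTENSIONS:
                pairs.append((audio_path, class_dir.name))
    return pairs


def fix_length(samples, target_len, pad_value=0.0):
    """Truncate or right-pad a 1-D sequence so it has exactly target_len items."""
    samples = list(samples[:target_len])
    return samples + [pad_value] * (target_len - len(samples))


def stratified_split(pairs, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split pairs into train/val/test so each label keeps the same proportions."""
    by_label = defaultdict(list)
    for item in pairs:
        by_label[item[1]].append(item)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = int(len(items) * fractions[0])
        n_val = int(len(items) * fractions[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

A typical call would be `train, val, test = stratified_split(scan_dataset("path/to/extracted/archive"))`, where the path is a placeholder for wherever the archive was extracted. Decoding the waveforms at a consistent sampling rate is not shown, because it depends on an audio library of the researcher's choice. `fix_length` would then be applied to each decoded waveform, with `target_len` equal to the chosen duration in seconds multiplied by the sampling rate.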

Institutions

Categories

Linguistics, Computer Science, Artificial Intelligence, Natural Language Processing, Speech Analysis

Funders

  • Tertiary Education Trust Fund [TETFUND]
    Grant ID: 2023 TETFund/NRF/HSS/HST_0040

Licence