RAWDysPeech: A Preprocessed Raw Audio Dataset For Speech Dysarthria

Published: 11 November 2024 | Version 1 | DOI: 10.17632/3mhnr7frht.1
Contributor:
Arya Shah

Description

RAWDysPeech: A Preprocessed Raw Audio Dataset For Speech Dysarthria is a speech dysarthria dataset intended for audio classification, speech detection, and related avenues of research in ASR. RAWDysPeech consists of raw audio files segregated into two classes, 1 and 0, where 1 denotes dysarthric speech and 0 denotes normal speech. We combine and preprocess several of the most popular openly available speech datasets, including TORGO, UASPEECH, Ultrax, and EasyCall. A brief description of the preprocessing and combination steps follows; if this dataset helps in your research, we also encourage you to cite the original authors.

This dataset provides preprocessed speech recordings from the UASPEECH database, enhanced for machine learning applications using noise reduction and signal processing techniques.

Dataset Description

The audio recordings have been processed using:

I. FFT-based noise reduction
- Hanning window application for better frequency analysis
- 16-bit audio depth processing
- 44.1 kHz sampling rate[1]
- Stereo channel support with dual MEMS microphone configuration[2]

II. Preprocessing Steps (Signal Processing)
- Background noise subtraction using ambient noise sampling
- Frequency spectrum analysis with FFT
- Amplitude scaling and normalization
- Single-sided FFT amplitude doubling for accurate frequency representation[1]

III. Audio Parameters
- Bit depth: 16-bit (pyaudio.paInt16)
- Sample rate: 44.1 kHz
- Buffer size: 44,100 frames
- Channel configuration: supports both mono and stereo recording[2]

IV. File Format
- Audio files are saved in .WAV format
- Timestamps are included in filenames (YYYY_MM_DD_HH_MM_SS_pyaudio)
- Data is organized in dedicated data folders with automated directory creation[1]

V. Applications
- Speech recognition systems
- Dysarthric speech analysis
- Audio classification tasks
- Speech pattern recognition
- Acoustic model training

Technical Implementation

The preprocessing pipeline includes real-time audio capture, noise profiling, FFT analysis, and spectrogram generation, making it suitable for both research and practical applications. Illustrative sketches of the noise-reduction and capture steps follow the citations below.

Citations

[1] Heejin Kim, Mark Hasegawa-Johnson, Jonathan Gunderson, Adrienne Perlman, Thomas Huang, Kenneth Watkin, Simone Frame, Harsh Vardhan Sharma, Xi Zhou (March 17, 2023), "UASpeech", IEEE Dataport. doi: https://dx.doi.org/10.21227/f9tc-ab45
[2] Rudzicz, F., Namasivayam, A.K., Wolff, T. (2012), "The TORGO database of acoustic and articulatory speech from speakers with dysarthria", Language Resources and Evaluation, 46(4), 523-541.
[3] Shah, Arya; Qureshi, Aymen; Polprasert, Chantri (2024), "ADAPTIVE: A Novel Dataset For Acoustic DysArthria deTection through temPoral Inference and Voice Engineering", Mendeley Data, V1. doi: 10.17632/j5bgddf6rp.1
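The released files do not bundle the preprocessing script itself, so the following is a minimal Python sketch of the noise-reduction step described above (Hanning window, ambient-noise profiling, spectral subtraction of the noise profile, normalization back to 16-bit range). The reduce_noise function name, the 2048-sample frame size, and the assumption of mono 16-bit .WAV input are illustrative choices, not part of the released pipeline.

    import numpy as np
    from scipy.io import wavfile

    def reduce_noise(speech_path, noise_path, out_path, frame=2048):
        # Hypothetical spectral-subtraction denoiser; assumes mono 16-bit WAVs.
        rate, speech = wavfile.read(speech_path)
        _, noise = wavfile.read(noise_path)
        speech = speech.astype(np.float64)
        noise = noise.astype(np.float64)

        window = np.hanning(frame)              # Hanning window before each FFT
        # Average magnitude spectrum of the ambient-noise sample = noise profile
        mags = [np.abs(np.fft.rfft(noise[i:i + frame] * window))
                for i in range(0, len(noise) - frame, frame)]
        noise_mag = np.mean(mags, axis=0)

        out = np.zeros(len(speech))
        for i in range(0, len(speech) - frame, frame // 2):   # 50% overlap-add
            spec = np.fft.rfft(speech[i:i + frame] * window)
            mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # subtract noise profile
            out[i:i + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame)

        out *= 32767.0 / np.max(np.abs(out))    # normalize to the 16-bit range
        wavfile.write(out_path, rate, out.astype(np.int16))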
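Likewise, a minimal capture sketch matching the stated audio parameters (pyaudio.paInt16, 44.1 kHz, a 44,100-frame buffer, timestamped .WAV output with automated directory creation). The record function, the "data" directory name, and the 5-second default duration are assumptions for illustration; mono capture is shown, and CHANNELS can be set to 2 for the stereo configuration.

    import os
    import time
    import wave

    import pyaudio

    RATE = 44100       # 44.1 kHz sample rate
    CHUNK = 44100      # buffer size: 44,100 frames (one second)
    CHANNELS = 1       # set to 2 for stereo (dual-microphone setups)

    def record(seconds=5, data_dir="data"):
        os.makedirs(data_dir, exist_ok=True)    # automated directory creation
        pa = pyaudio.PyAudio()
        stream = pa.open(format=pyaudio.paInt16, channels=CHANNELS,
                         rate=RATE, input=True, frames_per_buffer=CHUNK)
        frames = [stream.read(CHUNK) for _ in range(max(1, seconds * RATE // CHUNK))]
        stream.stop_stream()
        stream.close()

        # Timestamped filename, e.g. 2024_11_11_10_30_05_pyaudio.wav
        name = time.strftime("%Y_%m_%d_%H_%M_%S") + "_pyaudio.wav"
        with wave.open(os.path.join(data_dir, name), "wb") as wf:
            wf.setnchannels(CHANNELS)
            wf.setsampwidth(pa.get_sample_size(pyaudio.paInt16))  # 2 bytes = 16-bit
            wf.setframerate(RATE)
            wf.writeframes(b"".join(frames))
        pa.terminate()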

Files

Steps to reproduce

The steps to reproduce are:

1. Procure the dataset files from TORGO, UASPEECH, and similar speech dysarthria databases.
2. Organize the audio files into two folders, 1 and 0 (a minimal loading sketch appears at the end of this section).
3. We provide a baseline script for feature engineering, along with the preprocessed feature dataset, here:
- https://github.com/aryashah2k/Acoustic-DysArthria-deTection-through-temPoral-Inference-and-Voice-Engineering
- https://data.mendeley.com/datasets/j5bgddf6rp/1
4. If you wish to preprocess on your own, follow the baseline steps listed under Dataset Description above (FFT-based noise reduction with a Hanning window, background noise subtraction from an ambient noise sample, amplitude scaling and normalization, and single-sided FFT amplitude doubling, at 16-bit depth and a 44.1 kHz sample rate, saved as timestamped .WAV files). Test scripts are available here:
- https://www.kaggle.com/datasets/aryashah2k/noise-reduced-uaspeech-dysarthria-dataset

If this work helps you in your research, please make sure to cite the ADAPTIVE dataset [3] along with the UASpeech [1] and TORGO [2] sources listed under Citations above.
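As a companion to step 2, here is a minimal Python sketch of how the two class folders might be walked to pair each file with its label before feature engineering. The list_labeled_wavs helper and the root/0, root/1 layout are illustrative assumptions, not part of the released scripts.

    import os

    def list_labeled_wavs(root):
        # Walk the two class folders (1 = dysarthric speech, 0 = normal speech)
        # and return (path, label) pairs for downstream feature extraction.
        pairs = []
        for label in ("0", "1"):
            folder = os.path.join(root, label)
            for fname in sorted(os.listdir(folder)):
                if fname.lower().endswith(".wav"):
                    pairs.append((os.path.join(folder, fname), int(label)))
        return pairs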

Institutions

Asian Institute of Technology

Categories

Machine Learning, Audio Signal Processing, Speech Disorder, Dysarthria, Recurrent Neural Network, Deep Learning

Licence