B-SER: A Bangla speech emotion recognition dataset
Description
B-SER is a speech emotion recognition audio dataset for the Bangla language. It consists of voice data from 34 speakers aged between 19 and 47 (mean = 28.75, standard deviation = 9.346), balanced between 17 males and 17 females. The dataset contains 1224 audio recordings. Four emotional states are recorded for three sentences. The three sentences are i. ‘বারোটা বেজে গেছে’ ("It is twelve o'clock"), ii. ‘আমি জানতাম এমন কিছু হবে’ ("I knew something like this would happen"), and iii. ‘এ কেমন উপহার’ ("What kind of gift is this"). The emotional states are Angry, Happy, Sad, and Surprise. The audio files are in WAV format. The data files are divided into 34 folders, one per participating actor. Each folder contains that actor's 36 audio recordings, which corresponds to 4 emotions × 3 sentences × 3 repetitions and gives 34 × 36 = 1224 recordings in total. The size of the B-SER dataset is 619 MB.

Most existing datasets in other languages are recorded inside a closed studio or cover a single sentence. B-SER was instead recorded with smartphones, which preserves a slightly noisy, real-life environment. B-SER is compatible with various shallow machine learning and deep learning architectures such as CNN, LSTM, HMM, and Transformer.

The naming of each audio file in the B-SER dataset is inspired by the RAVDESS dataset. The filename consists of seven two-digit numerical identifiers separated by hyphens (e.g., 03-01-01-01-02-02-02.wav). Each two-digit identifier defines the level of a different experimental factor. The identifiers are ordered: Modality - Scripted - Emotion - Intensity - Statement number - Repetition number - Actor.wav. For example, the filename “03-01-02-01-03-03-02.wav” refers to: Audio only (03) - Scripted (01) - Sad (02) - Normal intensity (01) - 3rd Statement (03) - 3rd Repetition (03) - 2nd Actor (02). Even actor numbers denote female actors and odd actor numbers denote male actors.
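The filename convention described above can be decoded programmatically. The following is a minimal Python sketch, not part of the dataset's own tooling: the function name `parse_bser_filename` and the field names are hypothetical, chosen for illustration. Only the emotion code 02 (Sad) is confirmed by the worked example in the description, so the other emotion codes are left as raw strings rather than guessed.

```python
from pathlib import PurePath

# Field order as given in the description (names are illustrative assumptions).
FIELDS = (
    "modality", "scripted", "emotion", "intensity",
    "statement", "repetition", "actor",
)

# Only "02" -> Sad is confirmed by the example in the description;
# the remaining emotion codes are not documented there, so they are not mapped.
KNOWN_EMOTIONS = {"02": "Sad"}


def parse_bser_filename(filename):
    """Split a B-SER style filename into its seven two-digit identifiers."""
    stem = PurePath(filename).stem  # drop any folder prefix and the .wav suffix
    parts = stem.split("-")
    if len(parts) != len(FIELDS) or not all(
        len(p) == 2 and p.isdecimal() for p in parts
    ):
        raise ValueError(f"unexpected filename format: {filename!r}")

    info = dict(zip(FIELDS, parts))
    # Even actor numbers are female, odd actor numbers are male.
    info["actor_sex"] = "female" if int(info["actor"]) % 2 == 0 else "male"
    # None when the emotion code is not one confirmed by the description.
    info["emotion_label"] = KNOWN_EMOTIONS.get(info["emotion"])
    return info


if __name__ == "__main__":
    print(parse_bser_filename("03-01-02-01-03-03-02.wav"))
```

Keeping the identifiers as zero-padded strings avoids losing the two-digit form when the values are later used to build or match filenames; only the actor number is converted to an integer, for the even/odd check.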