Audio Dataset of Traditional Portuguese Musical Instruments

Published: 2 January 2026 | Version 1 | DOI: 10.17632/yjdfnymgf2.1
Contributors:
Sergio García González

Description

DESCRIPTION: This is a curated, annotated audio dataset designed for Music Information Retrieval (MIR) tasks in low-resource contexts and for the preservation of intangible cultural heritage. It is derived from the initiative ‘A Música Portuguesa a Gostar Dela Própria’, an extensive ethnographic video library dedicated to documenting Portugal's oral and musical traditions. The corpus contains audio clips of seven instruments representative of various Portuguese regions: string instruments (Portuguese Guitar, Viola Braguesa), woodwind instruments (Pífaro, Gaita de Foles, Saxophone) and free-reed instruments (Concertina, Harmonica). Unlike conventional studio-recorded datasets, this corpus consists of field recordings. The samples were captured under uncontrolled acoustic conditions, including natural reverberation, ambient noise and variable microphone distance, which makes the dataset a realistic test bed for evaluating the robustness of deep learning models.

DATASET STATISTICS: The dataset consists of 1,734 audio files (WAV) with the following class distribution:
- Concertina: 419 samples
- Harmonica (Armónica): 407 samples
- Portuguese Bagpipes (Gaita de Foles): 375 samples
- Saxophone (Saxofone): 190 samples
- Portuguese Guitar (Guitarra Portuguesa): 160 samples
- Viola Braguesa: 92 samples
- Pífaro: 91 samples

METADATA STRUCTURE: The accompanying CSV file contains the following columns:
- filename: Name of the audio file.
- target: Numeric class identifier (0-6).
- category: Instrument name (label).
- take: Unique recording session identifier.
(Note: It is highly recommended to use the 'take' field for take-stratified splitting, so that clips from one recording session never appear in more than one set and leak information between them.)
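As an illustration of the metadata structure described above, the following minimal Python sketch checks rows against the four documented columns and the 0-6 range of 'target'. The sample rows, the file names, the take identifiers and the particular number-to-instrument pairing are hypothetical placeholders; the actual mapping between 'target' values and instruments is defined only by the real CSV file.

```python
import csv
import io

# Column names as documented for the metadata CSV.
EXPECTED_COLUMNS = ["filename", "target", "category", "take"]


def validate_metadata(rows):
    """Validate rows (dicts, e.g. from csv.DictReader) against the documented schema.

    Returns the target -> category mapping observed in the rows.
    """
    target_to_category = {}
    for i, row in enumerate(rows):
        missing = [c for c in EXPECTED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row {i}: missing columns {missing}")
        target = int(row["target"])
        if not 0 <= target <= 6:
            raise ValueError(f"row {i}: target {target} outside 0-6")
        # Each numeric target should always carry the same instrument label.
        known = target_to_category.setdefault(target, row["category"])
        if known != row["category"]:
            raise ValueError(f"row {i}: target {target} has conflicting labels")
    return target_to_category


# Hypothetical sample; the real file names, takes and label numbering may differ.
SAMPLE_CSV = """filename,target,category,take
clip_0001.wav,0,Concertina,take_01
clip_0002.wav,0,Concertina,take_01
clip_0003.wav,1,Harmonica,take_02
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
mapping = validate_metadata(rows)
```

To run the same check on the downloaded metadata, replace the in-memory sample with an opened file handle passed to csv.DictReader.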

Steps to reproduce

The dataset was constructed by extracting audio tracks from the 'A Música Portuguesa a Gostar Dela Própria' (MPAGDP) video archive. The tracks were then segmented into 5-second clips and converted to 16 kHz mono for consistency. To use the corpus, researchers must first map the relative paths in the 'filename' column of the metadata CSV to their local audio directory. Reproducing the experiments requires strict adherence to a take-stratified splitting protocol: use the 'take' column to group all segments that belong to the same recording session, and assign each group as a whole to exactly one of the training, validation, or test sets. This step is mandatory because segments from the same field recording share an acoustic environment, so splitting a session across sets would leak information between them.
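The two steps above (resolving paths and splitting by take) can be sketched in Python as follows. This is only one possible implementation under stated assumptions: the 70/15/15 proportions, the seed, the 'audio' directory name and the sample rows are hypothetical and are not specified by the dataset. The sketch keeps each take inside a single set but does not balance classes across sets.

```python
import os
import random
from collections import defaultdict


def split_by_take(rows, audio_root, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign every recording session ('take') as a whole to one split.

    rows: dicts with at least 'filename', 'target' and 'take' keys.
    fractions: assumed train/val/test proportions, applied to takes, not clips.
    Returns a dict of split name -> list of (path, target, take) tuples.
    """
    by_take = defaultdict(list)
    for row in rows:
        by_take[row["take"]].append(row)

    # Shuffle the takes (not the clips) deterministically.
    takes = sorted(by_take)
    random.Random(seed).shuffle(takes)

    n_train = int(fractions[0] * len(takes))
    n_val = int(fractions[1] * len(takes))
    split_takes = {
        "train": takes[:n_train],
        "val": takes[n_train:n_train + n_val],
        "test": takes[n_train + n_val:],
    }

    result = {}
    for name, names in split_takes.items():
        result[name] = [
            # Map the relative path from the CSV to the local audio directory.
            (os.path.join(audio_root, r["filename"]), int(r["target"]), r["take"])
            for t in names
            for r in by_take[t]
        ]
    return result


# Hypothetical rows: 10 takes with 2 clips each.
rows = [
    {
        "filename": f"clip_{i:04d}.wav",
        "target": str((i // 2) % 7),
        "take": f"take_{i // 2:02d}",
    }
    for i in range(20)
]
splits = split_by_take(rows, audio_root="audio")
```

Because the classes are imbalanced (from 419 clips down to 91), it is worth checking after splitting that every instrument still appears in each set, and re-drawing with a different seed if one does not.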

Institutions

Universidad de Salamanca

Categories

Folklore, Information Retrieval, Music Computing, Audio Recording, Deep Learning, Musicology

Licence