A Dataset for Multimodal Music Information Retrieval of Sotho-Tswana Music Videos

Published: 14 June 2024 | Version 1 | DOI: 10.17632/7jmgfk4fd9.1
Osondu Oguike


This is a multimodal dataset for music information retrieval (MIR) in the under-resourced Sotho-Tswana languages. The MIR tasks for which the dataset can be used include multimodal music sentiment analysis, multimodal music genre classification, multimodal language identification in music videos, and recommender systems. Although the primary source of the music videos is YouTube, several processing steps have been applied to the dataset, such as segmentation, separation of the audio modality from the visual modality, and extraction of spectral-based acoustic features, together with enriched manual annotations of the following metadata: language, genre, lyrics, and the meaning of each song. The dataset is well suited to training deep learning models, and the recommended method for combining the features of the audio and visual modalities is decision-level (late) fusion. In addition to the music video clips, the dataset includes various CSV files and Jupyter notebooks.


Steps to reproduce

Repository name: Mendeley Data
Data identification number: 10.17632/7jmgfk4fd9.1
Direct URL to data: https://data.mendeley.com/drafts/7jmgfk4fd9?folder=
Instructions for accessing these data: The raw music videos were downloaded from YouTube, the audio modality was separated from the visual modality, and spectral-based acoustic features were extracted from the audio. The video segments were also split into individual images/frames for training. To use the dataset, follow these steps:
1. Download the raw music videos with a video downloader site such as savefrom.net, or use the Jupyter notebook download.ipnb.
2. Split each downloaded video into equal fifteen-second segments using the Jupyter notebook Split.ipnb.
3. Because the downloaded video clips have different durations, use the Jupyter notebook SplitVideo.ipynb, together with the text file split1.txt, to split each downloaded clip into equal fifteen-second segments. This puts all the video clips on a level playing field.
4. Rename each video segment so that it matches the corresponding name in the VideoSegment.csv file.
5. Store all the segmented video clips in a new folder called Video_Clips.
6. Use the Jupyter notebook separate.jpynb to separate the audio modality of each segmented video clip from the video.
7. Rename each audio segment so that it matches the corresponding name in the AudioSegments.csv file.
8. Store all the audio segments in a new folder called Audio_Clips.
9. Use the Jupyter notebook Video_Images.ipynb to generate the frames/images from the segmented video clips; this also generates the CSV file Video_Images.csv.
10. Store all the generated frames/images in a new folder called Video_Frames.
11. Depending on the MIR task, train appropriate deep learning models on the audio, textual, and visual modalities of the dataset, combining them with the late fusion method.
Note: owing to limited computing resources, you may not be able to use all the segmented video and audio clips when training the deep learning models. If you decide to use only a subset, delete the unused entries from VideoSegment.csv and AudioSegments.csv. On a sufficiently powerful computer system, all the segmented video and audio clips can be used.
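The dataset's download.ipnb notebook is not reproduced here. As a minimal alternative sketch for step 1, the command-line tool yt-dlp (an assumption — any YouTube downloader will do) can fetch the raw videos. The helper below only builds the command; actually running it requires yt-dlp to be installed, and the output template and format flag are illustrative choices, not part of the dataset.

```python
def build_download_cmd(url, out_dir="Raw_Videos"):
    """Build a yt-dlp command that saves one music video as MP4.

    Sketch only: assumes the yt-dlp CLI is installed. The folder name
    and output template are illustrative, not from the dataset.
    """
    return [
        "yt-dlp",
        "-f", "mp4",                           # prefer an MP4 container
        "-o", f"{out_dir}/%(title)s.%(ext)s",  # save under out_dir
        url,
    ]

# To run it: subprocess.run(build_download_cmd(video_url), check=True)
cmd = build_download_cmd("https://www.youtube.com/watch?v=EXAMPLE")
```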
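Steps 2–3 cut each video into equal fifteen-second clips; Split.ipnb and SplitVideo.ipynb do this inside the dataset. A minimal equivalent, assuming the ffmpeg CLI is available, uses ffmpeg's segment muxer. Again the helper only builds the command:

```python
def build_segment_cmd(src, out_pattern, seconds=15):
    """ffmpeg command that cuts `src` into equal `seconds`-long clips.

    Stream-copies (no re-encode); `out_pattern` is a numbered template
    such as "song_%03d.mp4". Assumes the ffmpeg CLI is installed.
    """
    return [
        "ffmpeg", "-i", src,
        "-c", "copy", "-map", "0",     # copy all streams unchanged
        "-f", "segment",               # segment muxer
        "-segment_time", str(seconds), # target clip length
        "-reset_timestamps", "1",      # each clip starts at t=0
        out_pattern,
    ]
```

Note that with `-c copy` ffmpeg can only cut on keyframes, so boundaries are approximate; re-encoding instead gives exact fifteen-second cuts at the cost of speed.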
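Step 6 separates the audio modality from each video clip (the dataset does this with separate.jpynb). A hedged ffmpeg sketch: `-vn` drops the visual stream and the audio is written as WAV, a convenient input for feature extraction. The mono/22050 Hz settings are illustrative choices, not the dataset's.

```python
def build_audio_cmd(video_path, wav_path, sr=22050):
    """ffmpeg command that extracts the audio track of one video clip
    as mono 16-bit PCM WAV. Sample rate `sr` is an illustrative choice;
    assumes the ffmpeg CLI is installed."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                   # drop the visual stream
        "-ac", "1",              # downmix to mono
        "-ar", str(sr),          # resample
        "-acodec", "pcm_s16le",  # 16-bit PCM WAV
        wav_path,
    ]
```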
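The dataset's spectral-based acoustic features are extracted from the audio segments; libraries such as librosa are commonly used for this. As a self-contained illustration of one such feature, the sketch below computes the spectral centroid of a single frame using NumPy only (the frame length and Hann window are illustrative choices):

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Spectral centroid (Hz) of one audio frame: the magnitude-weighted
    mean frequency of its spectrum."""
    windowed = frame * np.hanning(len(frame))      # reduce spectral leakage
    mag = np.abs(np.fft.rfft(windowed))            # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))

# Sanity check: a pure 1 kHz tone should have its centroid near 1 kHz.
sr = 22050
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
```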
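Step 9 generates frames/images from the segmented clips via Video_Images.ipynb. A minimal ffmpeg-based sketch samples frames at a fixed rate (one frame per second here — an assumption; the notebook's actual rate is not stated):

```python
def build_frames_cmd(video_path, out_pattern, fps=1):
    """ffmpeg command that samples `fps` frames per second from a clip
    into numbered images, e.g. out_pattern = "clip_%03d.jpg".
    Assumes the ffmpeg CLI is installed; the sampling rate is illustrative."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",  # video filter: sample fps frames/second
        out_pattern,
    ]
```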
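Step 11 recommends decision-level (late) fusion: each modality's model is trained separately, and their class-probability outputs are combined only at prediction time. A minimal sketch of the combination step (weighted averaging is one common rule; equal weights here are an illustrative default, not the dataset's prescription):

```python
import numpy as np

def late_fusion(prob_audio, prob_visual, prob_text,
                weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine per-modality class-probability vectors by weighted
    averaging and return (winning class index, fused probabilities)."""
    stacked = np.stack([prob_audio, prob_visual, prob_text])
    fused = np.average(stacked, axis=0, weights=weights)
    return int(np.argmax(fused)), fused

# Example: visual and text agree on class 1, outvoting audio's class 0.
idx, fused = late_fusion(
    np.array([0.7, 0.2, 0.1]),   # audio model output
    np.array([0.1, 0.8, 0.1]),   # visual model output
    np.array([0.2, 0.7, 0.1]),   # text (lyrics) model output
)
```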


University of Johannesburg, School of Electrical Engineering; University of Nigeria, Faculty of Engineering


Computer Science, Artificial Intelligence, Natural Language Processing, Genre Analysis, Deep Learning, Sentiment Analysis