Indonesian Stand-Up Comedy Transcription Dataset

Published: 25 June 2025 | Version 1 | DOI: 10.17632/85xgdr7cc7.1
Contributors:
Supriyono Supriyono, Aji Prasetya Wibawa, Suyono Suyono, Fachrul Kurniawan

Description

This dataset contains transcriptions of 3,934 Indonesian stand-up comedy videos sourced from Kompas TV’s YouTube channel. Each entry includes the video title, URL, raw transcript, cleaned transcript, and the number of laughter events. Transcripts were preprocessed by removing timestamps, non-verbal tags (e.g., [Tawa], [Musik]), and formatting inconsistencies to produce NLP-ready text. The dataset consists of over 2.8 million words and 17,394 audience laughter annotations. It enables research in humor detection, sentiment analysis, speech emotion recognition, and cultural discourse analysis. Data are stored in Excel and can be filtered by metadata such as performer, title, and laughter count. This resource is particularly valuable for researchers working with low-resource languages and spoken entertainment content in Indonesian.
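The preprocessing described above could look roughly like the following Python sketch. It is not the authors' pipeline: the timestamp format (`m:ss` or `h:mm:ss`), the regular expressions, and the function names `clean_transcript` and `count_laughter` are assumptions made for illustration. Only the `[Tawa]` and `[Musik]` tags come from the description.

```python
import re

# Assumed timestamp shapes such as "0:01" or "1:02:03"; the real format may differ.
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
# Any bracketed non-verbal tag, e.g. [Tawa] or [Musik].
TAG = re.compile(r"\[[^\]]*\]")
# Laughter tag; matched case-insensitively as a precaution.
LAUGHTER = re.compile(r"\[tawa\]", re.IGNORECASE)


def count_laughter(raw: str) -> int:
    """Count laughter tags in a raw transcript (must run before cleaning)."""
    return len(LAUGHTER.findall(raw))


def clean_transcript(raw: str) -> str:
    """Remove timestamps and bracketed tags, then collapse whitespace."""
    text = TIMESTAMP.sub(" ", raw)
    text = TAG.sub(" ", text)
    return " ".join(text.split())


# Hypothetical raw transcript fragment, not taken from the dataset.
raw = "0:01 selamat malam semuanya [Tawa] 0:05 [Musik] terima kasih [Tawa]"
print(clean_transcript(raw))  # selamat malam semuanya terima kasih
print(count_laughter(raw))    # 2
```

Counting laughter on the raw text rather than the cleaned text matters here, because cleaning removes the very tags being counted.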

Steps to reproduce

1. Visit the Mendeley Data repository at https://data.mendeley.com/datasets/zjdncn6tkv/1
2. Download the Excel file containing the dataset
3. Open the file to access raw and cleaned transcripts, along with laughter annotations and metadata
4. Filter or sort by video title, laughter count, or source file for targeted analysis
5. Use text processing tools (e.g., Python, Excel, R) to further analyze or model the dataset
6. Refer to variable definitions and preprocessing notes included in the first sheet of the file
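The filtering and sorting in steps 4 and 5 might be sketched in Python as follows. This assumes the data sheet has first been exported from Excel to CSV, and the column names `title` and `laughter_count` are hypothetical; the actual headers are defined in the first sheet of the file. The helper names `load_rows` and `top_by_laughter` are likewise illustrative.

```python
import csv
import io


def load_rows(handle):
    """Read CSV rows into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


def top_by_laughter(rows, min_laughs=0, count_col="laughter_count"):
    """Keep rows with at least min_laughs laughter events, most laughter first."""
    kept = [r for r in rows if int(r[count_col]) >= min_laughs]
    return sorted(kept, key=lambda r: int(r[count_col]), reverse=True)


# Made-up sample standing in for the exported sheet; not real dataset values.
sample = io.StringIO("title,laughter_count\nVideo A,3\nVideo B,12\nVideo C,7\n")
rows = load_rows(sample)

for row in top_by_laughter(rows, min_laughs=5):
    print(row["title"], row["laughter_count"])
```

With a real export, `load_rows` would be called on a file opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.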

Institutions

Universitas Negeri Malang

Categories

Artificial Intelligence, Natural Language Processing

Licence