Indonesian Stand-Up Comedy Transcription Dataset
Description
This dataset contains transcriptions of 3,934 Indonesian stand-up comedy videos sourced from Kompas TV’s YouTube channel. Each entry includes the video title, URL, raw transcript, cleaned transcript, and the number of laughter events. Transcripts were preprocessed by removing timestamps, non-verbal tags (e.g., [Tawa], [Musik]), and formatting inconsistencies to produce NLP-ready text. The dataset comprises over 2.8 million words and 17,394 audience laughter annotations. It supports research in humor detection, sentiment analysis, speech emotion recognition, and cultural discourse analysis. The data are stored in an Excel file and can be filtered by metadata such as performer, title, and laughter count. This resource is particularly valuable for researchers working with low-resource languages and with spoken entertainment content in Indonesian.
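The preprocessing described above can be illustrated with a minimal sketch. It is not the authors' actual pipeline: the timestamp format (`0:15` or `01:02:03`), the assumption that every non-verbal tag is enclosed in square brackets, and the function names `clean_transcript` and `count_laughter` are all assumptions made for illustration.

```python
import re

# Assumption: non-verbal tags are bracketed, e.g. [Tawa] or [Musik].
TAG_RE = re.compile(r"\[[^\]]+\]")
# Assumption: laughter is marked with the tag [Tawa], in any letter case.
LAUGH_RE = re.compile(r"\[\s*tawa\s*\]", re.IGNORECASE)
# Assumption: timestamps look like 0:15, 12:03 or 01:02:03.
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")


def count_laughter(raw: str) -> int:
    """Count laughter tags in a raw transcript."""
    return len(LAUGH_RE.findall(raw))


def clean_transcript(raw: str) -> str:
    """Remove timestamps and bracketed tags, then collapse whitespace."""
    text = TIME_RE.sub(" ", raw)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())


# Made-up example line, not taken from the dataset.
raw = "0:01 selamat malam semuanya [Tawa] 0:05 [Musik] terima kasih [Tawa]"
cleaned = clean_transcript(raw)
laughs = count_laughter(raw)
```

Laughter is counted on the raw text because the cleaning step removes the tags it relies on.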
Steps to reproduce
1. Visit the Mendeley Data repository at https://data.mendeley.com/datasets/zjdncn6tkv/1
2. Download the Excel file containing the dataset.
3. Open the file to access the raw and cleaned transcripts, along with laughter annotations and metadata.
4. Filter or sort by video title, laughter count, or source file for targeted analysis.
5. Use text processing tools (e.g., Python, Excel, R) to further analyze or model the dataset.
6. Refer to the variable definitions and preprocessing notes included in the first sheet of the file.
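As a rough illustration of the filtering and analysis steps, the sketch below works on rows represented as plain dictionaries. The column names (`title`, `cleaned_transcript`, `laughter_count`) and the sample rows are hypothetical; check the variable definitions in the first sheet of the file for the real names. One way to obtain such rows, assuming pandas and an Excel reader such as openpyxl are installed, is noted in a comment.

```python
# Rows could be loaded with pandas, for example (file name is hypothetical):
#   import pandas as pd
#   rows = pd.read_excel("dataset.xlsx").to_dict("records")


def filter_by_laughter(rows, min_laughs, key="laughter_count"):
    """Keep rows with at least `min_laughs` laughter events, most laughs first."""
    kept = [r for r in rows if int(r[key]) >= min_laughs]
    return sorted(kept, key=lambda r: int(r[key]), reverse=True)


def word_count(text):
    """Whitespace-delimited word count of a cleaned transcript."""
    return len(text.split())


# Made-up placeholder rows, not actual dataset entries.
rows = [
    {"title": "Video A", "cleaned_transcript": "selamat malam semuanya", "laughter_count": 3},
    {"title": "Video B", "cleaned_transcript": "terima kasih", "laughter_count": 7},
    {"title": "Video C", "cleaned_transcript": "halo", "laughter_count": 1},
]

top = filter_by_laughter(rows, min_laughs=3)
top_titles = [r["title"] for r in top]
total_words = sum(word_count(r["cleaned_transcript"]) for r in rows)
```

Keeping the helpers independent of any spreadsheet library means the same functions apply whether the rows come from pandas, a CSV export, or another reader.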