Indonesian Friday Sermon Transcript Dataset from YouTube and Web Sources

Published: 16 March 2026 | Version 1 | DOI: 10.17632/4z9dxb5j2s.1
Contributors:

Description

This dataset contains 213 Indonesian Friday sermon (khutbah Jumat) transcripts sourced from popular Islamic YouTube channels during 2025-2026, totaling over 2.6 million characters of religious discourse. Each entry pairs a complete verbatim transcript with metadata: sermon title, YouTube URL, view count (ranging from 186 to 484K views), and video duration (18 minutes 43 seconds on average). The Excel file (1.07 MB) covers themes including eschatology (death and the afterlife), practical Islamic ethics (patience, charity, family relations), spiritual development (taqwa, repentance), and contemporary issues (Ramadan preparation, social media). The transcripts were produced by automated speech-to-text and then manually verified. The corpus is suited to NLP research, particularly text summarization (fine-tuning mBART-50 or IndoBERT), topic modeling, sentiment analysis of religious speech, and validation of speech-to-text systems across Indonesian dialects. Licensed under CC BY 4.0, it can serve as a baseline for computational linguistics and Islamic studies, supporting ROUGE benchmarking and pre-training of transformer models on formal religious oratory.
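The headline figures above (entry count, total characters, mean duration) can be rechecked once the rows are loaded. The following is a minimal sketch, not the authors' code. It assumes each row is a dict keyed by the column names Durasi and Transkrip, and that durations are stored as "MM:SS" or "HH:MM:SS" strings; the storage format is an assumption, and the sample rows are made up for illustration.

```python
from statistics import mean


def duration_to_seconds(text: str) -> int:
    """Convert 'MM:SS' or 'HH:MM:SS' (assumed storage format) to seconds."""
    seconds = 0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def seconds_to_clock(total: float) -> str:
    """Render a number of seconds as 'M:SS', rounded to the nearest second."""
    minutes, seconds = divmod(int(round(total)), 60)
    return f"{minutes}:{seconds:02d}"


def summarize(rows):
    """Compute entry count, total transcript characters, and mean duration."""
    durations = [duration_to_seconds(row["Durasi"]) for row in rows]
    return {
        "entries": len(rows),
        "total_characters": sum(len(row["Transkrip"]) for row in rows),
        "mean_duration": seconds_to_clock(mean(durations)),
    }


# Made-up rows for illustration only; they are not taken from the dataset.
sample_rows = [
    {"Durasi": "18:00", "Transkrip": "a" * 10},
    {"Durasi": "19:26", "Transkrip": "b" * 5},
]
summary = summarize(sample_rows)
print(summary)
```

Run against the full 213 rows, the same function would be expected to report a total above 2.6 million characters and a mean duration near 18:43, if the stored values match the description.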

Files

Steps to reproduce

To reproduce this Indonesian Friday Sermon Dataset, first identify popular Islamic YouTube channels using search terms such as "khutbah jumat ustadz 2025" and collect 213 videos from preachers such as Ustadz Khalid Basalamah and Ustadz Sholeh Al Jufri. Use yt-dlp to batch-extract metadata (titles, URLs, view counts, durations) with a command such as yt-dlp --flat-playlist --print "%(title)s:%(duration)s:%(view_count)s:%(url)s", then download the audio for each video. Transcribe the audio with OpenAI Whisper (language="id") or Google Speech Recognition (language='id-ID') to generate verbatim transcripts, preserving natural speech patterns, including Arabic terms and repetitions. Manually spot-check 10% of the entries; in the original collection this check indicated roughly 98% transcription accuracy. Compile the data into a pandas DataFrame, remove duplicates based on URL and title, format view counts in abbreviated form (e.g., "178K"), add sequential numbering (1-213), and export to Excel (.xlsx) with the columns No., Judul Video, URL, Jumlah Viewers, Durasi, and Transkrip. The full process takes 2-3 weeks on hardware with 16 GB of RAM and should yield a dataset of about 1.07 MB with more than 2.6 million characters, ready for NLP analysis. View counts change over time, so a fresh collection will not match the published file exactly.
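The compile-and-clean step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' script. It treats a row as a duplicate if either its URL or its title has already been seen, and it abbreviates view counts by truncating to whole thousands; the source does not specify either rule. The input key view_count and the example.invalid URLs are hypothetical, and the pandas and Excel export steps are left out so the block runs with the standard library alone.

```python
def format_views(count: int) -> str:
    """Abbreviate a view count, e.g. 178432 -> '178K' (truncation is an assumption)."""
    if count < 1000:
        return str(count)
    return f"{count // 1000}K"


def clean_rows(rows):
    """Drop duplicate videos and build numbered rows with the dataset's columns."""
    seen_urls, seen_titles = set(), set()
    cleaned = []
    for row in rows:
        url = row["URL"].strip()
        title_key = row["Judul Video"].strip().casefold()
        # Assumed rule: a repeated URL or a repeated title marks a duplicate.
        if url in seen_urls or title_key in seen_titles:
            continue
        seen_urls.add(url)
        seen_titles.add(title_key)
        cleaned.append({
            "No.": len(cleaned) + 1,  # sequential numbering after deduplication
            "Judul Video": row["Judul Video"].strip(),
            "URL": url,
            "Jumlah Viewers": format_views(row["view_count"]),
            "Durasi": row["Durasi"],
            "Transkrip": row["Transkrip"],
        })
    return cleaned


# Made-up input rows; titles, URLs, and counts are placeholders.
raw_rows = [
    {"Judul Video": "Khutbah A", "URL": "https://example.invalid/a",
     "view_count": 178432, "Durasi": "18:00", "Transkrip": "..."},
    {"Judul Video": "Khutbah A ", "URL": "https://example.invalid/a2",
     "view_count": 90, "Durasi": "18:00", "Transkrip": "..."},
    {"Judul Video": "Khutbah B", "URL": "https://example.invalid/b",
     "view_count": 186, "Durasi": "20:15", "Transkrip": "..."},
]
cleaned_rows = clean_rows(raw_rows)
for row in cleaned_rows:
    print(row["No."], row["Judul Video"], row["Jumlah Viewers"])
```

Numbering is assigned after deduplication so that the No. column stays contiguous. A list of dicts in this shape can be passed directly to pandas.DataFrame for export to .xlsx.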

Categories

Transcription, Natural Language Processing

Licence

CC BY 4.0