Pro Kabaddi League Dataset (PKL)
Description
This dataset contains 32,341 viewer comments collected from two YouTube channels associated with the Pro Kabaddi League (PKL), the official Pro Kabaddi League channel and the Star Sports channel, spanning six consecutive seasons (Seasons 7 through 12, approximately 2019 to 2024). The PKL, established in 2014, is one of India's most prominent professional sports leagues and is built around kabaddi, a traditional indigenous contact sport. Comments were gathered from a range of video content, including live match broadcasts, recorded replays, highlight clips, short-form videos, and post-match press conferences.

Raw comments, including those written in Hindi, Tamil, Telugu, Marathi, and other Indian regional languages, were translated into English using the OpenAI API and then cleaned: entries of five words or fewer were removed, and emojis and special symbols were stripped. Each comment was then classified into one of three quality tiers: Rich (detailed, player- or tactic-specific content), Moderate (some context but limited depth), or Poor (generic or off-topic).

The dataset is structured as a single CSV file with two tabs, one per channel. Each row contains a serial number, season label, timestamp, video URL, video type, like count, the original comment, the preprocessed version, and the classification label. The dataset is suitable for research in natural language processing, sentiment analysis, sports fan engagement, digital media studies, sports marketing, and multilingual text analytics. It is the first publicly available annotated comment corpus from the Pro Kabaddi League.
Files
Steps to reproduce
**Data Collection, Processing and Classification**

*Sources*
Comments were collected from two publicly accessible YouTube channels: the official Pro Kabaddi League channel (https://www.youtube.com/@ProKabaddi) and the Star Sports channel (https://www.youtube.com/@starsports), covering Seasons 7 through 12 (approximately 2019–2024). Content types included live match broadcasts, recorded replays, highlight clips, short-form videos, and post-match press conferences.

*Collection*
A Python script using the YouTube Data API v3 extracted comment threads from all PKL-related videos on both channels. For each comment, the following were recorded: comment text, posting timestamp, video URL, video type, and like count. The scraper paged through all available comments to ensure complete coverage.

*Preprocessing*
Comments in Hindi, Tamil, Telugu, Marathi, and other regional languages were translated into English using the OpenAI API (GPT-4). Additional cleaning steps included removal of comments with five words or fewer, stripping of emojis and special symbols, and whitespace normalisation. Both the original and cleaned versions of each comment were retained in the dataset.

*Classification*
Each comment was labelled as Rich, Moderate, or Poor based on its depth and contextual relevance. A semi-automated process was used: the research team manually labelled a reference subset, which then guided AI-assisted classification via the OpenAI API. Outputs were reviewed and corrected iteratively by the authors until consistency was achieved across all seasons and both channels.
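The cleaning steps listed under *Preprocessing* can be illustrated with a minimal Python sketch. It is not the authors' script. The function names are hypothetical, the set of characters treated as "special symbols" (anything other than ASCII letters, digits, whitespace, and basic punctuation) is an assumption, and so is the choice to count words after symbols have been stripped. The sketch assumes the comment has already been translated into English.

```python
import re

# Comments with five words or fewer are dropped, so six is the minimum kept.
MIN_WORDS = 6

# Assumption: "special symbols" means anything outside ASCII letters, digits,
# whitespace, and a small set of punctuation marks. This also removes emojis.
_SYMBOLS = re.compile(r"[^A-Za-z0-9\s.,!?'-]")
_WHITESPACE = re.compile(r"\s+")


def clean_comment(text):
    """Strip emojis/special symbols and normalise whitespace."""
    without_symbols = _SYMBOLS.sub(" ", text)
    return _WHITESPACE.sub(" ", without_symbols).strip()


def preprocess(comments):
    """Clean each comment, then drop those with five words or fewer.

    Assumption: the word count is taken on the cleaned text.
    """
    cleaned = (clean_comment(comment) for comment in comments)
    return [comment for comment in cleaned if len(comment.split()) >= MIN_WORDS]
```

For example, `preprocess` would drop a two-word comment such as "Great raid" while keeping an eight-word comment about a defensive play, with any emojis removed from the kept text.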
*Tools Used*
- Python 3.x: scraping and preprocessing
- YouTube Data API v3: comment extraction
- OpenAI API (GPT-4): translation and classification
- CSV / Microsoft Excel: data storage

*Output*
The final dataset contains 32,341 comments in a single CSV file with two tabs, PKL channel (26,863 comments) and Star Sports channel (5,478 comments), each with nine fields: S.No, Season, Timestamp, URL, Video Type, Likes, Original Text, Preprocessed Text, and Classification Label.
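A quick way to sanity-check one tab of the dataset after download is to confirm the nine fields and the three classification labels. The sketch below is illustrative only: it assumes a tab has been exported as plain CSV text, that the header names match the field names listed above exactly, and that labels are stored as the literal strings Rich, Moderate, and Poor. The function name `validate_rows` is hypothetical.

```python
import csv

# Assumption: header names match the nine field names in the description exactly.
EXPECTED_FIELDS = [
    "S.No", "Season", "Timestamp", "URL", "Video Type",
    "Likes", "Original Text", "Preprocessed Text", "Classification Label",
]
# Assumption: labels are stored as these literal strings.
VALID_LABELS = {"Rich", "Moderate", "Poor"}


def validate_rows(handle):
    """Read CSV rows from a text handle and check the header and labels."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_FIELDS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(rows, start=2):
        label = row["Classification Label"]
        if label not in VALID_LABELS:
            raise ValueError(f"line {line_no}: unexpected label {label!r}")
    return rows
```

Typical usage would be `rows = validate_rows(open("pkl_channel.csv", newline="", encoding="utf-8"))`, where the file name is a placeholder. If the totals stated above hold, the PKL channel tab should yield 26,863 rows and the Star Sports tab 5,478, which together make 32,341.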
Institutions
- Great Lakes Institute of Management, Tamil Nadu, Chennai