PolCSBD: Political Counter Speech BD

Published: 27 April 2026 | Version 2 | DOI: 10.17632/ddvzpjkws7.2
Contributors:
Chowdhury Mohammad Mehedi

Description

This dataset (PolCSBD) was developed to address a gap in natural language processing: the detection of political counter-speech in low-resource, code-mixed languages. Our foundational hypothesis was that counter-speech cannot be accurately classified from a single comment in isolation; it requires the context of the preceding statement. We further hypothesized that social media users in Bangladesh make heavy use of "Banglish" (Bengali vocabulary written phonetically in the Latin alphabet) alongside native Bengali script, and that this mix is a major barrier for standard text classification models.

The dataset provides over 10,000 contextual pairs of social media text extracted from political discussions. Each row is structured as a direct exchange, containing a "parent_text" (the initial statement) and a "reply_text" (the direct response). The data reflects the linguistic reality of the region, featuring native Bengali script, fully Romanized Bengali, and hybrid sentences. It captures how internet users employ historical references, aggressive debate tactics, and sarcasm to challenge political narratives.

How to interpret and use the data:

The dataset is provided as a preprocessed, machine-learning-ready CSV, suitable for researchers who want to train, fine-tune, or benchmark Transformer models (such as mBERT, XLM-RoBERTa, or BanglaBERT). It contains exactly three columns:

parent_text: The contextual baseline statement, preprocessed to remove noise.
reply_text: The responding statement, preprocessed in the same way.
label: A binary integer class. A value of '1' indicates Counter-Speech (the reply actively disputes, corrects, or challenges the parent text with a counter-narrative). A value of '0' indicates Non-Counter Speech (the reply simply agrees, adds unrelated noise, or resorts to isolated insults without addressing the argument).

Because the text has already been normalized (noise removal and lowercasing), practitioners can feed this CSV directly into tokenizers and neural networks without building a data-cleaning pipeline from scratch.
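As a rough illustration of the three-column layout described above, the following sketch loads and validates rows using only the Python standard library. The function name `load_pairs` and the in-memory sample rows are hypothetical placeholders, not part of the dataset; in practice the file handle would come from opening the downloaded CSV, whose filename is not stated here.

```python
import csv
import io
from collections import Counter

# Column names as documented in the description above.
EXPECTED_COLUMNS = {"parent_text", "reply_text", "label"}


def load_pairs(handle):
    """Read (parent_text, reply_text, label) rows from an open CSV handle."""
    reader = csv.DictReader(handle)
    if set(reader.fieldnames or []) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for row in reader:
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label}")
        rows.append((row["parent_text"], row["reply_text"], label))
    return rows


# Placeholder rows for illustration only; these are NOT taken from PolCSBD.
sample = io.StringIO(
    "parent_text,reply_text,label\n"
    "placeholder parent a,placeholder reply a,1\n"
    "placeholder parent b,placeholder reply b,0\n"
)

rows = load_pairs(sample)
label_counts = Counter(label for _, _, label in rows)
print(len(rows), dict(label_counts))
```

Because each example is a pair, a sentence-pair encoding (parent as the first segment, reply as the second) is one natural way to pass the rows to a Transformer tokenizer, though the description does not prescribe a specific input format.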

Files

Steps to reproduce

The original text data was collected from the comment sections of popular Bangladeshi political YouTube channels, talk shows, and news networks. We targeted videos that discussed national elections, government policies, and political parties, as these environments naturally produce high volumes of polarized debate and counter-narratives.

Extraction and Structuring Workflow:
We used the YouTube Data API v3 to pull the initial raw text. Using a Python script, we extracted top-level parent comments and iteratively matched them with their corresponding nested replies. This pairing was a strict requirement to maintain the conversational context necessary for counter-speech analysis.

Preprocessing and Normalization Protocol:
To make the dataset immediately usable for deep learning applications, we applied a text-cleaning workflow built on Python regular expressions (RegEx).

Anonymization & Noise Reduction: We stripped out all user mentions (tags starting with '@') to maintain privacy, and deleted standard hyperlinks (http/www).
Emoji Removal: Emojis were removed so that classification models rely entirely on the textual and logical arguments rather than visual sentiment cues.
Text Normalization: We lowercased all English/Romanized characters. This step is crucial for code-mixed data, as it prevents tokenizers from treating words like "Bhai" and "bhai" as two different vocabulary items. The Bengali Unicode block (\u0980-\u09FF) and basic punctuation were preserved to maintain semantic structure and sentence flow.

Annotation Protocol:
The labeling was conducted manually by native Bengali speakers who are familiar with local political history and modern internet slang. Annotators read the parent text to establish context, then evaluated the reply. Replies that offered opposing logical arguments, factual corrections, or direct narrative challenges were marked as '1' (Counter). Replies that showed basic agreement, off-topic remarks, or non-argumentative hate speech were marked as '0' (Non_Counter).

Instruments and Software:
Data Extraction: YouTube Data API v3
Programming Language: Python 3.x
Data Wrangling: Pandas, NumPy, RegEx
Format: CSV (Comma Separated Values)
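The pairing and normalization steps described above might look roughly like the sketch below. It is not the authors' script: the function names, the exact regular expressions, and the punctuation whitelist are all assumptions made for illustration. The pairing function expects a dictionary shaped like a YouTube Data API v3 `commentThreads.list` response requested with the `snippet` and `replies` parts; no network call is made here. Emoji removal is approximated by whitelisting permitted characters rather than by enumerating emoji ranges.

```python
import re

# Assumed patterns; the original cleaning expressions are not published here.
MENTION_RE = re.compile(r"@\S+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
# Keep: lowercase Latin letters, digits, the Bengali block (U+0980-U+09FF),
# the danda (U+0964), zero-width (non-)joiners, whitespace, basic punctuation.
DISALLOWED_RE = re.compile(
    r"[^a-z0-9\u0980-\u09FF\u0964\u200c\u200d\s.,!?;:'\"()\-]"
)


def extract_pairs(response):
    """Pair each top-level comment with every reply listed in its thread."""
    pairs = []
    for thread in response.get("items", []):
        parent = thread["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for reply in thread.get("replies", {}).get("comments", []):
            pairs.append((parent, reply["snippet"]["textDisplay"]))
    return pairs


def normalize(text):
    """Strip mentions and links, lowercase, then drop non-whitelisted symbols."""
    text = MENTION_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = text.lower()  # affects Latin letters only; Bengali has no case
    text = DISALLOWED_RE.sub(" ", text)  # removes emojis and other symbols
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical response fragment, for illustration only.
fake_response = {
    "items": [
        {
            "snippet": {"topLevelComment": {"snippet": {"textDisplay": "Parent A"}}},
            "replies": {
                "comments": [
                    {"snippet": {"textDisplay": "@someone Reply 1"}},
                    {"snippet": {"textDisplay": "Reply 2 www.example.com"}},
                ]
            },
        },
        # A thread without replies yields no pairs.
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "Parent B"}}}},
    ]
}

pairs = [(normalize(p), normalize(r)) for p, r in extract_pairs(fake_response)]
print(pairs)
```

Two caveats apply to this sketch. A thread in a `commentThreads.list` response may carry only a subset of its replies, so a complete crawl would likely need additional requests per thread. The mention pattern also stops at the first whitespace, so display names containing spaces would be only partly removed.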

Institutions

Categories

Linguistics, Computer Science, Natural Language Processing

Licence