BanglaSarc3: A Benchmark Dataset for Bangla Sarcasm Detection from Social Media to Advance Bangla NLP
Description
The BanglaSarc3 dataset is a benchmark resource for sarcasm classification in Bangla with near-balanced class representation. Its primary objective is to reduce the misinterpretation of humor that often leads to conflicts and misunderstandings in online communication. To improve dataset quality, preprocessing steps such as anonymization, duplicate removal, and text normalization were applied. In addition, three native Bangla speakers independently reviewed and validated the labels to support annotation reliability.

BanglaSarc3 is a ternary-class dataset containing 12,089 Facebook comments, categorized as follows:
- Neutral: 4,056 comments
- Sarcastic: 4,012 comments
- Non-Sarcastic: 4,021 comments

The BanglaSarc3 dataset has implications across multiple NLP and AI domains, including:
1. Sarcasm Detection in Bangla Social Media
2. Sentiment and Emotion Analysis
3. Language Modeling and BNLP Advancements
4. Explainable AI (XAI) in Bangla NLP
5. Educational and Research Applications

The BanglaSarc3 dataset is openly available for academic and research purposes, to foster collaboration and innovation within the Bangla NLP community. By providing a foundation for sarcasm classification, it aims to drive advances in Bangla-centric AI applications and support more inclusive, context-aware language models.
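The preprocessing steps named in the description (anonymization, duplicate removal, text normalization) are not specified in detail. The following is a minimal sketch of what such a pipeline could look like, not the authors' actual procedure. The placeholder tokens `<URL>` and `<USER>`, the regular expressions, and the function names `normalize` and `preprocess` are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumed patterns: the dataset's real anonymization rules are not documented here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\S+")
WHITESPACE_RE = re.compile(r"\s+")


def normalize(text):
    """Normalize one comment: Unicode NFC, mask URLs and mentions, collapse spaces."""
    text = unicodedata.normalize("NFC", text)   # unify equivalent Bangla code point sequences
    text = URL_RE.sub("<URL>", text)            # hypothetical placeholder for links
    text = MENTION_RE.sub("<USER>", text)       # hypothetical placeholder for user mentions
    return WHITESPACE_RE.sub(" ", text).strip()


def preprocess(comments):
    """Return normalized comments with empty strings and exact duplicates removed."""
    seen = set()
    cleaned = []
    for comment in comments:
        normalized = normalize(comment)
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


if __name__ == "__main__":
    sample = [
        "ভালো  লাগলো http://example.com/a",
        "ভালো লাগলো http://example.com/b",
        "@someone ধন্যবাদ",
        "   ",
    ]
    for line in preprocess(sample):
        print(line)
```

Deduplicating after normalization means that comments differing only in spacing or in the specific link they contain are treated as the same text; whether that is desirable depends on the intended use.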
Files
Steps to reproduce
1. Source: Extract Bangla comments from public Facebook pages, groups, and posts related to humor, news, entertainment, and general discussions.
2. Scraping Method: Use the Facebook Graph API or third-party web scraping tools (e.g., Selenium, BeautifulSoup) while adhering to ethical guidelines and privacy policies.
3. Data Format: Store the collected text in a structured format (e.g., CSV or JSON) with metadata including comment ID, timestamp, and post reference.
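As an illustration of the data format step, the sketch below writes and reads comment records as CSV using Python's standard `csv` module. The column names `comment_id`, `timestamp`, `post_ref`, and `text` are assumptions chosen to match the metadata listed above; the actual column names in the released files may differ.

```python
import csv
import io

# Hypothetical column names based on the metadata described in the steps above.
FIELDS = ["comment_id", "timestamp", "post_ref", "text"]


def write_comments(records, file_handle):
    """Write comment records (dicts) as CSV with a header row."""
    writer = csv.DictWriter(file_handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow({field: record[field] for field in FIELDS})


def read_comments(file_handle):
    """Read comment records back as a list of dicts keyed by column name."""
    return list(csv.DictReader(file_handle))


if __name__ == "__main__":
    records = [
        {
            "comment_id": "c1",
            "timestamp": "2024-01-01T12:00:00",
            "post_ref": "p1",
            "text": "উদাহরণ, মন্তব্য",
        }
    ]
    buffer = io.StringIO(newline="")
    write_comments(records, buffer)
    buffer.seek(0)
    print(read_comments(buffer))
```

When writing to a real file instead of an in-memory buffer, opening it with `open(path, "w", newline="", encoding="utf-8")` keeps Bangla text intact and avoids extra blank lines on some platforms.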