A Multimodal Bangla Meme Dataset for Hate Speech, Sentiment, and Sarcasm Detection with Text–Image Fusion and Lexicon Annotations

Published: 2 February 2026 | Version 1 | DOI: 10.17632/d6t8nbkj96.1

Description

Bangla Multimodal Meme Dataset for Hate Speech, Sarcasm, and Offensive Content Detection

This dataset consists of 5,126 Bangla memes annotated for multiple offensive and contextual attributes: hate speech, sarcasm, vulgarity, violence, humor, and category. It is intended to support multimodal NLP research by combining OCR-extracted Bangla text, image metadata, perceptual image fingerprints (pHash), and lexicon-based linguistic features.

Due to copyright restrictions, the original meme images are not distributed. Instead, the dataset provides:

- OCR-extracted Bangla text from each meme
- English translations
- Perceptual hash (pHash) as a unique image fingerprint
- Image metadata (width and height)
- Manual annotations for hate speech, sarcasm, vulgarity, violence, humor, and category
- A curated Bangla offensive lexicon for auxiliary feature extraction

Researchers can retrieve the original memes by searching the web for the OCR text and can verify exact matches against the provided pHash values. This supports reproducibility while keeping the release copyright-safe.

The dataset was annotated by three independent annotators following a shared guideline. Annotation reliability was assessed on a stratified subset of 400 memes using Fleiss’ kappa, which showed substantial to near-perfect agreement across labels.

The dataset also includes a labeled Bangla offensive lexicon of 441 terms categorized as vulgar, insult, violent, and hate-associated words. These lexicon features provide complementary linguistic signals for multimodal fusion experiments.

This dataset is suitable for research in:

- Hate speech detection in Bangla memes
- Sarcasm and humor analysis
- Offensive language detection
- Multimodal text–image fusion models
- Low-resource Bangla NLP research

The dataset is released for research and academic use only.
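The pHash verification step mentioned above could look roughly like the sketch below. It assumes the pHash values are stored as hexadecimal strings of equal length (for example the 16-character strings that the Python `imagehash` package produces for a 64-bit hash); the description does not state the storage format or the tool used, so treat both as assumptions. The function names and the sample hash values are hypothetical and are not taken from the dataset.

```python
def hamming_distance_hex(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 in every bit position where the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_same_image(candidate_hash: str, reference_hash: str, max_distance: int = 0) -> bool:
    """Return True when the hashes differ by at most max_distance bits.

    max_distance=0 demands an exact match. A small positive tolerance is an
    assumption for candidates that were resized or re-compressed.
    """
    return hamming_distance_hex(candidate_hash, reference_hash) <= max_distance


if __name__ == "__main__":
    reference = "ffd8b0a09080c0e0"  # made-up value standing in for a dataset pHash
    candidate = "ffd8b0a09080c0e1"  # made-up value differing in a single bit
    print(hamming_distance_hex(candidate, reference))          # 1
    print(is_same_image(candidate, reference))                  # False
    print(is_same_image(candidate, reference, max_distance=4))  # True
```

The hash of a retrieved image would have to be computed with the same pHash settings that were used to build the dataset; a copy that has been cropped or captioned differently may fail the check even with a tolerance.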

Files

Steps to reproduce

1. Meme images containing Bangla text and meme-style visual content were collected from publicly available social media sources.
2. All images were processed with EasyOCR to extract the Bangla text.
3. The extracted OCR text was manually cleaned and translated into English.
4. Each meme was annotated by three independent annotators for hate speech, sarcasm, vulgar, violent, funny, and category, using a shared annotation guideline. Final labels were assigned by majority voting.
5. To assess annotation reliability, a stratified subset of 400 memes was re-annotated by all annotators and Fleiss’ kappa was computed.
6. Perceptual hash (pHash) and image metadata (width, height) were extracted from each image to allow future verification without redistributing copyrighted images.
7. A curated Bangla offensive lexicon was used to extract lexicon-based linguistic features from the OCR text.
8. The final dataset was organized into CSV files with metadata, annotations, lexicon features, and documentation to ensure reproducibility.

Researchers can reconstruct the memes by searching for the OCR text online and verifying each image against the provided pHash value.
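The majority-voting and Fleiss’ kappa steps described above can be illustrated with a minimal, standard-library sketch. It is not the authors' script: the function names, the tie-handling rule, and the toy ratings are assumptions made for illustration, and the data layout (one list of rater labels per meme) is hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by most annotators, or None on a tie."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no majority; tie handling is not described in the source
    return ranked[0][0]


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list with one label per rater.

    Every item must carry the same number of ratings (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Per-item agreement: P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement: P_e = sum_j p_j^2, where p_j is the share of category j
    total = n_items * n_raters
    expected = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )

    if expected == 1.0:
        # Only one category was ever used, so kappa is formally undefined;
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy binary ratings from three raters on four items (not dataset values).
    toy = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
    print([majority_vote(item) for item in toy])  # [1, 0, 1, 0]
    print(round(fleiss_kappa(toy), 3))            # 0.333
```

With three annotators a binary label always has a majority, but a multi-class label such as category can split three ways, which is why the sketch returns `None` rather than picking a winner in that case.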

Institutions

Categories

Linguistics, Artificial Intelligence, Computational Linguistics, Data Mining, Natural Language Processing, Machine Learning, Pattern Recognition, Text Mining, Meme, Multimodal Learning

Licence