A Bilingual Malay-English Social Media Dataset for Binary Hate Speech Detection
Description
This dataset consists of 26,985 bilingual Malay-English social media posts curated for binary hate speech detection. The data are collected and processed from five publicly available sources: HateXplain (English), HateM (Malay), Toxicity-Small (Malay), Snapshot-Twitter-2022 (Malay), and Supervised-Twitter (Malay). HateXplain, HateM, and Toxicity-Small contain manually annotated labels, while Snapshot-Twitter-2022 and Supervised-Twitter are pseudo-labeled using a custom pipeline. Each entry includes the cleaned social media text, a binary hate speech label (0 = non-hate, 1 = hate), a language code (en for English, ms for Malay), and the original source. The dataset is approximately balanced across classes, with 14,642 non-hate entries (about 54%) and 12,343 hate-labeled entries (about 46%). By language, it contains 13,609 English and 13,376 Malay texts.

Preprocessing is conducted separately for each language. Malay texts undergo spelling correction, slang normalisation, placeholder standardisation (<user>, <number>), and translation filtering using Malaya NLP tools and a custom-built dictionary (malayslangdict.py in the slang_dictionary_custom folder). English texts are processed with Ekphrasis to handle social media-specific patterns, emojis, hashtags, and slang. Low-quality and duplicate entries are removed. The dataset is stored in CSV format (UTF-8) and is suitable for training and evaluating multilingual and low-resource hate speech detection models.
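As a quick sanity check on the fields described above, the following minimal sketch counts labels and language codes and rejects values outside the documented sets. The column headers text, label, lang, and source are assumptions (the description names the fields but not their exact headers), and the three sample rows are made up for illustration. The sketch uses only the standard library so that it runs without extra dependencies.

```python
import csv
import io
from collections import Counter

# Column names are assumptions; the description lists the fields
# but does not state the exact CSV headers.
ASSUMED_COLUMNS = ("text", "label", "lang", "source")


def summarise(rows):
    """Count labels and language codes, rejecting undocumented values."""
    labels, langs = Counter(), Counter()
    for row in rows:
        label, lang = row["label"], row["lang"]
        if label not in {"0", "1"}:
            raise ValueError(f"unexpected label: {label!r}")
        if lang not in {"en", "ms"}:
            raise ValueError(f"unexpected language code: {lang!r}")
        labels[label] += 1
        langs[lang] += 1
    return labels, langs


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "text,label,lang,source\n"
    "<user> have a nice day,0,en,HateXplain\n"
    "contoh teks biasa,0,ms,HateM\n"
    "contoh teks kasar,1,ms,Toxicity-Small\n"
)
labels, langs = summarise(csv.DictReader(sample))
print(dict(labels), dict(langs))
```

To run the same check on the released file, open it with encoding="utf-8" and newline="" and pass csv.DictReader(f) to summarise. If the assumed headers are right, the counts should come out as 14,642 / 12,343 for the labels and 13,609 / 13,376 for the languages.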
Files
Steps to reproduce
The dataset was built through a structured workflow involving data acquisition, language-aware preprocessing, pseudo-labeling, and balancing. Five datasets, which include both manually labeled and unlabeled sources, are retrieved from GitHub and HuggingFace. Manually annotated samples from HateXplain, HateM, and Toxicity-Small are kept as-is. For Snapshot-Twitter-2022 and Supervised-Twitter, a pseudo-labeling pipeline is applied using Malaya’s T5-based sentiment and emotion models. Texts scoring ≥0.85 for either negative sentiment or anger, or containing predefined toxic keywords, are labeled as hate (1). Texts with sentiment and emotion scores ≤0.3 are labeled as non-hate (0). Only confident non-hate pseudo-labeled texts are kept.

Malay texts are preprocessed using Malaya NLP tools with added slang correction (custom malayslangdict), spelling normalisation, inline English-to-Malay translation (custom engmalaydict), and placeholder formatting (<user>, <number>). English texts are cleaned with Ekphrasis, which normalises Twitter-specific symbols, slang, emojis, URLs, and punctuation. All texts are filtered for minimum length, cleaned of redundant placeholders, and deduplicated.

After preprocessing, the final dataset is manually verified for label consistency and language accuracy. Hate and non-hate texts are selected to match English class distributions, and the data are merged into a single file named bilingual_hatespeech_ms_en.csv. All steps are executed using Python 3.11 with the pandas, regex, tqdm, and Malaya libraries.
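The pseudo-labeling thresholds described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the authors' pipeline: the scores are taken as plain floats (in practice they would come from the Malaya sentiment and emotion models, which are not called here), the keyword list is a placeholder because the actual list is not given, the non-hate rule is read as requiring both scores to be ≤0.3, and texts that match neither rule are returned as None on the assumption that they are discarded.

```python
from typing import Optional

HATE_THRESHOLD = 0.85      # stated in the description
NON_HATE_THRESHOLD = 0.30  # stated in the description

# Placeholder entries; the real predefined keyword list is not published here.
TOXIC_KEYWORDS = frozenset({"toxicword1", "toxicword2"})


def pseudo_label(
    text: str,
    negative_score: float,
    anger_score: float,
    keywords: frozenset = TOXIC_KEYWORDS,
) -> Optional[int]:
    """Return 1 (hate), 0 (non-hate), or None when neither rule applies."""
    tokens = set(text.lower().split())
    has_keyword = bool(tokens & keywords)

    # Hate: a high negative-sentiment or anger score, or a toxic keyword.
    if (
        negative_score >= HATE_THRESHOLD
        or anger_score >= HATE_THRESHOLD
        or has_keyword
    ):
        return 1

    # Non-hate: both scores low (assumed reading of the <=0.3 rule).
    if negative_score <= NON_HATE_THRESHOLD and anger_score <= NON_HATE_THRESHOLD:
        return 0

    # Uncertain middle band: no label assigned.
    return None
```

For example, pseudo_label("teks", 0.9, 0.1) falls under the hate rule, pseudo_label("teks", 0.1, 0.2) under the non-hate rule, and pseudo_label("teks", 0.5, 0.5) is left unlabeled.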