Bangalabarta : A Spam / Smishing SMS Dataset Bangla

Published: 18 February 2025 | Version 2 | DOI: 10.17632/jfkfbw3gzh.2
Contributors:
Md Farhan Shahriyar, Gazi Tanbhir

Description

BangalaBarta is a diverse dataset for detecting and classifying spam and smishing (phishing via SMS) messages in Bangla. It contains 2772 SMS messages in three classes: Smishing, Promotional, and Normal SMS. The messages cover a wide range of text types sent over Bangladeshi telecommunication networks, including prominent operators such as Grameenphone, Banglalink, and Robi, among others. The dataset was curated to offer a representative sample of the SMS messages commonly exchanged among users in Bangladesh, which makes it particularly useful for training and evaluating machine learning models for spam and smishing detection.

The Smishing class contains messages designed to deceive users into revealing sensitive information. The Promotional class contains marketing messages from various businesses. The Normal SMS class contains everyday communication between users that is neither malicious nor promotional.

Key Features:
Total messages: 2772
Classes: Smishing, Promotional, Normal SMS
Language: Bangla (Bengali)
Telecom networks covered: Grameenphone, Banglalink, Robi, and other major telecom services
Use cases: spam detection, smishing identification, language-based classification models
Format: structured (e.g., CSV, JSON), with a clear class label for each message

Potential Applications:
Spam Detection: Separating unwanted marketing messages from legitimate user communications.
Smishing Detection: Classifying fraudulent SMS that attempt to steal personal or financial information.
Language Processing: Supporting the development of Bangla language models for text classification.
Telecom Security: Helping telecom service providers identify and block malicious SMS traffic.

This dataset is ideal for researchers and practitioners working on Bangla language processing, telecom security, and natural language processing (NLP), particularly in contexts where identifying harmful SMS is crucial for ensuring user safety and maintaining secure mobile communication networks.
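
As a first step with a labeled CSV export of this kind, it can help to load the messages and check the class balance before training anything. The sketch below is illustrative only: the column names (`text`, `label`), the exact label strings, and the sample rows are assumptions invented for the example, not taken from the dataset files, so they should be adjusted to the real header and labels.

```python
import csv
import io
from collections import Counter

# Label strings are an assumption; the description names the classes
# Smishing, Promotional, and Normal SMS, but the files may spell them differently.
EXPECTED_LABELS = {"Smishing", "Promotional", "Normal"}


def load_messages(handle, text_col="text", label_col="label"):
    """Read (text, label) pairs from an open CSV file handle.

    The column names are hypothetical defaults; pass the real ones if they differ.
    Rows with an empty text or label are skipped.
    """
    rows = []
    for row in csv.DictReader(handle):
        text = (row.get(text_col) or "").strip()
        label = (row.get(label_col) or "").strip()
        if text and label:
            rows.append((text, label))
    return rows


def class_distribution(rows):
    """Count how many messages fall into each class."""
    return Counter(label for _, label in rows)


def unexpected_labels(rows, expected=EXPECTED_LABELS):
    """Return any label values that are not in the expected set."""
    return {label for _, label in rows} - set(expected)


# Made-up sample rows for illustration; these are not messages from the dataset.
SAMPLE_CSV = """text,label
আপনার অ্যাকাউন্ট বন্ধ হবে। এখনই পিন পাঠান,Smishing
ঈদ অফার! সব পণ্যে ২০% ছাড়,Promotional
আমি বাসায় পৌঁছে গেছি,Normal
আজ সন্ধ্যায় দেখা হবে,Normal
"""

if __name__ == "__main__":
    rows = load_messages(io.StringIO(SAMPLE_CSV))
    print(class_distribution(rows))
    print(unexpected_labels(rows))
```

For a real file, the same functions could be called with `open(path, encoding="utf-8", newline="")` in place of the in-memory sample, where `path` points to the downloaded CSV. Checking the distribution first matters because an imbalanced split across the three classes would affect how a classifier should be evaluated.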

Categories

Computer Science, Cybersecurity, Network Security, Cyber Attack

Licence