Arabic Scam and Legitimate Call Conversation Dataset (ASLC-448)

Published: 13 April 2026 | Version 2 | DOI: 10.17632/p384bgyzz3.2
Contributor:
Mohammed Tawfik

Description

This dataset presents a novel multi-dialect Arabic scam and legitimate telephone call conversation corpus designed for training and evaluating scam detection models. The dataset addresses a critical gap in Arabic-language fraud detection research, where no publicly available scam call datasets currently exist.

The dataset contains 448 annotated conversations covering nine Arabic dialects: Modern Standard Arabic (MSA), Egyptian, Gulf, Jordanian, Saudi, Yemeni, Sudanese, Iraqi, and Syrian. Each conversation simulates a realistic telephone interaction structured as a multi-turn dialogue between a caller and a receiver over five utterance turns (three caller turns and two receiver turns).

## Data Structure

The Excel file contains 18 columns per conversation:

| Column | Type | Description |
|--------|------|-------------|
| conversation_id | String | Unique identifier (CONV_0001 to CONV_0448) |
| full_conversation | String | Complete conversation text with speaker labels |
| caller_turn_1 | String | First caller utterance |
| receiver_turn_1 | String | First receiver response |
| caller_turn_2 | String | Second caller utterance |
| receiver_turn_2 | String | Second receiver response |
| caller_turn_3 | String | Third caller utterance |
| label | String | Binary class label: scam or not_scam |
| category | String | Fine-grained category (23 categories) |
| dialect | String | Arabic dialect (9 dialects) |
| urgency_score | Integer | Time pressure intensity (0–5) |
| sensitive_info_requests | Integer | Confidential data solicitation (0–2) |
| financial_pressure_score | Integer | Monetary demands intensity (0–5) |
| threat_score | Integer | Threat/intimidation level (0–3) |
| impersonation_score | Integer | Identity deception level (0–2) |
| conversation_length | Integer | Total characters in conversation |
| word_count | Integer | Total words in conversation |
| label_binary | Integer | Binary encoding: 1 = scam, 0 = not_scam |

| File | Description |
|------|-------------|
| arabic_scam_dataset_complete.xlsx | Complete text dataset with 448 conversations, labels, categories, dialects, risk scores, and metadata (18 columns) |
| audio_dataset/scam/*.wav | Synthesized audio files for scam conversations (16 kHz, mono, WAV) |
| audio_dataset/not_scam/*.wav | Synthesized audio files for legitimate conversations (16 kHz, mono, WAV) |

---

## License

CC BY 4.0 (Creative Commons Attribution 4.0 International)
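## Example: checking a row against the documented schema

The column list and score ranges documented under Data Structure can be turned into a small sanity check. The sketch below is illustrative only and is not part of the dataset. The names `EXPECTED_COLUMNS`, `SCORE_RANGES`, `LABEL_TO_BINARY`, and `validate_row` are hypothetical, and the example row uses made-up values rather than a real record. It assumes each conversation is available as a plain dictionary keyed by column name.

```python
# Column names transcribed from the Data Structure table (18 in total).
EXPECTED_COLUMNS = [
    "conversation_id", "full_conversation",
    "caller_turn_1", "receiver_turn_1",
    "caller_turn_2", "receiver_turn_2", "caller_turn_3",
    "label", "category", "dialect",
    "urgency_score", "sensitive_info_requests",
    "financial_pressure_score", "threat_score", "impersonation_score",
    "conversation_length", "word_count", "label_binary",
]

# Inclusive (min, max) ranges documented for the integer risk scores.
SCORE_RANGES = {
    "urgency_score": (0, 5),
    "sensitive_info_requests": (0, 2),
    "financial_pressure_score": (0, 5),
    "threat_score": (0, 3),
    "impersonation_score": (0, 2),
}

# Documented mapping between the string label and its binary encoding.
LABEL_TO_BINARY = {"scam": 1, "not_scam": 0}


def validate_row(row):
    """Return a list of problems found in one conversation row (a dict)."""
    missing = [col for col in EXPECTED_COLUMNS if col not in row]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    for col, (low, high) in SCORE_RANGES.items():
        try:
            value = int(row[col])
        except (TypeError, ValueError):
            problems.append(f"{col} is not an integer: {row[col]!r}")
            continue
        if not low <= value <= high:
            problems.append(f"{col}={value} outside [{low}, {high}]")

    label = row["label"]
    if label not in LABEL_TO_BINARY:
        problems.append(f"unknown label: {label!r}")
    elif row["label_binary"] != LABEL_TO_BINARY[label]:
        problems.append("label_binary does not match label")
    return problems


# Hypothetical row with made-up values, used only to exercise the check.
example_row = dict.fromkeys(EXPECTED_COLUMNS, "")
example_row.update(
    conversation_id="CONV_0001",
    label="scam",
    label_binary=1,
    urgency_score=4,
    sensitive_info_requests=2,
    financial_pressure_score=3,
    threat_score=1,
    impersonation_score=2,
)
print(validate_row(example_row))  # prints []
```

If pandas and openpyxl are installed, and the spreadsheet's header row matches the column names above, the rows could be obtained with `pandas.read_excel("arabic_scam_dataset_complete.xlsx").to_dict("records")` and passed to `validate_row` one at a time.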

Categories

Cybersecurity, Bidirectional Encoder Representations From Transformers, Large Language Model
