ScamGen

Published: 16 September 2024 | Version 1 | DOI: 10.17632/dkypjhkmgb.1

Description

ScamGen: A Comprehensive Dataset of Chinese Telephone Scams

This dataset, created with the ScamGen technique, captures the psychological dynamics between scammers and victims in Chinese telephone scams. It is derived from a multi-source data collection framework and expanded through a template-based data augmentation method that generates diverse and realistic scam scenarios. The augmentation applies sentence- and word-level perturbations to the scammer-victim interactions so that the data spans a wide variety of scam types and techniques. The dataset covers scam strategies such as urgency, impersonation, and emotional manipulation, reflecting the psychological tactics scammers use in real calls. In the accompanying evaluation, ScamGen outperformed large language models at generating diverse, high-quality scam-related data. Alongside the dataset, five deep learning models for intent detection were developed, with BERT achieving a precision of 86.68%. The dataset is intended for researchers and practitioners in cybersecurity and fraud detection, supporting a deeper understanding of telephone scammer tactics and the development of more effective detection systems.

Steps to reproduce

Reproducibility instructions

To reproduce the data generation process, run the three scripts in seed_gen/ in the following order:

1. Sent2Sample.py: processes the original sentences and maps them to predefined templates for data augmentation.
2. Sample2Data.py: converts the samples generated by Sent2Sample into a structured data format.
3. Data2Dataset.py: the final step in the pipeline, which compiles the structured data into a dataset ready for experiments.

The output is the dataset, stored in Data/D_Dataset_v5, ready for use in experiments.

Data locations:

Original seed list: the seed data used for augmentation is located in the directory Data/Seed.
Generated data: the three versions of the generated dataset are stored in Data/D_Dataset_v5.
Experimental dataset: the "datasetv1 Hard version" in this directory was used for conducting experiments.

The categories are coded as follows:

Official: 0
Speculation: 1
Imprudent: 2
Relationship: 3
Others: 4

For data analysis, open the provided Jupyter notebooks (.ipynb files) and run them to explore the data quantity and distribution across different dimensions.
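The script order and category codes above can be sketched as a small Python driver. This is only an illustration under stated assumptions: it assumes the three scripts sit directly in seed_gen/, take no command-line arguments, and are launched from the repository root. None of this is confirmed by the record, and the helper names build_commands and run_pipeline are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script order as given in the reproduction steps.
PIPELINE = ["Sent2Sample.py", "Sample2Data.py", "Data2Dataset.py"]

# Category codes as listed in the reproduction steps.
LABELS = {
    "Official": 0,
    "Speculation": 1,
    "Imprudent": 2,
    "Relationship": 3,
    "Others": 4,
}
# Reverse lookup, e.g. to turn a numeric code back into its category name.
ID_TO_LABEL = {code: name for name, code in LABELS.items()}


def build_commands(script_dir="seed_gen", python=sys.executable):
    """Return one command per pipeline stage, in execution order.

    Assumption: the scripts need no arguments (hypothetical).
    """
    return [[python, str(Path(script_dir) / name)] for name in PIPELINE]


def run_pipeline(script_dir="seed_gen"):
    """Run each stage in order and stop at the first failure."""
    for cmd in build_commands(script_dir):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a later stage never runs on incomplete output.
        subprocess.run(cmd, check=True)
```

Calling run_pipeline() from the repository root would then run the three stages in sequence. If the scripts expect a different working directory or take arguments, the commands would need adjusting.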

Institutions

Xi'an Jiaotong University, Beijing Jiaotong University, University of Alberta

Categories

Security Issue, Telephone

Funding

Beijing Natural Science Foundation: L221014
National Natural Science Foundation of China: U21A20463
National Natural Science Foundation of China: U23A20304
Haihe Lab of ITAI: 24HHXCSS00003
National Natural Science Foundation of China: U22B2027
