ScamGen

Name: ScamGen
Creator: Xu Han
Published: 2024-09-16T18:28:14.085Z
Keywords: Security Issue, Telephone

Han, Xu; Qi, Yaling; Cao, Hong Bo; Li, Qiang; Pedrycz, Witold; Wang, Wei

doi:10.17632/dkypjhkmgb.1

Published: 16 September 2024 | Version 1 | DOI: 10.17632/dkypjhkmgb.1

Description

ScamGen: A Comprehensive Dataset of Chinese Telephone Scams

This dataset, created with the ScamGen technique, captures the psychological dynamics between scammers and victims in Chinese telephone scams. It is derived from a multi-source data collection framework and expanded through template-based data augmentation, which applies sentence- and word-level perturbations to produce diverse, realistic scam scenarios.

The dataset covers scam strategies such as urgency, impersonation, and emotional manipulation, reflecting the psychological tactics scammers use in practice. According to the authors' evaluation, the ScamGen technique produces more diverse and higher-quality scam-related data than large language models. Five deep learning models for intent detection were developed alongside the dataset, with BERT achieving a precision of 86.68%.

The dataset is intended for researchers and practitioners in cybersecurity and fraud detection, to support the study of telephone scammer tactics and the development of more effective detection systems.
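To make the idea of template-based augmentation with sentence- and word-level perturbations concrete, here is a minimal toy sketch in Python. It is not the ScamGen implementation: the templates, slot values, synonym table, and function names are all hypothetical, the example text is English rather than Chinese, and "sentence-level perturbation" is assumed here to mean picking among alternative template phrasings.

```python
import random

# Hypothetical templates with slots; none of this text comes from the dataset.
# Choosing among alternative phrasings stands in for sentence-level perturbation.
TEMPLATES = [
    "This is {org}. Your {item} has a problem, please {action} now.",
    "Hello, {org} speaking. There is a problem with your {item}, so {action} now.",
]

# Hypothetical slot fillers.
SLOTS = {
    "org": ["the bank", "the courier company"],
    "item": ["account", "parcel"],
    "action": ["verify your identity", "call us back"],
}

# Hypothetical synonym table used for word-level perturbation.
SYNONYMS = {"now": ["immediately", "right away"], "problem": ["issue"]}


def fill_template(template, slots, rng):
    """Fill every slot in the template with a randomly chosen value."""
    return template.format(**{name: rng.choice(values) for name, values in slots.items()})


def perturb_words(sentence, synonyms, rng, p=0.5):
    """Word-level perturbation: swap known words for a synonym with probability p."""
    out = []
    for token in sentence.split():
        core = token.strip(".,")  # ignore trailing punctuation when matching
        if core in synonyms and rng.random() < p:
            token = token.replace(core, rng.choice(synonyms[core]))
        out.append(token)
    return " ".join(out)


def generate(n, seed=0):
    """Generate n augmented samples; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        sentence = fill_template(rng.choice(TEMPLATES), SLOTS, rng)
        samples.append(perturb_words(sentence, SYNONYMS, rng))
    return samples
```

Calling `generate(5, seed=1)` returns five filled and perturbed sentences. Seeding the generator keeps runs repeatable, which helps when comparing augmentation settings.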

Steps to reproduce

Reproducibility instructions

To reproduce the data generation process, run the following scripts in seed_gen/ in the order given:

1. Sent2Sample.py: processes the original sentences and maps them to predefined templates for data augmentation.
2. Sample2Data.py: converts the samples generated by Sent2Sample into a structured data format.
3. Data2Dataset.py: the final step in the pipeline, which compiles the structured data into a dataset ready for experiments.

The output is the dataset, stored in Data/D_Dataset_v5 and ready for use in experiments.

Data locations:

Original seed list: the seed data used for augmentation is in the directory Data/Seed.
Generated data: the three versions of the generated dataset are stored in Data/D_Dataset_v5.
Experimental dataset: the "datasetv1 Hard version" in this directory was used for the experiments.

Label mapping:

Official: 0
Speculation: 1
Imprudent: 2
Relationship: 3
Others: 4

For data analysis, open the provided Jupyter notebooks (.ipynb files) and run them to explore the data quantity and distribution across different dimensions.
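The script order and label encoding described above can be captured in a small helper. The following is a sketch under stated assumptions: it assumes the three scripts take no command-line arguments and are launched from the repository root with seed_gen/ as a relative path. The names `build_commands` and `run_pipeline` are hypothetical and not part of the published code.

```python
import subprocess
import sys
from pathlib import Path

# Script order as stated in the reproduction steps.
PIPELINE_SCRIPTS = ["Sent2Sample.py", "Sample2Data.py", "Data2Dataset.py"]

# Label encoding as listed in the reproduction steps.
LABEL_TO_ID = {
    "Official": 0,
    "Speculation": 1,
    "Imprudent": 2,
    "Relationship": 3,
    "Others": 4,
}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}


def build_commands(script_dir="seed_gen", python=sys.executable):
    """Return one argument list per pipeline stage, in execution order.

    Assumption: each script is run without arguments.
    """
    return [[python, str(Path(script_dir) / name)] for name in PIPELINE_SCRIPTS]


def run_pipeline(script_dir="seed_gen"):
    """Run each stage in turn and stop at the first failing script."""
    for command in build_commands(script_dir):
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` from the repository root would execute the three stages in sequence. Because `check=True` raises on a non-zero exit code, a failure in Sent2Sample.py prevents the later stages from running on incomplete input.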

Institutions

Xi'an Jiaotong University
Beijing Jiaotong University
University of Alberta

Funders

Beijing Natural Science Foundation
Grant ID: L221014
National Natural Science Foundation of China
China
Grant ID: U21A20463
National Natural Science Foundation of China
China
Grant ID: U23A20304
Haihe Lab of ITAI
Grant ID: 24HHXCSS00003
National Natural Science Foundation of China
China
Grant ID: U22B2027
