Multimodal Dataset for Investment-Related Deceptive Content Detection on Social Platforms

Name: Multimodal Dataset for Investment-Related Deceptive Content Detection on Social Platforms
Creator: Yi Xuan Kong
Published: 2026-05-13T15:28:06.115Z
Keywords: Artificial Intelligence, Social Media, Machine Learning, Multimodality, Deception

doi:10.17632/6wnd7jrt6z.2

Published: 13 May 2026 | Version 2 | DOI: 10.17632/6wnd7jrt6z.2

Contributor:

Yi Xuan Kong

Description

This is a processed and anonymised dataset for research on deceptive or suspicious investment-related content on social platforms. It integrates five public source groups covering phishing text, spam email, fake-profile posts, Twitter bot-detection records, and finance-related social media data, which were harmonised into a common binary schema and filtered for investment relevance. The deposited file contains 16,202 records and 32 columns. It includes source identifiers, text content, binary labels, deterministic data partitions, investment-filter outputs, and 20 standardised behavioural metadata features. These metadata features span raw account and content counts, relational proxy measures, and Boolean profile indicators, enabling both text-only and multimodal analysis. Of the 16,202 records, 14,035 carry text plus metadata and 2,167 carry text only, which allows evaluation under partial-modality conditions where behavioural information is unavailable. The variables support studies in multimodal classification, metadata ablation, cross-source benchmarking, and analysis of behavioural patterns associated with deceptive investment-related communication. Labels represent investment-related deceptive or suspicious behaviour derived from harmonised source annotations after filtering, and should not be interpreted as verified ground-truth investment scam status for every individual record. The dataset is suitable for benchmarking machine learning models, studying heterogeneous social-platform deception signals, and supporting reproducible experiments on investment-focused content detection.
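The partial-modality split described above can be illustrated with a small sketch. It assumes that a text-only record is one whose metadata fields are all missing, and that missing values appear as empty CSV cells. The metadata column names (`follower_count`, `is_verified`) and the sample rows are hypothetical, since the 20 feature names are not listed in this description.

```python
import csv
import io
import math

# Hypothetical metadata columns; the deposited file has 20 such features.
META_COLS = ["follower_count", "is_verified"]

# Hypothetical sample rows; empty cells stand for unavailable metadata.
SAMPLE = """record_id,text_content,label,follower_count,is_verified
r1,Join our trading group,1,0,0
r2,Quarterly earnings call today,0,,
r3,Free signals inside,1,250,
"""

def to_float(value):
    """Parse a CSV cell, keeping empty cells as NaN instead of zero."""
    return float(value) if value != "" else math.nan

def has_metadata(row):
    """True if at least one metadata feature was observed for this record."""
    return any(not math.isnan(row[col]) for col in META_COLS)

rows = []
for raw in csv.DictReader(io.StringIO(SAMPLE)):
    row = dict(raw)
    for col in META_COLS:
        row[col] = to_float(raw[col])
    rows.append(row)

# r1 has valid zero-valued metadata, so it counts as text-plus-metadata.
text_plus_meta = [r["record_id"] for r in rows if has_metadata(r)]
text_only = [r["record_id"] for r in rows if not has_metadata(r)]
```

Keeping NaN distinct from zero is what lets record `r1` (observed zeros) be treated differently from record `r2` (nothing observed).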

Files

Steps to reproduce

The dataset was constructed from five publicly available source groups related to phishing, spam, fake-profile activity, bot detection, and stock-related social media content. No new raw data were collected. Instead, existing datasets were integrated and transformed into a unified representation for analysing investment-related deceptive or suspicious content. The workflow consisted of three main stages, followed by partitioning and anonymisation.
First, the source datasets were harmonised into a common schema. Source-specific text fields were mapped into a single text_content field, labels were converted into a binary target, and provenance fields such as record_id, source_dataset, and source_modality were retained to preserve traceability across sources.
Second, the harmonised records were filtered to retain investment-related content using a two-stage process. The first stage applied a lexicon-based filter using investment-related keywords. Records not retained by the lexicon stage were then evaluated by a semantic similarity filter using manually designed investment-related prototype prompts. Sentence embeddings were generated with the sentence-transformers/all-MiniLM-L6-v2 model, and cosine similarity was used to derive a semantic score. Records were retained if they passed either the lexicon or the semantic stage. Filtering outputs such as gate_path, lexicon_score, semantic_score, and investment_score were preserved in the final dataset.
Third, behavioural feature engineering was applied. A standardised set of 20 canonical metadata features was created where source metadata were available, including raw activity counts, relational proxy features, and Boolean profile indicators. Missing metadata values were preserved as unavailable (NaN) rather than replaced, so unobserved fields remain distinguishable from valid zero-valued observations.
A deterministic partitioning procedure based on hashed record identifiers was then used to assign records to train, validation, and test subsets.
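The two-stage gate and the hash-based partitioning can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the lexicon, prototype prompts, thresholds, `gate_path` values, hash function (SHA-256), and 80/10/10 split ratios are all hypothetical. A toy bag-of-words vector stands in for the sentence-transformers/all-MiniLM-L6-v2 embeddings so that the sketch runs without external dependencies. How `investment_score` combines the two stages is not specified in the description, so it is not sketched.

```python
import hashlib
import math
import re
from collections import Counter

# Hypothetical lexicon, prototypes, and thresholds (not the dataset's own).
LEXICON = {"invest", "investment", "stock", "stocks", "crypto", "trading"}
PROTOTYPES = ["double your money fast", "guaranteed profit every week"]
LEXICON_MIN_HITS = 1
SEMANTIC_THRESHOLD = 0.5

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def toy_embed(text):
    """Stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(tokenize(text))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def investment_gate(text, embed=toy_embed):
    """Lexicon stage first; the semantic stage only sees lexicon misses."""
    lexicon_score = sum(tok in LEXICON for tok in tokenize(text))
    if lexicon_score >= LEXICON_MIN_HITS:
        return {"gate_path": "lexicon", "lexicon_score": lexicon_score,
                "semantic_score": None}
    vec = embed(text)
    semantic_score = max(cosine(vec, embed(p)) for p in PROTOTYPES)
    passed = semantic_score >= SEMANTIC_THRESHOLD
    return {"gate_path": "semantic" if passed else "rejected",
            "lexicon_score": lexicon_score, "semantic_score": semantic_score}

def assign_split(record_id, val_pct=10, test_pct=10):
    """Map a record ID to a split via a stable hash bucket in [0, 100)."""
    digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"
```

Because the split depends only on the record identifier, re-running the assignment on the same IDs reproduces the same partitions regardless of row order. A real embedding function could be passed through the `embed` parameter, with `cosine` adapted to dense vectors.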
Before release, the dataset was anonymised by replacing potentially identifying spans such as email addresses, URLs, user handles, phone numbers, and long numeric identifiers with standard placeholder tokens. The final released file is provided in CSV format to support reproducibility and reuse.
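A placeholder-based anonymisation pass of this kind might look like the sketch below. The regular expressions, their ordering, and the token names (`[URL]`, `[EMAIL]`, `[USER]`, `[ID]`, `[PHONE]`) are assumptions for illustration; the description does not state the actual patterns or tokens used.

```python
import re

# Hypothetical patterns and placeholder tokens, applied in this order.
# URLs and emails go first so that their "@" and digits are not re-matched.
PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+"), "[URL]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
    (re.compile(r"(?<!\w)@\w{2,}"), "[USER]"),
    # Unbroken runs of 12+ digits are treated as long numeric identifiers.
    (re.compile(r"\b\d{12,}\b"), "[ID]"),
    # Phone-like spans: optional "+", digits with spaces, dots, or dashes.
    (re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)"), "[PHONE]"),
]

def anonymise(text):
    """Replace potentially identifying spans with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text
```

For example, `anonymise("DM @trader_joe or email bob@example.com")` yields `"DM [USER] or email [EMAIL]"`. Patterns like these are heuristic, so some identifying spans can slip through and some harmless numbers can be masked.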

Institutions

Multimedia University
Melaka, Malacca
