SpaPhish: A Spanish Dataset for Phishing and Psychological Pattern Detection

Published: 25 December 2025 | Version 1 | DOI: 10.17632/hz2d6gz7pc.1

Description

The SpaPhish dataset is a Spanish-language email corpus created under the hypothesis that phishing detection in Spanish benefits from (i) native-language message content and (ii) explicit, human-validated psychological annotations that make social-engineering tactics measurable and interpretable. The data provides both the linguistic surface (email subject and body) and message-level technical indicators commonly associated with email abuse, enabling joint analysis of what the message says and how it is technically packaged.

SpaPhish contains 1,463 hashed message records; 1,457 of them include a valid binary class label encoded as 000/001 (distribution: 001 = 780, 000 = 677). Each record includes: (a) a unique hash, (b) subject, body, and date when available (dates span 2014-06-07 to 2025-10-27 in the current file), and (c) extracted technical fields such as url_count (plus the URL list), hops_count, and attachment descriptors (attachments_count, types, and sizes). Notable corpus-level observations from the labeled subset are that 85.6% of messages contain at least one URL (url_count>0), attachments are present in 18.4% of messages (attachments_count>0), and hop counts are typically small (median hops_count = 3).

A distinctive component of SpaPhish is its psychological layer: for each message, three independent annotators (_A, _B, _C) provide binary presence labels (000/001) for five persuasion principles reflected in the schema: authority, social_proof, liking/similarity deception, commitment/integrity/reciprocation, and distraction, along with free-text justifications (justif_*) documenting the rationale for each judgment. In addition, consolidated (non-suffixed) columns provide an aggregated label per principle intended for benchmarking and downstream modeling.

To use the dataset correctly, consumers should treat the per-annotator labels as the primary evidence, use the consolidated labels as a convenience layer, and filter out the small fraction of records that contain non-binary placeholders in some annotation fields.
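The usage guidance above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the exact per-annotator column names (for example authority_A), the helper names, and the CSV path passed to load_clean are assumptions, so check them against the header of the released file. The parser accepts both the 000/001 encoding and plain 0/1.

```python
import csv

# Principle names as listed in the description; the "<principle>_<annotator>"
# column pattern below is an assumption about the CSV header.
PRINCIPLES = [
    "authority",
    "social_proof",
    "liking_similarity_deception",
    "commitment_integrity_reciprocation",
    "distraction",
]
ANNOTATORS = ("A", "B", "C")


def parse_binary(value):
    """Return 0 or 1 for values such as '0', '1', '000', '001'; None otherwise."""
    text = (value or "").strip()
    if text.isdecimal() and int(text) in (0, 1):
        return int(text)
    return None


def clean_rows(rows):
    """Keep rows whose class label and all per-annotator labels are binary."""
    kept = []
    for row in rows:
        label = parse_binary(row.get("Label"))
        votes = {
            p: [parse_binary(row.get(f"{p}_{a}")) for a in ANNOTATORS]
            for p in PRINCIPLES
        }
        has_placeholder = any(v is None for vs in votes.values() for v in vs)
        if label is None or has_placeholder:
            continue  # drop records with missing or non-binary values
        kept.append({"hash": row.get("hash"), "Label": label, "votes": votes})
    return kept


def agreement_rate(records, principle):
    """Share of records where all three annotators gave the same label."""
    if not records:
        return 0.0
    unanimous = sum(1 for r in records if len(set(r["votes"][principle])) == 1)
    return unanimous / len(records)


def load_clean(path):
    """Read the CSV at `path` (hypothetical file name) and return cleaned records."""
    with open(path, newline="", encoding="utf-8") as handle:
        return clean_rows(csv.DictReader(handle))
```

Keeping the three votes per principle, rather than only the consolidated column, lets agreement statistics be computed before any model is trained on the consensus labels.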

Steps to reproduce

1.- Collect raw emails. Start from the original email sources in Spanish (raw email files, e.g., RFC-822 style). Store each message as an individual file and assign a stable identifier. In SpaPhish, each record is referenced by a hash that uniquely identifies the message.

2.- Extract message text. For each raw email, parse the header/body structure and extract:
   - subject (may be empty if the original email has no subject)
   - body (plain text as available after parsing)
   - date (if present in the header; otherwise leave empty)

3.- Derive technical features from the email artifact. From the parsed message and its header metadata, compute and store:
   - URL information: extract all URLs from the body, store the list in urls, and compute url_count.
   - Attachments: count attachments (attachments_count) and record types and sizes (attachments_types, attachments_sizes, and size totals as provided in the dataset).
   - Routing depth: compute hops_count from the chain of relay headers (e.g., by counting "Received" hops after normalization).

4.- Assign the binary class label. For each email, assign the message-level binary class Label ∈ {0,1} according to the dataset's ground-truth convention (phishing vs non-phishing as defined by the curation protocol used for SpaPhish). Ensure every record has exactly one Label value.

5.- Annotate persuasion principles with three evaluators. For each email, three independent annotators (A, B, C) assign binary presence labels (0/1) for the following persuasion dimensions:
   - authority
   - social_proof
   - liking_similarity_deception
   - commitment_integrity_reciprocation
   - distraction
   Annotators also provide a short free-text justification for each assigned label (justif_*) documenting the evidence in the message (phrases, cues, or rhetorical patterns).

6.- Compute consolidated (consensus) labels. For each principle, compute the consolidated label from the three annotators (e.g., by majority vote), and store it in the non-suffixed principle columns. Keep the per-annotator columns to enable agreement analysis.

7.- Export the dataset. Export one row per email record into a CSV file with the complete schema (47 fields), preserving:
   - the unique hash
   - subject, body, date
   - all derived technical features
   - Label
   - per-annotator principle labels (*_A, *_B, *_C) and justif_*
   - consolidated principle labels

8.- Validation checks (recommended). Before release, validate:
   - hash is unique (no duplicates)
   - Label has no missing values and contains only {0,1}
   - all principle label fields contain only {0,1}
   - empty subject is allowed; empty body should be handled explicitly (remove or justify)
   - date may be empty if unavailable
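The extraction, consensus, and validation steps can be sketched with the Python standard library. This is an illustration under stated assumptions rather than the pipeline actually used for SpaPhish: the hash function (SHA-256 over the raw bytes), the URL regular expression, and the function names are all hypothetical choices, and hops_count is taken here as the raw number of "Received" headers with no normalization applied.

```python
import hashlib
import re
from email import message_from_bytes, policy

# Simple URL pattern (assumption); the dataset's actual extractor is not documented.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def extract_record(raw_bytes):
    """Parse one raw RFC-822 style message into a flat record."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    urls = URL_RE.findall(body)
    attachments = list(msg.iter_attachments())
    return {
        # SHA-256 is an assumption; the source only says "a hash".
        "hash": hashlib.sha256(raw_bytes).hexdigest(),
        "subject": str(msg["Subject"] or ""),
        "body": body,
        "date": str(msg["Date"] or ""),
        "urls": urls,
        "url_count": len(urls),
        "attachments_count": len(attachments),
        "attachments_types": [a.get_content_type() for a in attachments],
        # Raw count of relay headers; no normalization is attempted here.
        "hops_count": len(msg.get_all("Received", [])),
    }


def majority_vote(votes):
    """Consensus from three binary votes: 1 if at least two annotators chose 1."""
    return 1 if sum(votes) >= 2 else 0


def validate_records(records):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    seen = set()
    for rec in records:
        if rec["hash"] in seen:
            problems.append(f"duplicate hash: {rec['hash']}")
        seen.add(rec["hash"])
        if rec.get("Label") not in (0, 1):
            problems.append(f"invalid Label for {rec['hash']}")
    return problems
```

With three annotators and binary labels a majority always exists, so no tie-breaking rule is needed; a different consolidation rule would only require replacing majority_vote.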

Institutions

Universidad Iberoamericana

Categories

Cybersecurity, Natural Language Processing, Machine Learning, Data Analytics Cybersecurity
