SpaPhish: A Spanish Dataset for Phishing and Psychological Pattern Detection
Description
Spanish is widely used in real-world phishing campaigns, yet public email corpora remain largely English-centric and rarely encode social-engineering tactics at the psychological level. As a result, research on Spanish-language phishing often reduces the problem to binary detection and cannot systematically study how manipulation is expressed in language. SpaPhish addresses this gap by providing Spanish-native emails annotated under Ana Ferreira’s persuasion principles, with three independent annotators per message and written justifications. It tests the hypothesis that phishing-related behavior in Spanish email can be modeled more reliably when messages are natively Spanish (not translated or synthetic) and paired with explicit, human-grounded persuasion annotations.

The dataset contains 1,395 emails described by 47 variables. Each record is identified by a hash key and includes subject, body, and a date field (parseable for 1,371 records, spanning 2014-06-07 to 2025-10-27). A binary Label is available for all entries (0 = 664; 1 = 731). SpaPhish also provides a technical layer of derived attributes (e.g., URL statistics, hops, attachments). Link-bearing content appears in 86.02% of messages. Class-level aggregates differ: Label 0 shows a higher mean url_count (8.47) and attachments_count (0.715) than Label 1 (url_count = 4.94; attachments_count = 0.033).

A defining component is the psychological annotation layer: three annotators label five persuasion dimensions (authority, social proof, liking/similarity deception, commitment–integrity–reciprocation, distraction). Per-annotator columns (*_A, *_B, *_C) and justification fields (justif_*) are preserved alongside consolidated fields, which supports both benchmarking and analysis of inter-annotator variability.

SpaPhish is a multi-layer resource for Spanish phishing detection, persuasion modeling, and annotation-driven explainability, supporting research that links technical email features to psychologically grounded manipulation strategies.
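As an illustration of how the per-annotator columns could be used, the sketch below derives a majority-vote label and computes Fleiss' kappa for three binary annotations of one principle. Fleiss' kappa is chosen here only as a common agreement statistic; the description does not state which measure, if any, the authors used. The toy votes are invented for the example and are not taken from SpaPhish, and the function names are hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return 1 if more than half of the binary votes are 1, else 0."""
    return int(sum(votes) * 2 > len(votes))

def fleiss_kappa(rows):
    """Fleiss' kappa for categorical ratings; each row holds one item's votes."""
    n_items = len(rows)
    n_raters = len(rows[0])
    totals = Counter()  # votes per category over all items
    p_bar = 0.0         # mean observed agreement per item
    for votes in rows:
        counts = Counter(votes)
        totals.update(counts)
        agree = sum(c * c for c in counts.values()) - n_raters
        p_bar += agree / (n_raters * (n_raters - 1))
    p_bar /= n_items
    # Expected agreement from the overall category proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # every vote is identical, so agreement is trivially perfect
    return (p_bar - p_e) / (1 - p_e)

# Toy votes (annotator A, B, C) for one principle; NOT real SpaPhish data.
toy_authority = [(1, 1, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 1, 1), (0, 0, 0)]
consolidated = [majority_vote(v) for v in toy_authority]
kappa = fleiss_kappa(toy_authority)
print(consolidated, round(kappa, 3))
```

With the real file, each tuple would come from the three per-annotator columns of a principle (for example, columns following the *_A, *_B, *_C pattern), after converting the values to integers.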
Steps to reproduce
SpaPhish was built through a structured workflow that preserves the original email artifacts while adding technical attributes and human psychological annotations:
1. Collect and identify raw emails. Spanish-language emails were collected from original sources and stored as individual message files (e.g., RFC-822 artifacts). Each email was assigned a stable identifier, represented by a unique hash in the dataset.
2. Parse and extract core fields. Emails were programmatically parsed to separate headers and body. The dataset stores subject, body (plain text), and date when available; missing subjects are allowed and dates may be empty if unavailable.
3. Derive technical attributes. From body and header metadata, technical variables were computed: URL extraction (urls list and url_count), attachment descriptors (attachments_count, types, sizes, total size), and routing depth (hops_count from normalized “Received” chains).
4. Assign the binary Label (manual curation). Each email received a single binary class label (Label ∈ {0,1}) under the dataset curation protocol (phishing vs legitimate). Label assignment and record-level curation/verification were performed manually.
5. Annotate persuasion principles. Three independent annotators (A, B, C) labeled five Ana Ferreira persuasion dimensions with binary values (0/1): authority, social proof, liking-similarity-deception, commitment–integrity–reciprocation, and distraction. Annotators also wrote short justifications (justif_*).
6. Compute consolidated reference labels. For each principle, a consolidated label was derived from the three annotations (e.g., majority agreement) and stored alongside the per-annotator columns.
7. Anonymize for public release. Before publishing, message content was manually processed to reduce exposure of sensitive or personally identifying information while preserving linguistic and rhetorical structure.
8. Export and validate. The final dataset was exported as a UTF-8, semicolon-separated CSV (47 variables; one row per email) and validated for unique hashes, complete Label values, and binary constraints for persuasion fields.
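The validation checks named in the export step (unique hashes, complete Label values, binary persuasion fields) could be reproduced along the following lines. This is a minimal sketch, not the authors' validation script: the column names `hash` and `authority_A` are assumptions made for illustration, the function `validate_rows` is hypothetical, and the sample rows are invented stand-ins for the real file.

```python
import csv
import io

def validate_rows(rows, id_col, label_col, binary_cols):
    """Return a list of human-readable problems found in the parsed rows."""
    problems = []
    seen = set()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        key = row[id_col]
        if key in seen:
            problems.append(f"line {line_no}: duplicate {id_col} {key!r}")
        seen.add(key)
        # The label and every persuasion field must be exactly "0" or "1".
        for col in [label_col, *binary_cols]:
            if row[col] not in ("0", "1"):
                problems.append(f"line {line_no}: {col}={row[col]!r} is not 0/1")
    return problems

# Toy semicolon-separated sample; column names and values are assumptions.
sample = (
    "hash;subject;Label;authority_A\n"
    "a1f3;Aviso de seguridad;1;1\n"
    "b7c9;;0;0\n"
    "a1f3;Factura pendiente;;2\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))
problems = validate_rows(rows, "hash", "Label", ["authority_A"])
for problem in problems:
    print(problem)
```

To run the same checks on the published file, the `io.StringIO(sample)` argument would be replaced by a file object opened with `encoding="utf-8"` and `newline=""`, and the column arguments adjusted to the actual header names.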
Institutions
- Universidad Iberoamericana
Funders
- Ibero American University, Mexico. Grant ID: Project: "Creación de un dataset de mensajes de phishing en español"