SpaPhish: A Spanish Dataset for Phishing and Psychological Pattern Detection
Description
Spanish is widely used in real-world phishing campaigns, yet public email corpora remain largely English-centric and rarely encode social-engineering tactics at the psychological level. Consequently, research on Spanish phishing has often been reduced to binary detection, limiting the systematic study of how manipulation is conveyed through language. SpaPhish addresses this gap by providing a Spanish-native email corpus annotated under Ana Ferreira’s Principles of Persuasion framework.
The dataset contains 1,395 emails described by 47 variables. Each record is identified by a SHA-256 hash key and includes the subject, body, and a date field, which is parseable for 1,371 records and spans from July 2014 to October 2025. A binary class label is available for all entries, with 664 legitimate emails and 731 phishing emails.
SpaPhish is structured as a multi-layer resource. Its technical layer includes derived attributes such as URL statistics, routing depth, and attachment metadata. Link-bearing content appears in 86.02% of messages. At the class level, legitimate emails show higher mean values for both url_count (8.47 vs. 4.94) and attachments_count (0.715 vs. 0.033) than phishing emails.
A central contribution of the dataset is its psychological annotation layer. Three independent annotators labeled each message across five persuasion dimensions: authority, social_proof, liking_similarity_deception, commitment_integrity_reciprocation, and distraction. The dataset preserves the individual annotator decisions through per-annotator columns (*_A, *_B, *_C) and associated justification fields (justif_*), while also providing consolidated consensus labels for benchmarking and inter-annotator agreement analysis. In cases of complete disagreement, a fourth expert adjudicator resolved the final label.
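As an illustration of the inter-annotator agreement analysis that the per-annotator columns make possible, the following is a minimal sketch of Fleiss' kappa for one persuasion dimension. It is not taken from the project's scripts. The input is assumed to be one tuple of three binary votes per email, for example read from hypothetical columns named authority_A, authority_B, and authority_C; the exact column names should be checked against dataset_schema.json.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item tuples, one label per annotator.

    Assumes every item was rated by the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for row in ratings for label in row})

    # counts[i][j] = number of annotators who gave item i the j-th category
    counts = [[Counter(row)[cat] for cat in categories] for row in ratings]

    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall category proportions
    p_categories = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in p_categories)

    if p_expected == 1.0:
        # Only one category was ever used: kappa is undefined
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy votes (A, B, C) for four emails; illustrative values, not dataset records
toy_votes = [(1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 0, 1)]
kappa = fleiss_kappa(toy_votes)
```

Applying the same function to each of the five dimensions gives one agreement score per persuasion principle.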
The repository also includes supporting documentation and resources: the primary dataset file (SpaPhish dataset-DiB.csv), a machine-readable schema (dataset_schema.json), a complete variable reference with data types, descriptions, and extraction logic for all 47 variables (SpaPhish_Dataset_Schema.pdf), a data dictionary (README.txt), and an interactive HTML exploratory report (SpaPhish_html_report.zip). Processing and analysis scripts are available at: https://github.com/lbustio/spa_phish. SpaPhish supports research on Spanish phishing detection, persuasion modeling, and annotation-driven explainability by linking technical email attributes with psychologically grounded manipulation strategies.
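A minimal sketch of reading the primary file and tallying the class label, using only the Python standard library. It relies on the export being a UTF-8 semicolon-separated CSV with a Label column, as stated in the reproduction steps; the helper name label_counts and the in-memory sample rows (including the hash column name) are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import Counter


def label_counts(handle):
    """Count rows per value of the Label column in a semicolon-separated CSV."""
    reader = csv.DictReader(handle, delimiter=";")
    return Counter(row["Label"] for row in reader)


# Tiny in-memory stand-in for the real file (made-up rows, assumed column names)
sample = io.StringIO(
    "hash;subject;Label\n"
    "abc;Hola;1\n"
    "def;Factura;0\n"
    "ghi;Aviso;1\n"
)
counts = label_counts(sample)
```

With the real file, the handle would be opened as open("SpaPhish dataset-DiB.csv", encoding="utf-8", newline=""). The two resulting counts should then match the 664 legitimate and 731 phishing emails reported above; which of the values 0 and 1 denotes phishing is documented in the schema files rather than assumed here.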
Steps to reproduce
SpaPhish was built through a structured workflow that preserves the original email artifacts while adding technical attributes and human psychological annotations:
1. Collect and identify raw emails. Spanish-language emails were collected from the personal and institutional inboxes of the dataset contributors and stored as individual message files (RFC-822 artifacts). Duplicate messages were identified and removed using SHA-256 hashes computed over the raw email message. Each unique email was retained for further processing.
2. Parse and extract core fields. Emails were programmatically parsed to separate headers and body. The dataset stores subject, body (plain text), and date when available; missing subjects are allowed and dates may be empty if unavailable.
3. Derive technical attributes. From body and header metadata, technical variables were computed: URL extraction (urls list and url_count), attachment descriptors (attachments_count, types, sizes, total size), and routing depth (hops_count from normalized Received headers).
4. Anonymize for public release. Before labeling, message content was manually processed to remove or replace personally identifiable information (personal names, organizations, email addresses, phone numbers, account identifiers, physical addresses, and location references). Sensitive spans were replaced with fictitious surrogates of the same semantic category to preserve linguistic and rhetorical structure.
5. Assign the binary Label. Each email received a binary class label (Label in {0, 1}) using a triple-annotator protocol following the NIST definition of phishing. Three domain experts independently assessed each message and assigned a phishing or legitimate label. Consensus labels were derived by majority voting, with adjudication by a fourth expert in cases of full disagreement.
6. Annotate persuasion principles. Three independent annotators (A, B, C) labeled five persuasion dimensions from Ana Ferreira's framework with binary values (0/1): authority, social_proof, liking_similarity_deception, commitment_integrity_reciprocation, and distraction. Annotators also wrote short structured justifications in Spanish (justif_*).
7. Compute consolidated reference labels. For each persuasion dimension, a consolidated consensus label was derived from the three individual annotations by majority voting. In cases of full disagreement, a fourth expert assigned the final label through adjudication.
8. Export and validate. The final dataset was exported as a UTF-8 semicolon-separated CSV (47 variables; one row per email) and validated for unique SHA-256 hashes, complete Label values, and binary constraints for all persuasion annotation fields.
Processing and analysis scripts are available at: https://github.com/lbustio/spa_phish.
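The deduplication, parsing, and consensus steps described above can be sketched with the Python standard library as follows. This is an illustrative approximation, not the code in the linked repository: the function names are hypothetical, hops_count is approximated by counting Received headers without the normalization the authors mention, and the adjudicated argument stands in for the fourth expert's decision.

```python
import hashlib
from collections import Counter
from email import message_from_bytes, policy


def dedupe_by_sha256(raw_messages):
    """Keep the first copy of each raw message, keyed by its SHA-256 hex digest."""
    unique = {}
    for raw in raw_messages:
        digest = hashlib.sha256(raw).hexdigest()
        unique.setdefault(digest, raw)
    return unique


def core_fields(raw):
    """Parse one RFC-822 message into subject, date, plain-text body, hop count."""
    msg = message_from_bytes(raw, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain",))
    return {
        "subject": msg["Subject"],  # None when the header is missing
        "date": msg["Date"],        # None when the header is missing
        "body": body_part.get_content() if body_part is not None else "",
        # Simplification: raw count of Received headers, no normalization
        "hops_count": len(msg.get_all("Received", [])),
    }


def consensus(votes, adjudicated=None):
    """Return the strict-majority label, else the adjudicator's label."""
    label, count = Counter(votes).most_common(1)[0]
    if count > len(votes) / 2:
        return label
    return adjudicated


# Made-up messages for illustration only
raw_a = b"Received: from a\r\nReceived: from b\r\nSubject: Hola\r\n\r\nTexto\r\n"
raw_b = b"Subject: Otro\r\n\r\nMas texto\r\n"
unique = dedupe_by_sha256([raw_a, raw_a, raw_b])
fields = core_fields(raw_a)
```

The same consensus helper applies to the binary Label and to each persuasion dimension, for example consensus((1, 1, 0)) for the three votes on one email.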
Funders
- Ibero American University, Mexico. Grant ID: Project "Creación de un dataset de mensajes de phishing en español" ("Creation of a dataset of phishing messages in Spanish")