Description
Spanish is widely used in real-world phishing campaigns, yet public email corpora remain largely English-centric and rarely encode social-engineering tactics at the psychological level. As a result, research on Spanish-language phishing often reduces the problem to binary detection and cannot systematically study how manipulation is expressed in language. SpaPhish addresses this gap by providing Spanish-native emails annotated under Ana Ferreira’s persuasion principles, with three independent annotators per message and written justifications.
SpaPhish tests the hypothesis that phishing-related behavior in Spanish email can be modeled more reliably when messages are natively Spanish (not translated or synthetic) and paired with explicit, human-grounded persuasion annotations.
The dataset contains 1,395 emails described by 47 variables. Each record is identified by a hash key and includes subject, body, and a date field (parseable for 1,371 records; 2014-06-07 to 2025-10-27). A binary Label is available for all entries (0 = 664; 1 = 731).
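The summary figures above (record count, label balance, number of parseable dates, date span) could be re-derived along the following lines. This is a minimal sketch on toy records, not the authors' procedure: the column name `date` and the ISO 8601 date format are assumptions, and only `Label` is taken from the description.

```python
from collections import Counter
from datetime import datetime

def summarize(records, date_key="date", label_key="Label"):
    """Count records, labels, and parseable dates.

    `date_key` and the ISO 8601 format are assumptions for illustration;
    the real file may use another column name or date format.
    """
    label_counts = Counter(r[label_key] for r in records)
    parsed = []
    for r in records:
        try:
            parsed.append(datetime.fromisoformat(r[date_key]))
        except (KeyError, TypeError, ValueError):
            continue  # missing or unparseable date: skip it
    span = (min(parsed).date(), max(parsed).date()) if parsed else None
    return {
        "n": len(records),
        "labels": dict(label_counts),
        "parseable_dates": len(parsed),
        "date_span": span,
    }

# Toy records standing in for rows of the dataset (hypothetical values).
toy = [
    {"date": "2014-06-07", "Label": 0},
    {"date": "2025-10-27", "Label": 1},
    {"date": "not a date", "Label": 1},
]
summary = summarize(toy)
```

Run on the full file, a check of this kind would be expected to report 1,395 records, 1,371 parseable dates, and label counts that sum to the record count (664 + 731 = 1,395).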
SpaPhish also provides a technical layer of derived attributes (e.g., URL statistics, hops, attachments). Link-bearing content appears in 86.02% of messages. Class-level aggregates differ: Label 0 shows higher mean url_count (8.47) and attachments_count (0.715) than Label 1 (url_count = 4.94; attachments_count = 0.033).
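Class-level aggregates of this kind can be computed by grouping records on Label and averaging each derived attribute. The sketch below uses toy values and treats a message as link-bearing when `url_count` is greater than zero, which is an assumption; the dataset may define link-bearing content differently.

```python
from collections import defaultdict
from statistics import mean

def class_means(records, columns=("url_count", "attachments_count"),
                label_key="Label"):
    """Return {label: {column: mean}} for the given numeric columns."""
    groups = defaultdict(list)
    for r in records:
        groups[r[label_key]].append(r)
    return {
        label: {c: mean(r[c] for r in rows) for c in columns}
        for label, rows in groups.items()
    }

def link_share(records, url_key="url_count"):
    """Percentage of records with at least one URL (assumed definition)."""
    return 100 * sum(r[url_key] > 0 for r in records) / len(records)

# Toy records with hypothetical values.
toy = [
    {"Label": 0, "url_count": 10, "attachments_count": 1},
    {"Label": 0, "url_count": 6, "attachments_count": 0},
    {"Label": 1, "url_count": 0, "attachments_count": 0},
    {"Label": 1, "url_count": 4, "attachments_count": 0},
]
means = class_means(toy)
share = link_share(toy)
```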
A defining component is the psychological annotation layer: three annotators label five persuasion dimensions (authority, social proof, liking/similarity deception, commitment–integrity–reciprocation, distraction). Per-annotator columns (*_A, *_B, *_C) and justification fields (justif_*) are preserved, alongside consolidated fields for benchmarking and analysis of inter-annotator variability.
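One simple way to work with the per-annotator columns is to take a majority vote per message and flag unanimous cases. The sketch below is illustrative only: the description does not state how the consolidated fields were derived, so majority voting is an assumption, as are the column prefix `authority` and the 0/1 label scale.

```python
from collections import Counter

def majority(votes):
    """Return the value chosen by more than half of the annotators, else None."""
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) / 2 else None

def consolidate(records, dimension, annotators=("A", "B", "C")):
    """Majority label and unanimity flag for one persuasion dimension.

    Column names follow the <dimension>_<annotator> pattern; the
    dimension prefix used below is hypothetical.
    """
    out = []
    for r in records:
        votes = [r[f"{dimension}_{a}"] for a in annotators]
        out.append({
            "majority": majority(votes),
            "unanimous": len(set(votes)) == 1,
        })
    return out

# Toy annotations with hypothetical values.
toy = [
    {"authority_A": 1, "authority_B": 1, "authority_C": 0},
    {"authority_A": 0, "authority_B": 0, "authority_C": 0},
]
result = consolidate(toy, "authority")
agreement_rate = sum(r["unanimous"] for r in result) / len(result)
```

The share of unanimous messages is only a rough indicator of inter-annotator variability; a chance-corrected statistic such as Fleiss' kappa would be a more standard choice for three annotators.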
SpaPhish is a multi-layer resource for Spanish phishing detection, persuasion modeling, and annotation-driven explainability, supporting research that links technical email features to psychologically grounded manipulation strategies.
| Date made available | 25 Dec 2025 |
|---|---|
| Publisher | Mendeley Data |