SpaPhish: A Spanish Dataset for Phishing and Psychological Pattern Detection

  • Lázaro Bustio-Martínez (Creator)
  • Viviana Fuentes-Fuentes (Creator)
  • Luisa Fernanda Agudelo-Fuentes (Creator)
  • Vitali Herrera-Semenets (Creator)
  • Darian Llanes-Guilarte (Creator)
  • Felipe Antonio Trujillo-Fernández (Creator)
  • Carlos Antonio Cardeña-Matamoros (Creator)
  • Carlos Francisc Betancourt-Moreno (Creator)
  • Joshua Ismael Haase-Hernández (Creator)
  • Andrés Guillermo Molano-Jiménez (Creator)
  • J. van den Berg (Creator)

Dataset

Description

Spanish is widely used in real-world phishing campaigns, yet public email corpora remain largely English-centric and rarely encode social-engineering tactics at the psychological level. As a result, research on Spanish-language phishing often reduces the problem to binary detection and cannot systematically study how manipulation is expressed in language. SpaPhish addresses this gap by providing Spanish-native emails annotated under Ana Ferreira’s persuasion principles, with three independent annotators per message and written justifications.

SpaPhish tests the hypothesis that phishing-related behavior in Spanish email can be modeled more reliably when messages are natively Spanish (not translated or synthetic) and paired with explicit, human-grounded persuasion annotations.

The dataset contains 1,395 emails described by 47 variables. Each record is identified by a hash key and includes subject, body, and a date field (parseable for 1,371 records; 2014-06-07 to 2025-10-27). A binary Label is available for all entries (0 = 664; 1 = 731).
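The reported counts can be cross-checked with a little arithmetic. The sketch below uses only the figures stated in the description; the variable names are illustrative and nothing is read from the dataset itself.

```python
# Figures as reported in the dataset description.
total_emails = 1395
label_counts = {0: 664, 1: 731}
parseable_dates = 1371

# The two classes should account for every record.
assert sum(label_counts.values()) == total_emails

# Share of each class, and share of records with a parseable date.
class_share = {label: n / total_emails for label, n in label_counts.items()}
date_coverage = parseable_dates / total_emails

print({label: round(share, 3) for label, share in class_share.items()})
# prints {0: 0.476, 1: 0.524}
print(round(date_coverage, 4))
# prints 0.9828
```

The classes are therefore close to balanced (about 48% vs. 52%), and roughly 98% of records carry a parseable date.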

SpaPhish also provides a technical layer of derived attributes (e.g., URL statistics, hops, attachments). Link-bearing content appears in 86.02% of messages. Class-level aggregates differ: Label 0 shows higher mean url_count (8.47) and attachments_count (0.715) than Label 1 (url_count = 4.94; attachments_count = 0.033).
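Class-level aggregates like these could be reproduced roughly as follows. This is a minimal sketch, assuming the records have been loaded as a list of dictionaries (for example via `csv.DictReader`) and that the columns are named `Label`, `url_count`, and `attachments_count` as in the description. The helper `class_means` and the toy rows are hypothetical and are not part of the dataset.

```python
from collections import defaultdict
from statistics import mean


def class_means(records, label_key="Label",
                fields=("url_count", "attachments_count")):
    """Return {label: {field: mean}} for the given numeric fields."""
    grouped = defaultdict(list)
    for row in records:
        grouped[row[label_key]].append(row)
    return {
        label: {f: mean(float(r[f]) for r in rows) for f in fields}
        for label, rows in grouped.items()
    }


# Toy rows invented for illustration; these are NOT SpaPhish records.
toy = [
    {"Label": 0, "url_count": 8, "attachments_count": 1},
    {"Label": 0, "url_count": 10, "attachments_count": 0},
    {"Label": 1, "url_count": 5, "attachments_count": 0},
]

print(class_means(toy))
```

Values are cast to `float` so the same function works on string fields read from a CSV file.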

A defining component is the psychological annotation layer: three annotators label five persuasion dimensions (authority, social proof, liking/similarity deception, commitment–integrity–reciprocation, distraction). Per-annotator columns (*_A, *_B, *_C) and justification fields (justif_*) are preserved, alongside consolidated fields for benchmarking and analysis of inter-annotator variability.
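One way to work with the per-annotator columns is sketched below. The description does not say how the consolidated fields were derived, so the majority vote here is an assumption, not the creators' method. The column name `authority_A` and the helpers `majority_label` and `is_unanimous` are likewise hypothetical, following the per-annotator suffix pattern mentioned above.

```python
from collections import Counter

ANNOTATORS = ("A", "B", "C")


def majority_label(row, dimension):
    """Return the value chosen by at least two annotators, else None."""
    votes = [row[f"{dimension}_{a}"] for a in ANNOTATORS]
    value, count = Counter(votes).most_common(1)[0]
    return value if count >= 2 else None


def is_unanimous(row, dimension):
    """True when all three annotators gave the same value."""
    return len({row[f"{dimension}_{a}"] for a in ANNOTATORS}) == 1


# Toy row invented for illustration; column names are assumed.
toy_row = {"authority_A": 1, "authority_B": 1, "authority_C": 0}

print(majority_label(toy_row, "authority"))  # prints 1
print(is_unanimous(toy_row, "authority"))    # prints False
```

Comparing a vote computed this way against the published consolidated fields would show whether they follow a simple majority rule, and the unanimity flag gives a quick view of inter-annotator variability per dimension.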

SpaPhish is a multi-layer resource for Spanish phishing detection, persuasion modeling, and annotation-driven explainability, supporting research that links technical email features to psychologically grounded manipulation strategies.
Date made available: 25 Dec 2025
Publisher: Mendeley Data
