WhatsApp Anonymized Privacy-focused Interactions Dataset (WAPI dataset)
Description
1. Summary

This dataset contains processed and fully anonymized metadata from WhatsApp chat histories. It is designed for researchers in fields such as Computational Linguistics, Social Network Analysis (SNA), and Human-Computer Interaction (HCI). Unlike raw chat logs, it preserves user privacy by removing all message content and personally identifiable information (PII) and replacing them with structural descriptors (e.g., message length ranges, emoji arrays) and cryptographic hashes.

2. Methodology

The raw data was processed with a custom Python pipeline built around privacy by design. The transformation includes:

- Temporal Anonymization: Timestamps were shifted by a random noise offset (-4 to +5 seconds) and converted into relative_time_seconds, masking actual dates and times while preserving the cadence of interaction.
- Identity Masking: Senders and mentioned phone numbers were transformed into 16-character SHA-256 hashes, using a unique salt for each file.
- Content Abstraction: Textual content was discarded. In its place, the dataset provides binned message lengths (ranges), punctuation markers (interrogative/exclamatory), and lists of extracted emojis.
- Conversation Clustering: An algorithm based on the interquartile range (IQR) segmented messages into "sessions" (conversation_id values), identifying natural breaks in communication from temporal gaps.

3. Data Description

The data is provided in CSV format (semicolon-separated). Each file represents a specific group or chat and contains the following columns:

| Column | Description |
| --- | --- |
| id | Sequential message identifier within the file. |
| conversation_id | Cluster ID representing a continuous session of interaction. |
| num_characters | Binned message length (e.g., "1-10", "11-50", "500+"). |
| relative_time_seconds | Seconds elapsed since the start of the observation period. |
| message_type | Categorization of the entry (e.g., text, audio, image, sticker, system). |
| responds_to_id | The ID of the parent message in a reply thread (if applicable). |
| array_emojis | List of unique emojis present in the original message. |
| interrogative | Boolean; indicates the presence of question marks. |
| exclamatory | Boolean; indicates the presence of exclamation marks. |
| sender_hash | Salted SHA-256 hash of the message sender. |
| mentioned_phones_hash | List of hashes for phone numbers mentioned via "@" in the text. |

4. Potential Research Applications

- Interaction Dynamics: Analyzing response times and turn-taking patterns in digital communication.
- Non-Verbal Communication: Studying the usage and frequency of emojis across different conversation types.
- Network Topology: Mapping the flow of information through reply threads (responds_to_id) and mentions.
- Behavioral Modeling: Detecting session-based patterns and communication bursts.
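
As a starting point for the interaction-dynamics use case, the following minimal sketch reads a semicolon-separated file and measures how long each reply took relative to its parent message. The sample rows are made up for illustration and are not taken from the dataset. The function name reply_latencies is hypothetical, and the sketch assumes that responds_to_id is left empty for messages that are not replies. The array_emojis and mentioned_phones_hash columns are omitted from the sample because their exact serialization is not specified here.

```python
import csv
import io

# Made-up sample rows that follow the column layout described above;
# they are NOT taken from the dataset.
SAMPLE = """id;conversation_id;num_characters;relative_time_seconds;message_type;responds_to_id;interrogative;exclamatory;sender_hash
1;0;11-50;0;text;;True;False;a1b2c3d4e5f60718
2;0;1-10;42;text;1;False;False;0f1e2d3c4b5a6978
3;0;11-50;97;text;;False;True;a1b2c3d4e5f60718
4;1;1-10;7300;text;3;False;False;0f1e2d3c4b5a6978
"""


def reply_latencies(csv_file):
    """Return {reply_id: seconds between a reply and the message it answers}."""
    rows = list(csv.DictReader(csv_file, delimiter=";"))
    # Index every message's timestamp by its id so replies can look up parents.
    sent_at = {row["id"]: float(row["relative_time_seconds"]) for row in rows}
    latencies = {}
    for row in rows:
        parent = row["responds_to_id"]
        # Assumption: an empty field means the message is not a reply.
        if parent and parent in sent_at:
            latencies[row["id"]] = sent_at[row["id"]] - sent_at[parent]
    return latencies


latencies = reply_latencies(io.StringIO(SAMPLE))
print(latencies)  # {'2': 42.0, '4': 7203.0}
```

To process a real file, pass an open file handle instead of the in-memory sample. Because each timestamp carries a random offset of a few seconds, the resulting latencies are approximate at that scale.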
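
The identity-masking step described under Methodology can be illustrated with a short sketch. This is not the pipeline's actual code: the function name mask_identity, the 16-byte salt, the salt-then-identifier concatenation order, and the choice to keep the first 16 hexadecimal characters of the digest are all assumptions made for illustration.

```python
import hashlib
import secrets


def mask_identity(identifier: str, salt: bytes, length: int = 16) -> str:
    """Return the first `length` hex characters of SHA-256(salt + identifier)."""
    digest = hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()
    # Assumption: the 16-character value is a truncated hex digest.
    return digest[:length]


# Assumption: one fresh random salt is drawn per file and then discarded.
file_salt = secrets.token_bytes(16)

# "participant-A" is a placeholder, not a real sender name or phone number.
masked = mask_identity("participant-A", file_salt)
print(masked, len(masked))
```

Within one file the same identifier always maps to the same hash, which keeps sender_hash and mentioned_phones_hash consistent with each other. Since the salt is unique to each file, the same person would presumably receive unrelated hashes in different files, so participants cannot be linked across chats.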
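
The description does not spell out the IQR-based clustering rule, so the sketch below shows one common variant rather than the authors' algorithm: a gap between consecutive messages starts a new session when it exceeds Tukey's upper fence, Q3 + 1.5 × IQR, computed over all gaps in the file. The function name assign_conversation_ids, the multiplier k = 1.5, and the sample timestamps are assumptions.

```python
import statistics


def assign_conversation_ids(times, k=1.5):
    """Label each timestamp with a session id, splitting on unusually long gaps.

    A gap counts as a break when it exceeds Q3 + k * IQR of all gaps
    (Tukey's upper fence; an assumed rule, not a confirmed one).
    """
    if len(times) < 3:
        # Fewer than two gaps: quartiles cannot be estimated, use one session.
        return [0] * len(times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    q1, _, q3 = statistics.quantiles(gaps, n=4)
    threshold = q3 + k * (q3 - q1)
    ids, current = [0], 0
    for gap in gaps:
        if gap > threshold:
            current += 1  # a long silence opens a new session
        ids.append(current)
    return ids


# Made-up relative_time_seconds values: two bursts separated by about an hour.
times = [0, 20, 45, 60, 3600, 3615, 3640, 3650]
sessions = assign_conversation_ids(times)
print(sessions)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

In this example the gaps have Q1 = 15 and Q3 = 25, so the threshold is 40 seconds and only the 3540-second pause triggers a new session. The threshold adapts to each chat's own rhythm, which is the appeal of a quartile-based rule over a fixed timeout.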