Twitter Sentiment Analysis Dataset

Published: 10 August 2025 | Version 1 | DOI: 10.17632/jmbr7xmrw7.1
Contributor:
Jacob Neyole

Description

Basic Information
- Dataset ID: X-CYBER-SENT-2025-v1
- Version: 1.0
- Record Count: 503,456 tweets
- File Name: x_cyber_threat_sentiment_503456.csv
- File Size: ~480 MB
- File Format: CSV (UTF-8 encoded)
- Date Range (created_at): 2024-08-01 to 2025-03-31
- Languages: Primarily English (78%), Spanish (14%), French (5%), Others (3%)
- Collection Period: April 1 – April 5, 2025
- Purpose: Analyze public discourse on cybersecurity threats and sentiment on X
- License: Research Use Only – Non-commercial, Ethical AI Use Encouraged
- Access Level: Restricted (due to platform TOS); intended for internal research
- Contact: @neyole2025
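As a quick sanity check on the basic information above, the stated date range can be verified against the file itself. The sketch below is only illustrative: it assumes that created_at holds ISO 8601 timestamps (the exact format is not stated in this description) and that values without an offset are UTC. The function names are hypothetical, not part of the dataset.

```python
import csv
from datetime import datetime, timezone
from typing import Iterable, Mapping

# Date range stated in the description: 2024-08-01 to 2025-03-31 (UTC).
RANGE_START = datetime(2024, 8, 1, tzinfo=timezone.utc)
RANGE_END_EXCLUSIVE = datetime(2025, 4, 1, tzinfo=timezone.utc)


def parse_created_at(value: str) -> datetime:
    """Parse a timestamp (ISO 8601 is an assumption) and normalize it to UTC."""
    # Older Python versions do not accept a trailing "Z", so rewrite it as an offset.
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: naive values are UTC, as the description says all timestamps are.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def count_out_of_range(rows: Iterable[Mapping[str, str]]) -> int:
    """Count rows whose created_at falls outside the stated date range."""
    return sum(
        1
        for row in rows
        if not RANGE_START <= parse_created_at(row["created_at"]) < RANGE_END_EXCLUSIVE
    )


def check_file(path: str) -> int:
    """Stream the CSV row by row so the ~480 MB file is never loaded at once."""
    with open(path, newline="", encoding="utf-8") as handle:
        return count_out_of_range(csv.DictReader(handle))
```

Calling check_file("x_cyber_threat_sentiment_503456.csv") should return 0 if every row lies within the stated range and the format assumption holds.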

Steps to reproduce

Platform: X.com (formerly Twitter)
Data Type: Publicly available tweets

Query Criteria: Collected using keywords related to cybersecurity threats.
- Keywords: phishing, ransomware, DDoS, data breach, SQL injection, zero day, hack, cyberattack, firewall, SOC, CVE, exploit
- Language filter: en, es, fr
- Date range: August 1, 2024 – March 31, 2025

Collection Method: Web scraping using a Selenium-based crawler with randomized delays to simulate human browsing behavior.
- Headless Chrome browser
- Residential proxy rotation (to avoid IP blocking)
- Rate-limited to 1 request per 2–5 seconds

🔎 Note: X does not allow bulk scraping without API access. This operation was conducted under research exemption principles (inspired by fair use), with strict adherence to:
- No login or session persistence
- No access to private/direct message data
- Respect for robots.txt where possible
- No redistribution of raw tweet IDs or full content beyond research use

🛠️ Data Processing Pipeline

Scraping:
- Tweets collected via keyword search and hashtag tracking
- Metadata extracted: text, engagement stats, user info, timestamps
- No cookies or login sessions used

Cleaning:
- URLs, mentions, and emojis removed for cleaned_text
- Duplicate tweets removed (based on id)
- HTML entities decoded

Enrichment:
- Sentiment Analysis: Applied TextBlob for polarity and subjectivity
- Emotion Inference: Rule-based mapping from sentiment + keywords
- Cyber Threat Tagging:
  - attack_type: Extracted via regex and keyword matching (e.g., /ransomware/i)
  - delivery_method: Mapped from known attack patterns
  - context_target: Identified from phrases like “database breached”, “firewall down”

Validation:
- 5% manual review for labeling accuracy
- Sentiment labels cross-checked with human annotators (Kappa = 0.78)

🗓️ Temporal Coverage
- Tweets Published: August 1, 2024 – March 31, 2025
- Data Collected: April 1 – April 5, 2025
- All timestamps in UTC

🎯 Intended Use

This dataset supports:
- Cyber Threat Intelligence (CTI): Early detection of attack chatter
- Sentiment Monitoring: Public reaction to cyber incidents
- NLP Model Training: For classification, named entity recognition (NER), or summarization
- Academic Research: On disinformation, crisis communication, or infosec trends

⚠️ Ethical & Legal Considerations
- Platform ToS Compliance: Scraping violates X’s ToS. Dataset is for educational and research use only; not for commercial redistribution.
- User Privacy: No PII extracted beyond public handles. Users can request removal via contact.
- Bias: Over-represents English, tech-savvy, and verified accounts. Under-represents non-Western voices.
- Misuse Risk: Could be used to profile individuals. Access restricted to vetted researchers.
- Data Freshness: Static snapshot; does not reflect real-time dynamics.
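The cleaning, attack_type tagging, and validation steps of the data processing pipeline described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the contributor's actual code. The regular expressions, the emoji ranges, the label names in ATTACK_PATTERNS, and all function names are assumptions made for the example. The reported Kappa is assumed here to be Cohen's kappa between two label sequences.

```python
import html
import re
from collections import Counter
from typing import Iterable, Iterator, Mapping, Optional, Sequence

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji coverage (pictographs, symbols, flags, joiners); not exhaustive.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)

# Hypothetical label -> pattern table built from the keyword list above.
ATTACK_PATTERNS = {
    "ransomware": re.compile(r"ransomware", re.IGNORECASE),
    "phishing": re.compile(r"phish(?:ing)?", re.IGNORECASE),
    "ddos": re.compile(r"\bddos\b", re.IGNORECASE),
    "sql_injection": re.compile(r"sql\s*injection", re.IGNORECASE),
    "zero_day": re.compile(r"zero[\s-]?day", re.IGNORECASE),
    "data_breach": re.compile(r"data\s+breach", re.IGNORECASE),
}


def clean_text(raw: str) -> str:
    """Decode HTML entities, then strip URLs, mentions, and emojis."""
    text = html.unescape(raw)
    for pattern in (URL_RE, MENTION_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left behind


def dedupe_by_id(rows: Iterable[Mapping[str, str]]) -> Iterator[Mapping[str, str]]:
    """Yield only the first row seen for each tweet id."""
    seen = set()
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            yield row


def tag_attack_type(text: str) -> Optional[str]:
    """Return the first matching attack label, or None if nothing matches."""
    for label, pattern in ATTACK_PATTERNS.items():
        if pattern.search(text):
            return label
    return None


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

In this sketch, entities are decoded before the other filters so that encoded characters are normalized first. tag_attack_type returns only the first match, so a tweet mentioning several threats would need a multi-label variant.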

Institutions

Umma University

Categories

Cybersecurity

Licence