CryptoVision: A Comprehensive Dataset for Crypto News and Trend Prediction

Published: 29 September 2025 | Version 1 | DOI: 10.17632/wvjjxr8bxx.1
Contributors:

Description

This dataset provides a large-scale collection of 188,431 cryptocurrency news articles published between 2017 and 2025. The articles were gathered from reputable sources, including BlockWorks, Coindesk, Cointelegraph, CryptoPanic, CryptoNews, and Decrypt, using Python-based web scraping pipelines. Each entry contains metadata such as the article title, full-text content, publication source, URL, sentiment label, and the percentage change in the corresponding cryptocurrency's price around the publication time.

Sentiment labels were generated automatically using fine-tuned transformer models (BERT and DistilBERT), with additional manual checks for validation. Articles are categorized into three sentiment classes: positive, negative, and neutral.

This dataset is intended for machine learning and financial research, including:

1. Training and evaluating sentiment analysis models tailored to financial text.
2. Developing cryptocurrency price prediction models by linking sentiment with historical price and volume data.
3. Investigating the correlation between news sentiment and market trends across multiple cryptocurrencies.
4. Exploring the potential of sentiment-driven trading strategies in highly volatile markets.

The dataset's scale, diversity of sources, and multi-year coverage (2017 to 2025) make it a useful resource for researchers in computer science, finance, and data science, especially those working on natural language processing (NLP), predictive modeling, and market forecasting.
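As a starting point for the third use case (relating news sentiment to market movement), the following minimal sketch averages the open-to-close movement per sentiment class. It assumes the column names Sentiment_Label and Movement_OpenClose_% listed under "Steps to reproduce". The function name and the sample rows are hypothetical and purely illustrative; they are not values taken from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Toy rows for illustration only; these are NOT values from the dataset.
SAMPLE_CSV = """Sentiment_Label,Movement_OpenClose_%
positive,1.2
positive,0.4
negative,-0.9
neutral,0.1
negative,-0.3
"""


def mean_movement_by_sentiment(csv_file):
    """Return the mean Movement_OpenClose_% for each sentiment label.

    csv_file is any open text file (or file-like object) whose header
    contains the two column names used below.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["Sentiment_Label"]].append(float(row["Movement_OpenClose_%"]))
    return {label: mean(values) for label, values in groups.items()}


summary = mean_movement_by_sentiment(io.StringIO(SAMPLE_CSV))
for label, avg in sorted(summary.items()):
    print(f"{label}: {avg:+.2f}%")
```

To run this against the published CSV, pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded file.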

Files

Steps to reproduce

1. Data Collection
Cryptocurrency news articles (2017–2025) were collected from BlockWorks, Coindesk, Cointelegraph, CryptoPanic, Decrypt, and CryptoNews using Python-based web scraping (BeautifulSoup, Requests, Selenium). Extracted metadata included: URL, Title, Description, Full_Text, Date_Time, and Coin_Type.

2. Data Preprocessing
The text was cleaned and standardized to create Filtered_Text by removing HTML tags, duplicates, and special symbols. Publication date and time formats were also standardized.

3. Sentiment Annotation
Fine-tuned BERT and DistilBERT models were applied to classify each article as positive, negative, or neutral. A Sentiment_Score was generated to represent model confidence. Manual validation was performed on a sample set to ensure quality.

4. Price and Market Data Integration
Historical cryptocurrency market data (Open, High, Low, Close, Volume) were retrieved from APIs such as CoinMarketCap or Binance. Market movement metrics were then calculated, including:
  • Movement_OpenClose_% = percentage change between open and close prices.
  • Movement_HighLow_% = percentage difference between high and low prices.
  • Market_Move = categorical direction of market movement (upward, downward, stable).

5. Final Dataset Preparation
Article metadata, sentiment labels, and financial indicators were combined into a structured dataset with the following columns: URL, Title, Description, Full_Text, Date_Time, Coin_Type, Filtered_Text, Sentiment_Label, Sentiment_Score, Open, High, Low, Close, Volume, Movement_OpenClose_%, Movement_HighLow_%, Market_Move. The dataset was stored in CSV format and made publicly accessible via Mendeley Data with documentation for reuse and reproducibility.
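The market movement metrics in step 4 can be sketched as follows. This is one plausible reading of the definitions, not necessarily the exact calculation used to build the dataset. Three points are assumptions: the open-to-close change is taken relative to the open price, the high-to-low difference is taken relative to the low price, and a move is called "stable" when the open-to-close change stays within a band of ±0.5%. The function name and the `stable_band` parameter are hypothetical.

```python
def movement_metrics(open_price, high, low, close, stable_band=0.5):
    """Compute the three movement fields for one OHLC record.

    Assumptions (not specified in the dataset description):
      - Movement_OpenClose_% is measured relative to the open price.
      - Movement_HighLow_% is measured relative to the low price.
      - stable_band is a hypothetical threshold, in percent, below which
        the open-to-close change is labeled "stable".
    """
    if open_price <= 0 or low <= 0:
        raise ValueError("open and low prices must be positive")

    open_close_pct = (close - open_price) / open_price * 100.0
    high_low_pct = (high - low) / low * 100.0

    if open_close_pct > stable_band:
        market_move = "upward"
    elif open_close_pct < -stable_band:
        market_move = "downward"
    else:
        market_move = "stable"

    return {
        "Movement_OpenClose_%": open_close_pct,
        "Movement_HighLow_%": high_low_pct,
        "Market_Move": market_move,
    }


# Illustrative prices only, not taken from the dataset.
example = movement_metrics(open_price=100.0, high=110.0, low=95.0, close=105.0)
print(example)
```

Users who need to match the published Movement_HighLow_% and Market_Move values exactly should compare this sketch against a few rows of the CSV and adjust the denominator or threshold accordingly.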

Institutions

  • Leading University

Categories

Computer Science, Machine Learning, Computer Engineering

Licence