CryptoVision: A Comprehensive Dataset for Crypto News and Trend Prediction

Published: 7 November 2025 | Version 2 | DOI: 10.17632/3c3xtxtfb6.2
Contributors: Anupam Talukder Rajib

Description

This dataset contains 188,430 cryptocurrency news records published between 2017 and 2025. The data were gathered from six established news sources (BlockWorks, CoinDesk, CoinTelegraph, CryptoPanic, CryptoNews, and Decrypt) using Python-based automated collection pipelines. Each record holds structured metadata about one cryptocurrency-related news article: the title, publication date and time, cryptocurrency type, and source URL. No copyrighted article text is included; only publicly accessible metadata are shared, to ensure copyright compliance. To align market behavior with news events, each article is matched with a 15-minute Binance OHLCV candle, allowing researchers to study short-term price reactions following news releases. Each record consists of the following columns: URL, Title, Date_Time, Coin_Type, Sentiment_Label, Sentiment_Score, Open, High, Low, Close, Volume, Movement_OpenClose_%, Movement_HighLow_%, and Market_Move. The dataset supports sentiment analysis, market impact assessment, and event-driven modeling in cryptocurrency research by linking news sentiment to short-term price movements.
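As a minimal sketch of how the column layout listed above might be checked when loading the data, the following uses only the Python standard library. The function name `read_records`, the choice of which columns are numeric, and the sample row are assumptions made for illustration; the sample values are invented placeholders and are not taken from the dataset.

```python
import csv
import io

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "URL", "Title", "Date_Time", "Coin_Type",
    "Sentiment_Label", "Sentiment_Score",
    "Open", "High", "Low", "Close", "Volume",
    "Movement_OpenClose_%", "Movement_HighLow_%", "Market_Move",
]

# Assumption: these columns hold plain decimal numbers in the CSV.
NUMERIC_COLUMNS = [
    "Sentiment_Score", "Open", "High", "Low", "Close", "Volume",
    "Movement_OpenClose_%", "Movement_HighLow_%",
]


def read_records(handle):
    """Read rows from an open CSV handle, checking the header first."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    records = []
    for row in reader:
        for name in NUMERIC_COLUMNS:
            # Empty cells become None instead of raising on float("").
            row[name] = float(row[name]) if row[name] else None
        records.append(row)
    return records


# Invented placeholder row, NOT taken from the dataset.
SAMPLE_CSV = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    "https://example.com/a,Example headline,2024-01-01 10:37:00,BTC,"
    "positive,0.9,100,110,95,105,1234.5,5.0,15.79,upward\n"
)

records = read_records(io.StringIO(SAMPLE_CSV))
print(records[0]["Coin_Type"], records[0]["Close"])
```

Date_Time is left as a string here because the page does not state its format; parsing it would require inspecting the published file first.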

Steps to reproduce

News Metadata Collection: Cryptocurrency news articles from BlockWorks, CoinDesk, CoinTelegraph, CryptoPanic, CryptoNews, and Decrypt (2017–2025) were collected using Python libraries such as BeautifulSoup, Requests, and Selenium. The extracted metadata were URL, Title, Date_Time, and Coin_Type. Only metadata were retained in the published dataset, to comply with copyright rules.

Sentiment Analysis: The full text of each article (used locally for processing only) was analyzed with fine-tuned BERT and DistilBERT models to generate Sentiment_Label and Sentiment_Score. Random samples were manually validated to check annotation quality. Note: the full text is not included in the published dataset.

Market Data Integration: Historical cryptocurrency market data (Open, High, Low, Close, Volume) were retrieved solely from the Binance API. Data were aggregated into 15-minute candles aligned with the publication time of each article. The derived metrics Movement_OpenClose_%, Movement_HighLow_%, and Market_Move (upward, downward, stable) were calculated to reflect short-term market movements.

Data Merging and Standardization: Metadata, sentiment labels, and Binance candle data were merged into a clean, structured CSV file containing the following columns: URL, Title, Date_Time, Coin_Type, Sentiment_Label, Sentiment_Score, Open, High, Low, Close, Volume, Movement_OpenClose_%, Movement_HighLow_%, Market_Move.
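The market data step can be illustrated with a short sketch. The page does not give the exact formulas or the alignment rule, so everything below is an assumption: the article is assigned to the 15-minute candle that contains its publication time, Movement_OpenClose_% is taken relative to the open price, Movement_HighLow_% relative to the low price, and Market_Move is "stable" inside a hypothetical band of ±0.1%. The function names and the `stable_band_pct` parameter are illustrative only.

```python
from datetime import datetime

CANDLE_MINUTES = 15  # candle timespan stated in the text


def candle_open_time(published):
    """Floor a timestamp to the start of its 15-minute candle.

    Assumption: an article is matched to the candle containing its
    publication time; the authors may have used a different rule.
    """
    floored = published.minute - published.minute % CANDLE_MINUTES
    return published.replace(minute=floored, second=0, microsecond=0)


def derive_metrics(open_price, high, low, close, stable_band_pct=0.1):
    """Compute the three derived columns from one OHLC candle.

    The denominators and the stable band are assumptions, not the
    authors' documented definitions.
    """
    move_open_close = (close - open_price) / open_price * 100
    move_high_low = (high - low) / low * 100
    if move_open_close > stable_band_pct:
        market_move = "upward"
    elif move_open_close < -stable_band_pct:
        market_move = "downward"
    else:
        market_move = "stable"
    return {
        "Movement_OpenClose_%": move_open_close,
        "Movement_HighLow_%": move_high_low,
        "Market_Move": market_move,
    }


# Invented example values, not taken from the dataset.
print(candle_open_time(datetime(2024, 1, 1, 10, 37, 12)))
print(derive_metrics(100.0, 110.0, 95.0, 105.0))
```

Anyone reproducing the published columns should compare such a calculation against a few rows of the CSV to confirm which denominator and threshold were actually used.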

Institutions

  • Leading University Department of Computer Science and Engineering

Categories

Computer Science, Machine Learning, Sentiment Analysis, Cryptocurrency

Licence