Steam Games Metadata and Player Reviews (2020–2024)

Published: 30 June 2025 | Version 2 | DOI: 10.17632/jxy85cr3th.2
Contributor: Hisham Abdelqader

Description

This dataset is a structured collection of video game metadata and user reviews from the Steam platform, covering January 2020 to December 2024. It was compiled to support research into how game attributes influence user satisfaction, engagement, and review behavior. The central research hypothesis is that characteristics such as genre, pricing, and supported platforms are closely associated with trends in user sentiment and review volume. Understanding these patterns can inform predictive models of game reception and improve design and marketing strategies for future releases.

To explore this hypothesis, data was gathered in two phases. In the first phase, metadata for all games listed on Steam during the target period was collected using the official Steam API. Each game was identified by its unique AppID and checked for data completeness. The scraper retrieved the game title, release date, genres, supported languages, age restrictions, and pricing information. Unreleased games and games launched before 2020 were excluded. The result is a metadata file, games.json, containing detailed information on 23,107 Steam games released from 2020 onward.

In the second phase, a dedicated script collected user reviews for each game in the metadata file. Games with fewer than 25 reviews were filtered out to avoid bias from insufficient data. For the remaining games, reviews were gathered in all available languages to keep the dataset culturally diverse. Reviews were saved in individual CSV files named using the game’s AppID and the number of reviews the file contains. Each file holds structured rows with fields such as review text, language, rating, and vote counts. This resulted in over 31 million reviews across more than 23,000 games, forming a robust basis for textual and quantitative analysis.

The data reveals several trends. Free-to-play games tend to attract higher review volumes, although not necessarily higher user ratings. Games in genres such as role-playing, simulation, and survival often have longer and more detailed reviews, which suggests deeper user engagement.

By releasing the metadata and reviews together, this dataset offers a multidimensional view of the Steam game landscape from 2020 to 2024. It captures user engagement in digital gaming during and after the COVID-19 period and provides a foundation for future research in user behavior, content personalization, and the evolving dynamics of online platforms.
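
Because each review file carries its AppID and review count in its name, per-game review volumes can be tallied without opening any CSV. The following is a minimal sketch, assuming the AppID_reviewcount.csv pattern given in the reproduction steps (for example 123456_342.csv) and a single folder holding all review files. The function name index_review_files and the folder layout are illustrative assumptions, not part of the published dataset tooling.

```python
import re
from pathlib import Path

# Assumed file name pattern: <AppID>_<reviewcount>.csv, e.g. 123456_342.csv
REVIEW_FILE_RE = re.compile(r"^(\d+)_(\d+)\.csv$")


def index_review_files(folder):
    """Map AppID -> review count using file names only (hypothetical helper)."""
    counts = {}
    for path in Path(folder).glob("*.csv"):
        match = REVIEW_FILE_RE.match(path.name)
        if match:  # skips log files such as scraped_games.csv
            counts[int(match.group(1))] = int(match.group(2))
    return counts
```

Summing the values of the returned dictionary gives the total number of reviews in the folder, and its keys can be matched against the AppIDs in games.json once that file's layout has been inspected.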

Steps to reproduce

The data in this dataset was gathered through a structured, multi-phase scraping workflow designed for transparency, reproducibility, and ethical data collection. It includes game metadata and user reviews for titles released on the Steam platform between January 2020 and December 2024. The process ran between November 2024 and January 2025 and was built on public APIs, open-source tools, and custom Python scripts.

The first phase focused on metadata scraping. A Python script based on Bustos Martin’s open-source project “Steam-Games-Scraper” [1] served as the foundation. This tool uses Steam’s official API (store.steampowered.com/api/appdetails) to retrieve data via each game’s unique AppID. The metadata fields collected were game title, release date, genres, categories, pricing, supported languages, age rating, and audio support. Unreleased games and games released before 2020 were excluded. HTML sanitization and text cleaning were applied using regular expressions to remove tags and encoded characters. The script included retry logic, exception handling, and progress tracking. It was executed in Python 3.10 using the json, requests, os, and re libraries, with BeautifulSoup for nested parsing where required. An autosave routine wrote the collected data to disk every 100 entries. The output was a structured JSON file (games.json) containing metadata for 23,107 Steam games released from 2020 to 2024.

The second phase collected user reviews. It used a script based on Zhu Zhihan’s “steam-review-scraper” package [2] from PyPI, which connects to Steam’s community review endpoints and returns data in a pandas-compatible format. AppIDs from games.json were used to fetch reviews in all available languages to ensure cultural diversity. Only games with 25 or more reviews were retained to ensure statistical relevance. Games below this threshold were logged in discarded_games.csv.
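
The text cleaning and release-date filter from the metadata phase could look roughly like the sketch below. It is not the original script. The function names are hypothetical, and the shape of the release_date field (a dictionary with a coming_soon flag and a free-form date string) is an assumption about the appdetails response rather than something stated here.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")           # any HTML tag
WHITESPACE_RE = re.compile(r"\s+")         # runs of whitespace
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")  # a four-digit year


def clean_text(raw):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    without_tags = TAG_RE.sub(" ", raw)
    decoded = html.unescape(without_tags)
    return WHITESPACE_RE.sub(" ", decoded).strip()


def released_in_window(release_date, first_year=2020, last_year=2024):
    """Return True for released games whose year falls in the window.

    Assumes release_date looks like
    {"coming_soon": False, "date": "12 Mar, 2021"} (assumed shape).
    """
    if release_date.get("coming_soon", False):
        return False  # unreleased games are excluded
    match = YEAR_RE.search(release_date.get("date", ""))
    if match is None:
        return False  # no recognisable year, treat as incomplete
    return first_year <= int(match.group()) <= last_year
```

Extracting only the year avoids depending on a particular date format, which may vary with the language requested from the store. A stricter filter would parse the full date instead.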
Review data was saved in individual CSV files named AppID_reviewcount.csv (e.g., 123456_342.csv). Each file includes review text, rating, language, and helpfulness votes where available. The script made up to five retry attempts per game and logged failures. A separate scraped_games.csv file tracks successfully completed entries to support incremental scraping.

The entire workflow was executed on a Windows system but is compatible with macOS and Unix environments where Python 3.10 and the required packages are installed. Randomized delays between requests helped avoid overloading Steam servers, and all data was collected in accordance with Steam’s public API terms. No personal or private information is included, and all data is publicly accessible.
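
The review-phase rules described above (the 25-review threshold, the file naming scheme, and up to five attempts per game with randomized pauses) can be sketched as follows. This is an illustration under stated assumptions, not the original script: the helper names are hypothetical, the 1 to 3 second delay range is an assumed value because the actual interval is not given, and the pause is applied only between retries here.

```python
import random
import time

MIN_REVIEWS = 25   # threshold stated in the text
MAX_ATTEMPTS = 5   # retry limit stated in the text


def review_filename(app_id, review_count):
    """Build a name such as 123456_342.csv."""
    return f"{app_id}_{review_count}.csv"


def is_retained(review_count, minimum=MIN_REVIEWS):
    """Games below the threshold would go to discarded_games.csv."""
    return review_count >= minimum


def fetch_with_retries(fetch, attempts=MAX_ATTEMPTS,
                       delay_range=(1.0, 3.0), sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    delay_range is an assumed value; sleep is injectable for testing.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose for a sketch
            last_error = exc
            if attempt < attempts:
                sleep(random.uniform(*delay_range))  # randomized pause
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

In use, fetch would be a zero-argument callable that wraps the review download for one AppID. A caller could catch the final RuntimeError to write the failure to a log and move on to the next game.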

Institutions

University of Wollongong

Categories

Computer Game

Licence