A Dataset for Named Entity Recognition in the Sports Domain

Published: 19 January 2026 | Version 2 | DOI: 10.17632/rcf4kbxtf8.2
Contributors:

Description

This dataset explores whether sports-specific text needs a dedicated Named Entity Recognition (NER) resource rather than relying on general-purpose datasets. Sports articles often mention entities such as players, teams, tournaments, match times, equipment, and penalties, which standard NER label sets usually do not capture well. The dataset was created on the premise that a focused, domain-aware resource can help models better understand sports-related language.

The data shows that real-world sports text contains a rich mix of entity types, often appearing together in the same sentence. For example, a single sentence may mention a player, the team they represent, the tournament they are playing in, and the date or time of the match. Keeping the text in its raw form preserves these natural patterns and reflects how sports news is actually written and consumed. One noticeable pattern is how frequently time and date expressions are used to describe sports events, alongside rules or penalties that explain match situations. These patterns show why temporal and rule-based entities matter for understanding sports narratives, and they were therefore included in the dataset.

The dataset is organized as a simple token–label structure, making it easy to use with common NER models. Since no preprocessing was applied, users are free to experiment with their own tokenization or cleaning methods based on their research needs. This flexibility makes the dataset suitable for a wide range of experiments, from traditional sequence models to modern transformer-based approaches.

Overall, the dataset is intended as a practical and realistic resource for anyone working on sports-related text analysis. It can support applications such as automatic match summaries, event timelines, sports chatbots, and content classification systems.
By focusing on real sports language and meaningful entity types, the dataset aims to make sports text easier for machines to understand and work with.
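
As a rough illustration of how a token–label structure can be consumed, the Python sketch below groups token/label rows into sentences and lists the entity types that occur together in each one. The on-disk layout is an assumption, not something this record specifies: the sketch supposes one token and one label per line separated by a tab, a blank line between sentences, and an "O" label for non-entity tokens. The short tags (PLAYER, TEAM, TIME, RULE, CST) and the sample rows are invented for the example and are not taken from the dataset.

```python
from typing import Iterable, List, Set, Tuple

# One sentence = list of (token, label) pairs.
Sentence = List[Tuple[str, str]]


def read_token_label_rows(lines: Iterable[str], sep: str = "\t") -> List[Sentence]:
    """Group token/label rows into sentences.

    Assumed layout (hypothetical): "token<sep>label" per line,
    with a blank line marking a sentence boundary.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last separator so the label is always the final field.
        # Raises ValueError if the separator is missing from the line.
        token, label = line.rsplit(sep, 1)
        current.append((token, label.strip()))
    if current:
        sentences.append(current)
    return sentences


def entity_types_per_sentence(
    sentences: List[Sentence], outside: str = "O"
) -> List[Set[str]]:
    """Return the set of entity labels used in each sentence."""
    return [{label for _, label in sent if label != outside} for sent in sentences]


# Invented toy rows, only to exercise the functions above.
sample_lines = [
    "Messi\tPLAYER",
    "scored\tO",
    "for\tO",
    "Argentina\tTEAM",
    "on\tO",
    "Sunday\tTIME",
    "",
    "The\tO",
    "referee\tCST",
    "showed\tO",
    "a\tO",
    "red\tRULE",
    "card\tRULE",
]

sentences = read_token_label_rows(sample_lines)
types_by_sentence = entity_types_per_sentence(sentences)
```

If the actual file uses a different delimiter or column order, only the `sep` argument and the unpacking line would need to change.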

Steps to reproduce

The dataset was created for Named Entity Recognition (NER) in the sports domain. Text data was collected from publicly available sports-related online sources, including sports news articles, match reports, blogs, and commentary pages. Only open-access content was used, and no private, personal, or copyrighted data was intentionally included.

The collected sentences were kept in their raw textual form. No preprocessing steps such as tokenization, normalization, stemming, lemmatization, or noise removal were applied before annotation. This approach allows researchers to apply their own preprocessing pipelines and ensures flexibility for different experimental settings.

After collection, the raw sports sentences were manually annotated with domain-specific entity labels. The annotation schema consists of eight entity types relevant to sports contexts: Player Name, Team Name, Tournament Name, Location, Equipment, Rules or Penalty, Time & Date, and Common Sports Term (CST). Annotation followed consistent labeling guidelines to ensure uniformity across the dataset. The annotations were applied directly to the raw text using a token–label structure, making the dataset suitable for supervised NER models such as BiLSTM-CRF and transformer-based architectures (e.g., mBERT).

The dataset was stored in a table-based text format, enabling easy reuse and reproducibility. Researchers can reproduce the dataset creation process by collecting similar sports-domain text from public sources and applying the same entity definitions and annotation protocol.
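
To make the eight-type schema concrete for supervised training, the sketch below maps the entity types to integer ids and spreads word-level labels over subword pieces, as a transformer tokenizer would require. Several things here are assumptions rather than details from this record: the extra "O" (outside) label, the use of the full type names as label strings, and the convention of labeling only the first subword of each word while masking the rest with -100 (the default `ignore_index` of PyTorch's cross-entropy loss). It is one common convention, not necessarily the one the dataset authors used.

```python
from typing import Dict, List, Sequence, Tuple

# The eight entity types named in the annotation schema.
ENTITY_TYPES = [
    "Player Name",
    "Team Name",
    "Tournament Name",
    "Location",
    "Equipment",
    "Rules or Penalty",
    "Time & Date",
    "Common Sports Term",
]

OUTSIDE = "O"  # assumption: label for tokens that belong to no entity


def build_label_maps(
    entity_types: Sequence[str], outside: str = OUTSIDE
) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Create label-to-id and id-to-label mappings, with the outside label at id 0."""
    labels = [outside] + list(entity_types)
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


def encode_labels(labels: Sequence[str], label2id: Dict[str, int]) -> List[int]:
    """Convert label strings to integer ids; raises KeyError on unknown labels."""
    return [label2id[label] for label in labels]


def align_to_subwords(
    label_ids: Sequence[int],
    pieces_per_word: Sequence[int],
    ignore_index: int = -100,
) -> List[int]:
    """Expand word-level ids to subword level.

    The first piece of each word keeps the word's label id; the remaining
    pieces receive `ignore_index` so a loss function can skip them.
    """
    if len(label_ids) != len(pieces_per_word):
        raise ValueError("label_ids and pieces_per_word must have equal length")
    aligned: List[int] = []
    for label_id, n_pieces in zip(label_ids, pieces_per_word):
        if n_pieces < 1:
            raise ValueError("every word must map to at least one piece")
        aligned.append(label_id)
        aligned.extend([ignore_index] * (n_pieces - 1))
    return aligned


label2id, id2label = build_label_maps(ENTITY_TYPES)

# Three words tagged at word level; the piece counts are hypothetical
# and stand in for the output of whatever tokenizer is chosen.
word_ids = encode_labels(["Player Name", "O", "Team Name"], label2id)
subword_ids = align_to_subwords(word_ids, pieces_per_word=[2, 1, 3])
```

Because the dataset ships without preprocessing, the piece counts would come from the user's own tokenizer; a BiLSTM-CRF operating on whole words can use the output of `encode_labels` directly and skip the alignment step.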

Institutions

  • Bangabandhu Sheikh Mujibur Rahman Digital University

Categories

Artificial Intelligence, Natural Language Processing, Machine Learning

Licence