MapIntel Case Study Dataset

Published: 18 July 2023 | Version 1 | DOI: 10.17632/7nn6h86snn.1
David Silva


Daily news articles from multiple international sources, collected using NewsAPI between October 2020 and June 2021. The dataset contains 334,925 documents in JSON format. The direct results from the API are cleaned: we ensure that each document is unique, is written in English, and doesn't contain any HTML tags or strange patterns.

Each record is a dictionary with the following keys:
- "text": Cleaned content of the news article (concatenation of "title", "description", and "content" received from the API request; "content" is truncated to 200 characters).
- "title": The headline or title of the article.
- "url": The direct URL to the article.
- "timestamp": The date and time that the article was published, in UTC (+000), formatted as "%Y-%m-%dT%H:%M:%SZ".
- "snippet": Excerpt of the document displayed in the user interface of MapIntel.
- "image_url": The URL to a relevant image for the article.
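
The record layout described above can be checked with a short Python sketch. It is only an illustration under stated assumptions: the sample record is made up (it is not taken from the dataset), the helper name parse_record is hypothetical, and how the records are arranged inside the published JSON file is not stated here, so the sketch works on a single record serialized as a JSON string.

```python
import json
from datetime import datetime, timezone

# Format string quoted in the dataset description.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Keys listed in the dataset description.
EXPECTED_KEYS = {"text", "title", "url", "timestamp", "snippet", "image_url"}


def parse_record(raw: str) -> dict:
    """Decode one JSON record, check its keys, and parse its timestamp.

    Hypothetical helper: the name and behavior are assumptions made
    for this example, not part of the dataset or of MapIntel.
    """
    record = json.loads(raw)
    missing = EXPECTED_KEYS - record.keys()
    if missing:
        raise ValueError(f"record is missing keys: {sorted(missing)}")
    # The description says timestamps are in UTC, so attach that zone.
    naive = datetime.strptime(record["timestamp"], TIMESTAMP_FORMAT)
    record["timestamp"] = naive.replace(tzinfo=timezone.utc)
    return record


# Made-up sample record, used only to exercise the helper.
sample = json.dumps({
    "text": "Example headline Example description Example content",
    "title": "Example headline",
    "url": "https://example.com/article",
    "timestamp": "2021-01-15T08:30:00Z",
    "snippet": "Example description",
    "image_url": "https://example.com/image.jpg",
})

parsed = parse_record(sample)
print(parsed["title"], parsed["timestamp"].isoformat())
```

Parsing the timestamp into a timezone-aware datetime makes it straightforward to filter or group articles by publication date without re-parsing strings.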



Universidade Nova de Lisboa Instituto Superior de Estatistica e Gestao de Informacao


Information Retrieval, Natural Language Processing, Machine Learning


Fundação para a Ciência e a Tecnologia