Kazakhstani and Russian news corpus

Published: 18 December 2020 | Version 1 | DOI: 10.17632/2vz7vtbhn2.1
Kirill Yakunin


The presented news corpus consists of 6,261,953 documents from open Kazakhstani and Russian news media. The dataset is distributed as a zip archive containing 26 CSV (comma-separated values) files, each holding up to 250,000 documents. Each document (row) consists of the following fields: ID Title Text Source URL Datetime Number of views
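Because the corpus is spread over many CSV files inside one archive, it can be convenient to stream it row by row rather than unpack and load everything at once. The following is a minimal sketch using only the Python standard library. The function name `iter_documents` is hypothetical, and the sketch assumes that the files are UTF-8 encoded and that each one starts with a header row naming the fields; neither is confirmed by the description above.

```python
import csv
import io
import zipfile
from typing import Dict, Iterator


def iter_documents(archive_path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per document (row) from every CSV file in the archive."""
    # Article texts can be long, so raise the csv module's per-field size limit.
    csv.field_size_limit(2**31 - 1)
    with zipfile.ZipFile(archive_path) as archive:
        for name in sorted(archive.namelist()):
            # Skip anything in the archive that is not a CSV file.
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # Assumption: UTF-8 text with a header row naming the fields.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.DictReader(text)
```

With this helper, an expression such as `sum(1 for _ in iter_documents(path))` would count the documents without holding the whole corpus in memory, where `path` points to the downloaded archive.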


Steps to reproduce

The Python library Scrapy was used to scrape a number of open news websites from Kazakhstan and Russia. For each website, a set of scraping rules was configured in the form of either CSS selectors or regular expressions, which define how to find the text and other metadata of an article inside the HTML code of the web page. The information system in which the scrapers were implemented, and other results related to this corpus, were published in several papers, including: https://www.mdpi.com/2073-8994/12/12/1945 https://cyberleninka.ru/article/n/proektirovanie-struktury-programmnoy-sistemy-obrabotki-korpusov-tekstovyh-dokumentov
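The idea of per-site scraping rules can be illustrated with a small sketch of the regular-expression variant. This is not the authors' configuration: it uses only the Python standard library instead of Scrapy, and the rule set `EXAMPLE_RULES`, the function `extract_fields`, and the HTML structure the patterns expect are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical rule set for one site: field name -> regex with one capture group.
EXAMPLE_RULES: Dict[str, str] = {
    "title": r"<h1[^>]*>(.*?)</h1>",
    "datetime": r'<time[^>]*datetime="([^"]+)"',
    "text": r'<div class="article-body">(.*?)</div>',
}

# Matches any HTML tag, used to strip markup from captured fragments.
TAG_RE = re.compile(r"<[^>]+>")


def extract_fields(html: str, rules: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply each rule to the page; fields whose rule does not match are None."""
    result: Dict[str, Optional[str]] = {}
    for field, pattern in rules.items():
        match = re.search(pattern, html, flags=re.DOTALL | re.IGNORECASE)
        if match is None:
            result[field] = None
            continue
        # Replace nested tags with spaces, then collapse runs of whitespace.
        value = TAG_RE.sub(" ", match.group(1))
        result[field] = " ".join(value.split())
    return result
```

Keeping the rules as plain data, separate from the extraction code, means that supporting another website only requires a new rule set. In Scrapy itself, comparable rules would typically be applied through the response's `css()` method or a selector's `re()` method.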


Natural Language Processing, Corpus Linguistics