Kazakhstani news corpus for social significance identification with topic modelling results

Published: 18 December 2020 | Version 1 | DOI: 10.17632/hwj24p9gkh.1
Kirill Yakunin,


The presented news corpus consists of 1142735 documents from open Kazakhstani news media and from governmental development programs. The dataset is distributed as a zip archive containing 12 CSV (comma-separated values) files, with the documents split into files of up to 100 000 documents each.

Each document (row) consists of the following fields:
- ID
- Title
- Text
- Source
- URL
- Datetime
- Number of views
- 90 columns with weights of hand-picked topic groups with semantic names (group_economy, group_politics, etc.), normalized to the range from 0 to 1
- 200 columns with topic weights obtained through topic modelling; these columns represent the theta matrix of the topic model

Two additional files accompany the corpus:
- topic-words.json contains words with weights for the 200 topics obtained through topic modelling. It is a compressed representation of the phi matrix.
- topic-expert-labelling-sentiment.json contains expert labelling of topic sentiment. It was used to obtain the results described in the cited article.
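As a rough illustration of working with this layout, the following sketch reads every CSV file in the archive and separates the group and topic weights from the remaining fields. It relies on assumptions that the description does not confirm: the archive path is hypothetical, the files are assumed to be UTF-8 encoded with a header row, and the weight columns are assumed to be recognizable by the prefixes group_ and topic_.

```python
import csv
import io
import zipfile

# Article texts can be longer than the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)


def iter_documents(zip_path):
    """Yield one dict per document (row) from every CSV file in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # Assumption: UTF-8 text with a header row naming the columns.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.DictReader(text)


def split_weights(row):
    """Split a row into metadata, topic group weights, and topic weights."""
    # Assumption: weight columns are identified by their name prefixes.
    groups = {k: float(v) for k, v in row.items() if k.startswith("group_")}
    topics = {k: float(v) for k, v in row.items() if k.startswith("topic_")}
    meta = {k: v for k, v in row.items() if k not in groups and k not in topics}
    return meta, groups, topics


def dominant_topic(topics):
    """Return the name of the topic column with the largest weight."""
    return max(topics, key=topics.get)
```

Reading row by row avoids loading all documents into memory at once; the topic weights of one row correspond to one document's slice of the theta matrix.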


Steps to reproduce

The Python library Scrapy was used to scrape a number of open news websites from Kazakhstan and Russia. For each website, a set of scraping rules was configured in the form of either CSS selectors or regular expressions, which define how to find the text and other metadata of an article inside the HTML code of the web page. The information system in which the scrapers were implemented, and other results related to this corpus, were published in [https://www.mdpi.com/2073-8994/12/12/1945].

The BigARTM topic modelling library was used to obtain the topic weights presented: https://www.researchgate.net/publication/300135972_BigARTM_Open_Source_Library_for_Regularized_Multimodal_Topic_Modeling_of_Large_Collections

The parameters of the model:
- Number of topics: 200
- SmoothSparseThetaRegularizer: 0.15
- SmoothSparsePhiRegularizer: 0.15
- DecorrelatorPhiRegularizer: 0.15
- ImproveCoherencePhiRegularizer: 0.15
- num_collection_passes (fit_offline function parameter): 10

The theta matrix of the topic model is represented by the 200 topic_{i} columns in the corpus. The phi matrix is uploaded separately in the topic-words.json file.
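The listed parameters could be passed to BigARTM roughly as in the sketch below. This is not the authors' code: it assumes that each reported 0.15 is the tau coefficient of the corresponding regularizer, that the texts have already been preprocessed into BigARTM batches, and the function name and the batches_dir argument are hypothetical.

```python
# Values reported in the dataset description.
NUM_TOPICS = 200
NUM_COLLECTION_PASSES = 10

# Assumption: each reported 0.15 is the regularizer's tau coefficient.
REGULARIZER_TAUS = {
    "SmoothSparseThetaRegularizer": 0.15,
    "SmoothSparsePhiRegularizer": 0.15,
    "DecorrelatorPhiRegularizer": 0.15,
    "ImproveCoherencePhiRegularizer": 0.15,
}


def build_and_fit(batches_dir):
    """Fit an ARTM model on pre-built batches and return (theta, phi)."""
    # Imported here so the constants above can be used without BigARTM installed.
    import artm

    # Assumption: batches_dir already contains batches in BigARTM's own format.
    batch_vectorizer = artm.BatchVectorizer(data_path=batches_dir,
                                            data_format="batches")
    dictionary = artm.Dictionary()
    dictionary.gather(data_path=batches_dir)

    # cache_theta=True keeps the theta matrix available after fitting.
    model = artm.ARTM(num_topics=NUM_TOPICS, dictionary=dictionary,
                      cache_theta=True)
    for class_name, tau in REGULARIZER_TAUS.items():
        regularizer_cls = getattr(artm, class_name)
        model.regularizers.add(regularizer_cls(name=class_name, tau=tau))

    model.fit_offline(batch_vectorizer=batch_vectorizer,
                      num_collection_passes=NUM_COLLECTION_PASSES)
    return model.get_theta(), model.get_phi()
```

Random initialization and differences in text preprocessing mean that a model fitted this way is unlikely to reproduce the published topic weights exactly.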


Natural Language Processing, Corpus Linguistics