Thematic Modelling of News Archive from University Website

Name: Thematic Modelling of News Archive from University Website
Creator: Sargey Papshev
Published: 2023-04-11T06:53:42.770Z
Keywords: Machine Learning, Data Modelling, Text Processing, Media Sociology

Papshev, Sargey

doi:10.17632/ckwcdwz6bg.3

Published: 11 April 2023 | Version 3 | DOI: 10.17632/ckwcdwz6bg.3

Contributor:

Sargey Papshev

Description

This dataset contains a set of files that support and illustrate the successive steps of thematic modeling of newsline texts, together with data for further investigation. The file "etalon export_file.csv" contains 2000 Russian-language news records, which are part of the archive of the university website sstu.ru. Each record has a numerical record identifier, a news headline, and the URL code of the news item. The Excel file "News_tokens.xlsx" contains the tokens extracted from the news records after text processing, which consisted of the usual steps: tag removal, text cleaning, and word stemming. Its columns are "Record number", "Identifier of news", "Head of news", "Number of words in the news", "Number of unique tokens in news", and "List of news tokens".

Thematic modeling was then performed on the news in Vowpal Wabbit format, based on the probabilistic distribution of keywords in the text. The results of thematic modeling via the BigARTM platform (https://github.com/bigartm/bigartm, doi:10.5281/zenodo.288960) are presented in the files "phi_matrix.xlsx" and "theta_matrix.xlsx". After a series of experiments, six topics were identified. Each of the 2000 rows of "Theta_matrix" gives the probabilities that the corresponding news document belongs to each of the six topics. "Phi_matrix" gives the per-topic probabilities of the tokens. Based on the most frequent tokens (keywords) of each topic, the topics were named as follows: "Events", "Holidays", "Science and Innovations", "Educational and Scientific Activities", "Student Competitions", "Admission Campaign". To determine the most significant topics for each news item, the probability values in "Theta_matrix" greater than a specially calculated bound were rounded to 1; that is, the distribution of news over topics was defuzzified. The Venn diagram in the file "Venn_diagram.png" illustrates which news documents belong to which topics.
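The defuzzification step can be illustrated with a minimal Python sketch. It makes several assumptions that are not part of the dataset description: the bound is taken to be 1 divided by the number of topics (the description does not say how the actual bound was calculated), values at or below the bound are mapped to 0, and the probabilities shown are made up for illustration rather than taken from "theta_matrix.xlsx". The helper names `defuzzify` and `topic_sets` are hypothetical.

```python
# Topic names as listed in the description, in an assumed column order.
TOPICS = [
    "Events",
    "Holidays",
    "Science and Innovations",
    "Educational and Scientific Activities",
    "Student Competitions",
    "Admission Campaign",
]


def defuzzify(theta_rows, bound):
    """Map each probability to 1 if it exceeds `bound`, otherwise to 0."""
    return [[1 if p > bound else 0 for p in row] for row in theta_rows]


def topic_sets(membership, doc_ids):
    """Collect, for each topic, the set of document ids assigned to it."""
    sets = {name: set() for name in TOPICS}
    for doc_id, row in zip(doc_ids, membership):
        for name, flag in zip(TOPICS, row):
            if flag:
                sets[name].add(doc_id)
    return sets


# Made-up probabilities for three documents (one row per document).
theta = [
    [0.70, 0.05, 0.05, 0.10, 0.05, 0.05],
    [0.30, 0.02, 0.30, 0.30, 0.04, 0.04],
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],
]

# Assumed bound: the uniform probability over six topics.
bound = 1 / len(TOPICS)

membership = defuzzify(theta, bound)
sets = topic_sets(membership, doc_ids=[1, 2, 3])
```

After thresholding, a document may belong to several topics at once or to none, which is why the resulting per-topic sets overlap and lend themselves to a set diagram.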
The multiple-set visualization was produced with the "supervenn" package (https://github.com/gecko984/supervenn/tree/v0.3.1, doi:10.5281/zenodo.4016732). The newsline dataset can thus be used for educational purposes and for scientific research in text processing and machine learning. The full archive of news for 2009-2021, scraped from the site www.sstu.ru, is in the file “News_Articles.csv”.

The source of the new data was the archive of news articles of Saratov State University (SSU) from April 2007 to May 2022. The primary data file was produced by web scraping the newsline of the SSU website, starting from the page www.sgu.ru/news/all. Duplicate records and records with empty fields were then removed, and text processing was applied. The result is the Excel file "News_SGU_31077_Processed_1.xlsx" with 31077 records. The presented dataset can be used for data modeling as well as for sociological research.
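The removal of duplicates and of records with empty fields might look like the following Python sketch. It is only an illustration under stated assumptions: the field names `title`, `url`, and `text`, the sample records, and the helper `clean_records` are hypothetical, and duplicates are assumed to mean records whose field values match exactly, which the description does not specify.

```python
def clean_records(records, required_fields):
    """Drop records with empty required fields, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Normalize missing values to "" and strip surrounding whitespace.
        values = tuple((rec.get(f) or "").strip() for f in required_fields)
        if not all(values):
            continue  # at least one required field is empty
        if values in seen:
            continue  # a record with the same field values was already kept
        seen.add(values)
        cleaned.append(rec)
    return cleaned


# Made-up records with hypothetical field names.
raw = [
    {"title": "Open day", "url": "/news/1", "text": "Welcome"},
    {"title": "Open day", "url": "/news/1", "text": "Welcome"},  # duplicate
    {"title": "", "url": "/news/2", "text": "No title"},  # empty field
    {"title": "Olympiad", "url": "/news/3", "text": "Results"},
]

cleaned = clean_records(raw, required_fields=("title", "url", "text"))
```

The first occurrence of each record is kept, so the original order of the newsline is preserved.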

Steps to reproduce

The content of the basic file "etalon export_file.csv" is part of the news archive of Saratov State Technical University, which is accessible from the webpage www.sstu.ru/news/ over HTTPS. The basic file was provided by the IT department of SSTU in CSV format. The full archive of news for 2009-2021, “News_Articles.csv”, was scraped from the news webpage of the site. These files are for use in teaching and scientific research only.

Institutions

Saratovskij nacional'nyj issledovatel'skij gosudarstvennyj universitet imeni N G Cernysevskogo, Saratovskij gosudarstvennyj tehniceskij universitet imeni Gagarina U A
