Thematic Modelling of News Archive from University Website
Description
This dataset contains a set of files illustrating the successive steps of thematic (topic) modeling for news-feed text documents. The base file "etalon export_file.csv" is part of the archive of the university website sstu.ru and contains 2000 Russian-language news records about events in the university's life from 2010 to 2014. Each record has three fields: a numerical record identifier, the news headline, and the news body in HTML format. The Excel file "News_tokens.xlsx" contains the tokens extracted from the news records after text processing, which consisted of the usual steps: tag removal, text cleaning, and word stemming. Its columns are: "Record number", "Identifier of news", "Head of news", "Number of words in the news", "Number of unique tokens in news", and "List of news tokens".

Thematic modeling, based on the probabilistic distribution of keywords in the text, was then performed on the news converted to Vowpal Wabbit format. The results obtained with the BigARTM platform (https://github.com/bigartm/bigartm, doi:10.5281/zenodo.288960) are given in the files "phi_matrix.xlsx" and "theta_matrix.xlsx". After a series of experiments, six topics were identified. Each of the 2000 rows of the "Theta_matrix" corresponds to a news document and gives the probabilities of its belonging to each of the six topics. The "Phi_matrix" gives the token probabilities for each topic. Based on the most frequent tokens (keywords) of each topic, the topics were named as follows: "Events", "Holidays", "Science and Innovations", "Educational and Scientific Activities", "Student Competitions", and "Admission Campaign".

To determine the most significant topics for each news item, the probability values in the "Theta_matrix" that exceed a specially calculated bound were rounded to 1, i.e., the news distribution was defuzzified. The Venn diagram in the file "Venn_diagram.png" illustrates how the news documents are distributed among the topics. The multiple-set visualization was produced with the "supervenn" package (https://github.com/gecko984/supervenn/tree/v0.3.1, doi:10.5281/zenodo.4016732). The dataset can thus be used for educational purposes and for scientific research in the fields of text processing and machine learning.
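The defuzzification step described above can be illustrated with a minimal Python sketch. It assumes the theta matrix has already been loaded as a documents-by-topics array. The toy values and the bound of 0.25 are hypothetical, since the description does not state how the actual bound was calculated, and the function name `defuzzify` is introduced here only for illustration.

```python
import numpy as np

# Topic names as listed in the description, in an assumed column order.
TOPICS = [
    "Events",
    "Holidays",
    "Science and Innovations",
    "Educational and Scientific Activities",
    "Student Competitions",
    "Admission Campaign",
]


def defuzzify(theta, bound):
    """Return a 0/1 membership matrix: 1 where a probability exceeds the bound."""
    theta = np.asarray(theta, dtype=float)
    return (theta > bound).astype(int)


# Toy stand-in for three rows of the theta matrix (documents x topics).
# These numbers are made up and are NOT taken from "theta_matrix.xlsx".
theta_demo = np.array([
    [0.70, 0.05, 0.05, 0.10, 0.05, 0.05],
    [0.30, 0.30, 0.10, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],
])

BOUND = 0.25  # hypothetical bound; the real one was "specially calculated"

membership = defuzzify(theta_demo, BOUND)

# One set of document indices per topic, the usual input for set diagrams.
topic_sets = {
    name: set(np.flatnonzero(membership[:, j]).tolist())
    for j, name in enumerate(TOPICS)
}
```

With a crisp membership matrix of this kind, a document may belong to several topics at once or to none, which is why a multiple-set diagram is a natural way to display the result.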
Files
Steps to reproduce
The content of the base file "etalon export_file.csv" is part of the news archive of Saratov State Technical University (SSTU), which is accessible at www.sstu.ru/news/ over HTTPS. The file was provided by the SSTU IT department in CSV format for use in teaching and scientific research only.
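As a rough illustration of how one HTML news record might be turned into a Vowpal Wabbit line for topic modeling, the sketch below uses only the Python standard library. It is not the authors' pipeline: the regular-expression tag removal is deliberately naive, stemming is left as an identity placeholder (`stem`) where a real Russian stemmer would be plugged in, and the `doc_` prefix, the `text` modality name, and the sample markup are all assumptions made for the example.

```python
import html
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")      # naive tag matcher, fine for a sketch
WORD_RE = re.compile(r"[a-zа-яё]+")  # lowercase Latin and Cyrillic words


def strip_tags(markup):
    """Replace HTML tags with spaces and decode entities such as &nbsp;."""
    return html.unescape(TAG_RE.sub(" ", markup))


def tokenize(text, stem=lambda word: word):
    """Lowercase, split into words, and apply a stemmer (identity by default)."""
    return [stem(word) for word in WORD_RE.findall(text.lower())]


def to_vw_line(record_id, tokens, modality="text"):
    """Build one Vowpal Wabbit line: document label, modality, token[:count]."""
    counts = Counter(tokens)
    parts = [
        f"{token}:{count}" if count > 1 else token
        for token, count in sorted(counts.items())
    ]
    return f"doc_{record_id} |{modality} " + " ".join(parts)


# Hypothetical record; the identifier and markup are invented for the example.
sample_html = "<p>Студенты и&nbsp;студенты <b>СГТУ</b></p>"
tokens = tokenize(strip_tags(sample_html))
line = to_vw_line(42, tokens)
print(line)
```

In practice a stop-word filter and a proper stemmer would be applied before counting, so that inflected forms of the same word collapse into one token.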