Thematic Modelling of News Archive from University Website
This dataset contains a set of files illustrating the successive steps of thematic modeling for the text documents of a news feed. The file "etalon export_file.csv" contains 2000 Russian-language news records, which are part of the archive of the university website sstu.ru. Each record has a numerical record identifier, a news headline, and the news body as HTML code. The Excel file "News_tokens.xlsx" contains the tokens extracted from the news records after text processing, which included tag removal, text cleaning, and word stemming. Its columns are: "Record number", "Identifier of news", "Head of news", "Number of words in the news", "Number of unique tokens in news", and "List of news tokens". Thematic modeling was then performed on the news, represented in Vowpal Wabbit format, based on the probabilistic distribution of keywords in the text. The results of thematic modeling with the BigARTM platform (https://github.com/bigartm/bigartm, doi:10.5281/zenodo.288960) are presented in the files "phi_matrix.xlsx" and "theta_matrix.xlsx". After a series of experiments, six topics were identified. Each of the 2000 rows in the "Theta_matrix" gives the probabilities that the corresponding news document belongs to each of the six topics. The "Phi_matrix" contains the token probabilities for each topic. Based on the most frequent tokens (keywords) of each topic, the topics were named as follows: "Events", "Holidays", "Science and Innovations", "Educational and Scientific Activities", "Student Competitions", and "Admission Campaign". To determine the most significant topics for each news item, the probability values in the "Theta_matrix" greater than a specially calculated bound were rounded to 1, i.e., the distribution of news over topics was defuzzified. The Venn diagram in the file "Venn_diagram.png" illustrates which topics the news documents belong to.
The visualization of multiple sets was produced with the "supervenn" package (https://github.com/gecko984/supervenn/tree/v0.3.1, doi:10.5281/zenodo.4016732). The news dataset can thus be used for educational purposes and scientific research in text processing and machine learning. The full news archive for 2009-2021, scraped from the site www.sstu.ru, is in the file "News_Articles.csv" and may be useful for further studies in data modeling as well as in the social sciences.
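The defuzzification step described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the description does not give the bound or how it was calculated, so the value 0.25 is hypothetical, as are the function names, the toy probability rows, and the assumption that the topic columns follow the order in which the topics are listed above.

```python
# Topic names from the dataset description; their column order in the
# Theta matrix is an assumption made for this sketch.
TOPICS = [
    "Events",
    "Holidays",
    "Science and Innovations",
    "Educational and Scientific Activities",
    "Student Competitions",
    "Admission Campaign",
]


def defuzzify(theta_rows, bound):
    """Turn each row of topic probabilities into 0/1 flags.

    A flag is 1 where the probability is strictly greater than `bound`.
    """
    return [[1 if p > bound else 0 for p in row] for row in theta_rows]


def topic_sets(binary_rows, topic_names):
    """Collect, for every topic, the set of document indices flagged for it."""
    sets = {name: set() for name in topic_names}
    for doc_index, row in enumerate(binary_rows):
        for name, flag in zip(topic_names, row):
            if flag:
                sets[name].add(doc_index)
    return sets


# Toy rows standing in for lines of the Theta matrix (made-up values).
theta = [
    [0.70, 0.05, 0.05, 0.10, 0.05, 0.05],
    [0.30, 0.30, 0.10, 0.10, 0.10, 0.10],
    [0.02, 0.03, 0.05, 0.40, 0.10, 0.40],
]
bound = 0.25  # hypothetical bound; the dataset's actual bound is not stated

binary = defuzzify(theta, bound)
sets = topic_sets(binary, TOPICS)

for name in TOPICS:
    print(name, sorted(sets[name]))
```

A document may end up in several topic sets at once, or in none, which is why the result is naturally shown as overlapping sets rather than as a single label per document.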
Steps to reproduce
The content of the basic file "etalon export_file.csv" is part of the news archive of Saratov State Technical University, which is accessible from the webpage www.sstu.ru/news/ over HTTPS. The basic file was provided by the IT department of SSTU in CSV format; the full news archive for 2009-2021, "News_Articles.csv", was scraped from the news webpage of the site. These files are for use in teaching and scientific research only.