Thematic Modelling of News Archive from University Website
This dataset contains a set of files that support and illustrate the successive steps of thematic modeling for newsline text documents, together with data for further investigation. The file "etalon export_file.csv" contains 2000 Russian-language news records, which are part of the archive of the university website sstu.ru. Each record has a numerical record identifier, a news headline, and the URL code of the news item. The Excel file "News_tokens.xlsx" contains the tokens extracted from the news records after text processing, which included tag removal, text cleaning, and word stemming. Its columns are: "Record number", "Identifier of news", "Head of news", "Number of words in the news", "Number of unique tokens in news", and "List of news tokens".
Thematic modeling was then performed on the news in Vowpal Wabbit format, based on the probabilistic distribution of keywords in the text. The results of thematic modeling via the BigARTM platform (https://github.com/bigartm/bigartm, doi:10.5281/zenodo.288960) are presented in the files "phi_matrix.xlsx" and "theta_matrix.xlsx". After a series of experiments, six topics were identified. Each of the 2000 rows in "Theta_matrix" gives the probabilities that the corresponding news document belongs to each of the six topics. "Phi_matrix" gives the token probabilities for each topic. Based on the most frequent tokens (keywords) of each topic, the topics were named as follows: "Events", "Holidays", "Science and Innovations", "Educational and Scientific Activities", "Student Competitions", "Admission Campaign".
To determine the most significant topics for each news item, the probability values in "Theta_matrix" greater than a specially calculated bound were rounded to 1, i.e., the news distribution was defuzzified. The Venn diagram in the file "Venn_diagram.png" illustrates the membership of news documents in the topics.
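The defuzzification step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bound value 0.25 and the sample probabilities are made up (the description does not give the calculated bound), and values not exceeding the bound are assumed to be set to 0. The function names are hypothetical and are not part of the dataset.

```python
# Topic names as listed in the dataset description.
TOPICS = [
    "Events",
    "Holidays",
    "Science and Innovations",
    "Educational and Scientific Activities",
    "Student Competitions",
    "Admission Campaign",
]


def defuzzify(theta_rows, bound):
    """Replace each probability by 1 if it exceeds the bound, else by 0.

    Assumption: values at or below the bound become 0; the description
    only states that values above the bound are rounded to 1.
    """
    return [[1 if p > bound else 0 for p in row] for row in theta_rows]


def topic_sets(binary_rows, topics):
    """Collect, for every topic, the set of document indices assigned to it."""
    sets = {name: set() for name in topics}
    for doc_id, row in enumerate(binary_rows):
        for name, flag in zip(topics, row):
            if flag:
                sets[name].add(doc_id)
    return sets


# Made-up probabilities for three documents (not taken from theta_matrix.xlsx).
theta = [
    [0.70, 0.05, 0.05, 0.10, 0.05, 0.05],
    [0.30, 0.02, 0.28, 0.30, 0.05, 0.05],
    [0.10, 0.10, 0.10, 0.10, 0.10, 0.50],
]
BOUND = 0.25  # hypothetical bound, chosen only for this example

binary = defuzzify(theta, BOUND)
membership = topic_sets(binary, TOPICS)

for name in TOPICS:
    print(name, sorted(membership[name]))
```

The resulting dictionary of sets, one per topic, has the shape that set-visualization tools generally expect, since a document may belong to several topics at once.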
The multiple-set visualization was performed using the "supervenn" package (https://github.com/gecko984/supervenn/tree/v0.3.1, doi:10.5281/zenodo.4016732). Thus, the dataset of newsline documents can be used for educational purposes and scientific research in text processing and machine learning. The full archive of news for 2009-2021, scraped from the site www.sstu.ru, is in the file "News_Articles.csv". The source of the new data was the news archive of Saratov State University (SSU) from April 2007 to May 2022. The primary data file was formed by web scraping the newsline of the SSU website, starting from the page www.sgu.ru/news/all. Then duplicate records and records with empty fields were removed, and text processing was applied. As a result, the Excel file "News_SGU_31077_Processed_1.xlsx" containing 31077 records was formed. The presented dataset can be used for data modeling, as well as in sociological research.
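The cleaning step mentioned above (dropping duplicate records and records with empty fields) might look roughly like the following sketch. The field names "title", "url", and "text" and the sample records are hypothetical; the actual column names of the scraped file are not given in this description, and the authors' own procedure may differ.

```python
def clean_records(records, required_fields):
    """Drop records with empty required fields, then drop exact duplicates.

    Two records count as duplicates when all required fields match
    after stripping surrounding whitespace. The first occurrence is kept.
    """
    seen = set()
    cleaned = []
    for rec in records:
        values = tuple((rec.get(field) or "").strip() for field in required_fields)
        if not all(values):
            continue  # at least one required field is missing or empty
        if values in seen:
            continue  # same content already kept
        seen.add(values)
        cleaned.append(rec)
    return cleaned


# Made-up records with hypothetical field names.
raw = [
    {"title": "Open day", "url": "/news/1", "text": "Welcome to the open day."},
    {"title": "Open day", "url": "/news/1", "text": "Welcome to the open day."},
    {"title": "Conference", "url": "/news/2", "text": ""},
    {"title": "Olympiad results", "url": "/news/3", "text": "Winners announced."},
]

cleaned = clean_records(raw, ["title", "url", "text"])
print(len(raw), "->", len(cleaned))
```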
Steps to reproduce
The content of the basic file "etalon export_file.csv" is part of the news archive of Saratov State Technical University (SSTU), which is accessible from the webpage www.sstu.ru/news/ via HTTPS. The basic file was provided by the IT department of SSTU in CSV format; the full news archive for 2009-2021, "News_Articles.csv", was scraped from the news webpage of the site. These files are for use in teaching and scientific research only.