Dataset of bibliometric records and topic modelling outputs for AI-driven boiler optimisation in thermal power plants (2014–2025)

Published: 3 July 2026 | Version 1 | DOI: 10.17632/b2d4fvt6dr.1
Contributor:
Opeyemi Akerekan

Description

This dataset supports the study of artificial intelligence (AI)-driven boiler optimisation in thermal power plants through bibliometric analysis and topic modelling. It comprises bibliographic records retrieved from the Scopus database for English-language journal articles and review papers published between 2014 and 2025. The dataset was developed to examine the evolution of research on AI applications for boiler performance optimisation, combustion control, emissions reduction, predictive maintenance, fault diagnosis, and intelligent monitoring in thermal power generation.

The dataset includes raw bibliometric records, processed text data, document–term matrices, Latent Dirichlet Allocation (LDA) outputs, temporal topic trends, and supporting bibliometric statistics. The LDA model identifies seven latent research topics from document titles, abstracts, and author keywords. The LDA outputs comprise document–topic probability distributions (theta.csv), topic–term probability distributions (beta.csv), dominant topic assignments, top-ranked topic terms, topic labels, and annual topic prevalence. These files enable users to investigate thematic structures, analyse the evolution of research topics over time, and compare topic distributions across publications. The dataset also includes processed text files (cleaned_corpus.csv and dtm.csv) to facilitate replication of the topic modelling workflow. Supporting files such as document_topic_full.csv and country_stats.csv provide integrated bibliometric metadata and publication statistics for further analysis.

The data can be interpreted at multiple levels. Bibliometric records support analyses of publication trends, research productivity, collaboration patterns, and institutional or country contributions. The LDA outputs provide probabilistic representations of document topics, where a higher topic probability indicates stronger thematic relevance of that topic to the document. Topic–term probabilities identify the most representative terms within each topic, while annual topic prevalence enables assessment of changes in research emphasis over time.

Researchers may use this dataset to reproduce the published analysis, evaluate alternative topic modelling approaches, benchmark text mining methods, conduct scientometric studies, or investigate emerging trends in AI applications for thermal power plants and energy systems. The dataset is compatible with R, Python, MATLAB, and other software environments that support CSV-formatted data.
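
As an illustration of how a document–topic matrix such as theta.csv can be read, the following minimal Python sketch picks the highest-probability topic for each document. It runs on a small made-up excerpt rather than the real file; the column names (doc_id, topic_1, ...) and the values are assumptions, since the actual layout of theta.csv is not described here.

```python
import csv
import io

# Hypothetical excerpt in the spirit of theta.csv: one row per document,
# one column per topic. Column names and values are illustrative only.
SAMPLE_THETA = """doc_id,topic_1,topic_2,topic_3
d1,0.70,0.20,0.10
d2,0.15,0.25,0.60
d3,0.30,0.50,0.20
"""


def dominant_topics(csv_text, id_column="doc_id"):
    """Return {doc_id: (topic_name, probability)} for each document's top topic."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc_id = row.pop(id_column)
        probs = {topic: float(p) for topic, p in row.items()}
        # Each row of a document-topic matrix should sum to roughly 1.
        if abs(sum(probs.values()) - 1.0) > 1e-6:
            raise ValueError(f"row {doc_id} does not sum to 1")
        best = max(probs, key=probs.get)
        result[doc_id] = (best, probs[best])
    return result


assignments = dominant_topics(SAMPLE_THETA)
for doc_id, (topic, prob) in assignments.items():
    print(doc_id, topic, prob)
```

To apply the same idea to the real file, the sample string would be replaced by the contents of theta.csv and id_column adjusted to whatever identifier column the file actually uses.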

Files

Steps to reproduce

The dataset was developed through a bibliometric and topic modelling workflow to investigate research on artificial intelligence (AI)-driven boiler optimisation in thermal power plants published between 2014 and 2025. Bibliographic records were retrieved from the Scopus database using a structured search strategy designed to capture publications related to artificial intelligence, machine learning, deep learning, boiler systems, combustion optimisation, and thermal power plants. Only English-language journal articles and review papers published between 2014 and 2025 were included. The bibliographic records were exported in CSV format as scopus_data.csv.

The document titles, abstracts, and author keywords were combined to form the text corpus for analysis. Text preprocessing involved converting text to lowercase, removing punctuation, numbers, and stopwords, followed by tokenisation, stemming or lemmatisation, and rare-term filtering. The resulting corpus was saved as cleaned_corpus.csv, and a document–term matrix (dtm.csv) was generated for topic modelling.

Latent Dirichlet Allocation (LDA) was applied to the document–term matrix using seven topics (K = 7). The model was estimated using Gibbs sampling. The resulting document–topic probability matrix (theta.csv) and topic–term probability matrix (beta.csv) were exported. Additional outputs include the complete topic–term matrix (LDA_topic_term_matrix.csv), the highest-probability terms for each topic (LDA_top_terms_by_topic.csv), and the dominant topic assigned to each document (LDA_document_topic_assignment.csv). Topic labels were assigned through interpretation of the highest-probability terms within each topic.

Temporal topic trends were generated by aggregating document-level topic probabilities by publication year and applying LOESS smoothing to illustrate changes in topic prevalence over time. These results are provided in topic_trends.csv.
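
The preprocessing and document–term matrix steps described above can be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not the workflow actually used to build cleaned_corpus.csv or dtm.csv: the stopword list, the min_doc_freq threshold, and the example documents are all made up, and stemming or lemmatisation is left out for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real workflow would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with", "using"}


def preprocess(text):
    """Lowercase, drop punctuation and digits, tokenise, remove stopwords."""
    text = text.lower()
    # Keep ASCII letters and whitespace only; everything else becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def build_dtm(docs, min_doc_freq=2):
    """Return (vocabulary, count matrix), dropping terms found in fewer
    than min_doc_freq documents (a simple form of rare-term filtering)."""
    tokenised = [preprocess(d) for d in docs]
    doc_freq = Counter(term for toks in tokenised for term in set(toks))
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    matrix = []
    for toks in tokenised:
        counts = Counter(toks)
        matrix.append([counts[t] for t in vocab])
    return vocab, matrix


# Made-up example documents standing in for combined title/abstract/keyword text.
docs = [
    "Neural network control of boiler combustion in 2019.",
    "Boiler combustion optimisation using a neural network!",
    "Fault diagnosis for boiler tubes",
]
vocab, dtm = build_dtm(docs)
print(vocab)
for row in dtm:
    print(row)
```

Each row of the resulting matrix corresponds to one document and each column to one retained term, which is the input shape an LDA implementation such as those in topicmodels or gensim would expect after conversion to its own format.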
Supporting files include document_topic_full.csv, which combines bibliographic metadata with document-level topic distributions, and country_stats.csv, which summarises publication counts by country. The dataset can be reproduced using standard bibliometric and text mining tools in R or Python. Commonly used R packages include bibliometrix, tm, topicmodels, tidytext, slam, dplyr, and ggplot2, while equivalent workflows can be implemented in Python using pandas, scikit-learn, gensim, and related libraries.
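
As a rough illustration of the aggregation behind the temporal trends, the sketch below averages document-level topic probabilities by publication year, the kind of calculation that a file combining metadata and topic distributions (such as document_topic_full.csv) makes possible. The column names (year, topic_1, topic_2) and values are hypothetical, taking the mean as the aggregate is an assumption, and the LOESS smoothing step is not reproduced here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt: publication year plus per-document topic probabilities.
SAMPLE = """year,topic_1,topic_2
2014,0.8,0.2
2014,0.6,0.4
2015,0.3,0.7
"""


def annual_prevalence(csv_text, year_column="year"):
    """Return {year: {topic: mean probability}} across documents in that year."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = int(row.pop(year_column))
        counts[year] += 1
        for topic, p in row.items():
            sums[year][topic] += float(p)
    return {
        year: {topic: total / counts[year] for topic, total in sums[year].items()}
        for year in sorted(sums)
    }


trends = annual_prevalence(SAMPLE)
for year, topics in trends.items():
    print(year, topics)
```

A smoother such as LOESS (for example via ggplot2 in R or statsmodels in Python) could then be applied to each topic's yearly series.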

Institutions

Categories

Computer Science, Engineering, Energy Engineering, Information Science, Boiler

Licence