Brazilian University Paper Abstracts Dataset (2013-2023), Classified by Sustainable Development Goals (SDGs)

Published: 24 June 2024 | Version 1 | DOI: 10.17632/hzgs5kz2bc.1
Contributors:

Description

The data were collected from SciVal, a platform that hosts Scopus statistics. All metadata cover papers published between 2013 and 2023 by the top 25 Brazilian universities according to the 2023 Center for World University Rankings (CWUR). The dataset contains abstracts of published scientific papers classified according to the Sustainable Development Goals (SDGs) by the Scopus team.

The original dataset consists of 15,488 records and 20 columns. We preprocessed the data to train a language model capable of classifying Brazilian research projects according to the SDGs. During preprocessing, we removed duplicate records, multi-label entries, samples missing abstracts, and unnecessary columns. The preprocessed dataset contains 13,789 records and two columns, with the SDG classification stored in the "label" column. Labels range from 1 to 17, representing the 17 SDGs in order.

After preprocessing, we balanced the dataset by equalizing every class to 300 records. For majority classes with more than 300 records, we reduced the count to 300. For minority classes with fewer than 300 records, we generated the missing records with the generative model Mixtral-8x7B-Instruct-v0.1, using the real abstracts as examples. This dataset is intended as a resource for training language models that classify scientific texts from Brazil according to the SDGs.

The 17 SDGs are:

1. No Poverty
2. Zero Hunger
3. Good Health and Well-being
4. Quality Education
5. Gender Equality
6. Clean Water and Sanitation
7. Affordable and Clean Energy
8. Decent Work and Economic Growth
9. Industry, Innovation and Infrastructure
10. Reduced Inequalities
11. Sustainable Cities and Communities
12. Responsible Consumption and Production
13. Climate Action
14. Life Below Water
15. Life on Land
16. Peace, Justice and Strong Institutions
17. Partnerships for the Goals
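The preprocessing and balancing steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' actual pipeline: the field names "abstract" and "sdgs" are hypothetical (only the "label" column is named in the description), duplicates are assumed to be detected by identical abstract text, and majority classes are assumed to be reduced by random sampling, which the description does not specify. The generation of synthetic abstracts with Mixtral-8x7B-Instruct-v0.1 is not reproduced; the sketch only reports how many records each minority class would still need.

```python
import random
from collections import defaultdict

TARGET_PER_CLASS = 300  # from the description: every class is equalized to 300


def preprocess(records):
    """Keep single-label records with a non-empty abstract, dropping duplicates.

    Each record is a dict. "abstract" and "sdgs" (a list of SDG numbers) are
    hypothetical field names, not the actual SciVal export columns.
    """
    seen = set()
    cleaned = []
    for rec in records:
        abstract = (rec.get("abstract") or "").strip()
        sdgs = rec.get("sdgs") or []
        if not abstract:        # sample missing its abstract
            continue
        if len(sdgs) != 1:      # multi-label (or unlabeled) entry
            continue
        if abstract in seen:    # duplicate, assumed to mean identical text
            continue
        seen.add(abstract)
        # Drop all other columns: keep only the text and its SDG label.
        cleaned.append({"abstract": abstract, "label": sdgs[0]})
    return cleaned


def balance(cleaned, target=TARGET_PER_CLASS, seed=0):
    """Downsample majority classes; report the shortfall of minority classes.

    Random downsampling is an assumption made for this sketch.
    Returns (kept_records, {label: number_of_records_still_to_generate}).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in cleaned:
        by_label[rec["label"]].append(rec)

    kept, to_generate = [], {}
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) > target:
            kept.extend(rng.sample(group, target))
        else:
            kept.extend(group)
            if len(group) < target:
                to_generate[label] = target - len(group)
    return kept, to_generate
```

Calling `balance(preprocess(records))` returns the real records that are kept plus a per-label count of synthetic abstracts still to be generated. By the figures in the description, preprocessing removes 15,488 - 13,789 = 1,699 records, and if all 17 classes are present the balanced dataset would hold 17 × 300 = 5,100 records.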

Files

Categories

Machine Learning, Deep Learning, Large Language Model

Licence