OpenScience Slovenia document metadata dataset

Published: 5 November 2019 | Version 1 | DOI: 10.17632/7wh9xvvmgk.1
Mladen Borovič,
Marko Ferme,
Janez Brezovnik,
Sandi Majninger,
Albin Bregant,
Goran Hrovat,
Milan Ojsteršek


The OpenScience Slovenia metadata dataset contains metadata entries for Slovenian public domain academic documents, including undergraduate and postgraduate theses, research and professional articles, and other academic document types. The data was collected during the establishment of the Slovenian Open-Access Infrastructure, which defined a unified document collection and cataloguing process for universities in Slovenia within the infrastructure repositories. It was gathered from several established but separate library systems in Slovenia and merged into a single metadata scheme using metadata deduplication and merging techniques. The dataset consists of text and numerical fields representing attributes that describe documents. These attributes include document titles, keywords, abstracts, typologies, authors, issue years, and other identifiers such as URL and UDC. The dataset is particularly suited to text mining and text classification tasks, and it can also be used to develop or benchmark content-based recommender systems on real-world data.
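
The description mentions metadata deduplication and merging without detailing the techniques used. As a purely illustrative sketch, and not the authors' actual method, the Python below shows one simple way such a step could work: records from different sources are grouped by a normalized title plus issue year, and each group is merged by filling empty fields and combining keywords. The field names (`title`, `year`, `keywords`, `abstract`) and the sample records are invented for this example and may not match the dataset's real schema.

```python
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_marks.lower())
    return " ".join(cleaned.split())


def merge_records(records):
    """Group records by (normalized title, year) and merge each group."""
    groups = defaultdict(list)
    for record in records:
        key = (normalize_title(record["title"]), record["year"])
        groups[key].append(record)

    merged = []
    for group in groups.values():
        combined = dict(group[0])  # start from the first record seen
        for other in group[1:]:
            for field, value in other.items():
                if field == "keywords":
                    # Union of keywords, keeping first-seen order.
                    keywords = list(combined.get("keywords", []))
                    keywords += [k for k in value if k not in keywords]
                    combined["keywords"] = keywords
                elif not combined.get(field):
                    # Fill a missing or empty field from the duplicate.
                    combined[field] = value
        merged.append(combined)
    return merged


# Invented toy records (hypothetical schema), two of which are duplicates.
records = [
    {"title": "Priporočilni sistemi", "year": 2015,
     "keywords": ["recommender systems"], "abstract": ""},
    {"title": "PRIPOROCILNI  SISTEMI.", "year": 2015,
     "keywords": ["text mining"], "abstract": "An overview."},
    {"title": "Rudarjenje besedil", "year": 2017,
     "keywords": ["text mining"], "abstract": "A survey."},
]
merged = merge_records(records)
```

Exact matching on a normalized key is the simplest possible choice; a real pipeline across several library systems would likely also need fuzzy title matching and author comparison, which this sketch leaves out.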



Natural Language Processing, Metadata, Applied Computer Science, Categorization, Recommendation System, Text Mining