Published: 30-12-2016 | Version 1 | DOI: 10.17632/p3jkppwr29.1
Marcos Mouriño García,
Roberto Pérez-Rodríguez,
Luis Eulogio Anido-Rifón


UVigoMED is a monolingual (English) and single-label corpus composed of 92,661 biomedical abstracts extracted from MEDLINE, classified under 26 MeSH categories.


Steps to reproduce

First, we selected the classification scheme: the general MeSH terms of the “diseases” group, the same scheme used in OHSUMED. Note that we built the UVigoMED corpus from the 2015 MeSH tree structure, in which the diseases group contains 26 categories, rather than the 23 it contained when OHSUMED was created.

To build the corpus, we performed the following steps:
• We downloaded from MEDLINE the descriptions (HTML web pages) of all articles from 2014 classified under each of the 26 categories.
• We extracted from each article description the title, the abstract, and the categories it belongs to.
• We stored in our database the title, abstract, and categories of each downloaded article description.

As a result, we obtained a corpus of 92,661 biomedical articles, each classified under one or more of the 26 available categories. Finally, to create the training and test sequences, we randomly selected 18,532 documents for the test sequence, leaving the remaining 74,129 for the training sequence.

To carry out the single-label experiments, we created a subset of this corpus containing only the documents that belong to exactly one category, removing those that belong to more than one. The resulting corpus comprises 54,853 documents, each classified under one of the 26 categories, split randomly into a training sequence of 43,882 documents and a test sequence of 10,971 documents.
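
The single-label filtering and random split described above can be illustrated with a minimal Python sketch. This is not the authors' code: the record layout (a dict with `title`, `abstract`, and `categories` keys), the function names `single_label_subset` and `random_split`, the seed, and the toy category labels are all assumptions made for illustration.

```python
import random


def single_label_subset(docs):
    """Keep only documents assigned to exactly one category."""
    return [d for d in docs if len(d["categories"]) == 1]


def random_split(docs, n_test, seed=0):
    """Shuffle a copy of docs and return (train, test) with n_test test items."""
    if not 0 <= n_test <= len(docs):
        raise ValueError("n_test must be between 0 and len(docs)")
    shuffled = list(docs)                   # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    return shuffled[n_test:], shuffled[:n_test]


# Toy records in a hypothetical layout; the category labels are placeholders.
docs = [
    {"title": "t1", "abstract": "text 1", "categories": ["cat_a"]},
    {"title": "t2", "abstract": "text 2", "categories": ["cat_a", "cat_b"]},
    {"title": "t3", "abstract": "text 3", "categories": ["cat_b"]},
    {"title": "t4", "abstract": "text 4", "categories": ["cat_c"]},
]

single = single_label_subset(docs)          # drops the multi-category record
train, test = random_split(single, n_test=1)
```

With the full corpus, the same two calls would be made with `n_test=18532` on all 92,661 documents and with `n_test=10971` on the 54,853 single-label documents, which matches the split sizes reported above (about 20% of each set held out for testing).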