UVigoMED

Published: 30 Dec 2016 | Version 1 | DOI: 10.17632/p3jkppwr29.1
Description of this data

UVigoMED is a monolingual (English), single-label corpus composed of 92,661 biomedical abstracts extracted from MEDLINE and classified under 26 MeSH categories.

Steps to reproduce

First, we selected the classification scheme: the general terms of the MeSH “diseases” group, the same scheme used in OHSUMED. Note that, to create the UVigoMED corpus, we used the 2015 MeSH tree structure, in which the diseases group contains 26 categories instead of the 23 it contained when OHSUMED was created. To build the corpus, we performed the following steps:

• We downloaded from MEDLINE the descriptions (HTML web pages) of all articles from 2014 classified under each of the 26 categories.

• We extracted from each article description: the title, the abstract, and the categories it belongs to.

• We stored in our database the title, abstract and categories for each article description that was downloaded.

As a result, we obtained a corpus of 92,661 biomedical articles, each classified in one or several of the 26 available categories. Finally, to create the training and test sequences, we randomly selected 18,532 documents as the test sequence, leaving the remaining 74,129 as the training sequence.
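The random split described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name `split_corpus`, the fixed seed, and the use of integer identifiers in place of real documents are all assumptions made for the example. Only the corpus size and the test size come from the description.

```python
import random


def split_corpus(doc_ids, test_size, seed=0):
    """Randomly split document ids into (train, test).

    Hypothetical helper: the seed and the shuffling strategy are
    assumptions, not details given in the dataset description.
    """
    ids = list(doc_ids)
    rng = random.Random(seed)  # local generator, so the split is repeatable
    rng.shuffle(ids)
    return ids[test_size:], ids[:test_size]


# Sizes reported in the description; integer ids stand in for documents.
train_ids, test_ids = split_corpus(range(92661), test_size=18532)
print(len(train_ids), len(test_ids))
```

Because the seed used to build the published split is not stated, this sketch would reproduce the sizes of the two sequences but not their exact membership.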

To carry out the single-label experiments, we created a subset of this corpus containing only the documents that belong to exactly one category, removing those that belonged to more than one. The resulting corpus comprises 54,853 documents, each classified in one of the 26 categories, randomly split into a training sequence of 43,882 documents and a test sequence of 10,971 documents.
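The single-label filtering step can be illustrated with a short sketch. The record layout (a dictionary with `title`, `abstract`, and `categories` keys), the function name `single_label_subset`, and the toy records are assumptions made for the example. The category codes are used only as illustrative labels, and the real file format of the dataset may differ.

```python
def single_label_subset(corpus):
    """Keep only the documents assigned to exactly one category."""
    return [doc for doc in corpus if len(doc["categories"]) == 1]


# Toy records with a hypothetical layout; not taken from the dataset.
toy_corpus = [
    {"title": "A", "abstract": "placeholder", "categories": ["C04"]},
    {"title": "B", "abstract": "placeholder", "categories": ["C04", "C14"]},
    {"title": "C", "abstract": "placeholder", "categories": ["C23"]},
]

subset = single_label_subset(toy_corpus)
print([doc["title"] for doc in subset])  # documents A and C remain
```

Applied to the full corpus, a filter of this kind would be expected to leave the 54,853 single-category documents mentioned above, which are then split at random into the 43,882 training and 10,971 test items.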

Latest version

  • Version 1

    Published: 2016-12-30

    DOI: 10.17632/p3jkppwr29.1

    Cite this dataset

    Mouriño-García, Marcos Antonio; Pérez-Rodríguez, Roberto; Anido-Rifón, Luis Eulogio (2016), “UVigoMED”, Mendeley Data, v1 http://dx.doi.org/10.17632/p3jkppwr29.1

Institutions

University of Vigo

Categories

Data Mining, Biological Classification

Licence

CC BY NC 3.0

The files associated with this dataset are licensed under an Attribution-NonCommercial 3.0 Unported licence.

What does this mean?

You are free to adapt, copy or redistribute the material, providing you attribute appropriately and do not use the material for commercial purposes.
