Wikipedia Human Medicine Corpus

Published: 4 Nov 2017 | Version 2 | DOI: 10.17632/sp9mcx5594.2
Contributor(s): Marcos Mouriño García, Roberto Pérez-Rodríguez, Luis Eulogio Anido-Rifón

Description of this data

The Wikipedia Human Medicine Corpus is a bilingual (Spanish-English), single-label corpus composed of 2,143 documents about human medicine extracted from Wikipedia and written in English, and 469 documents written in Spanish, classified into the following 22 categories: Alternative medicine, Cardiology, Endocrinology, Forensics, Gastroenterology, Human genetics, Geriatrics, Gerontology, Gynecology, Hematology, Nephrology, Neurology, Obstetrics, Oncology, Ophthalmology, Orthopedic surgical procedures, Pathology, Pediatrics, Psychiatry, Rheumatology, Surgery and Urology.

Steps to reproduce

We first selected the category “Human medicine”, placed under the “Health and fitness” Portal:Contents page (https://en.wikipedia.org/wiki/Portal:Contents/Health_and_fitness), which has 22 subcategories: Alternative medicine, Cardiology, Endocrinology, Forensics, Gastroenterology, Human genetics, Geriatrics, Gerontology, Gynecology, Hematology, Nephrology, Neurology, Obstetrics, Oncology, Ophthalmology, Orthopedic surgical procedures, Pathology, Pediatrics, Psychiatry, Rheumatology, Surgery and Urology.

To create the training set of the corpus, we performed the following steps. First, we selected the articles classified under each of the aforementioned categories. Next, we parsed the HTML code of each article to extract the title and the whole body. Finally, we labelled each article with the category to which it belongs.
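The parsing and labelling steps can be sketched as follows. This is only an illustration, not the authors' code: the description does not say which parser or which HTML elements were used, so the sketch assumes that the title is the text of the first <h1> element and that the body is the text of all <p> elements. The names ArticleExtractor and extract_labelled_article are hypothetical.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects the first <h1> as the title and every <p> as a body paragraph.

    Assumption: title = first <h1>, body = all <p> text. The dataset
    description does not specify the elements that were actually used.
    """

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []  # one list of text fragments per <p>
        self._in_title = False
        self._title_done = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._title_done:
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append([])

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_title:
            self._in_title = False
            self._title_done = True
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_p:
            self.paragraphs[-1].append(data)


def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def extract_labelled_article(html, category):
    """Return the title, body and category label of one article page."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    title = normalize("".join(parser.title_parts))
    body_paragraphs = [normalize("".join(p)) for p in parser.paragraphs]
    body = "\n".join(p for p in body_paragraphs if p)
    return {"title": title, "body": body, "category": category}


# Inline sample page, so the sketch runs without network access.
sample_html = (
    "<html><body><h1>Cardiac <i>arrest</i></h1>"
    "<p>Cardiac arrest is a <b>sudden</b> loss of blood flow.</p>"
    "<table><tr><td>ignored</td></tr></table>"
    "<p>It is a medical emergency.</p></body></html>"
)
article = extract_labelled_article(sample_html, "Cardiology")
```

Because the corpus is single-label, each record carries exactly one category, namely the subcategory from which the article was selected.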

To create the test set of the Wikipedia Human Medicine Corpus, we performed the following steps for each article of the training set. First, we obtained the corresponding Spanish Wikipedia article by following the link provided in the English Wikipedia article, if one was available. Then, we parsed the HTML code to extract the title and the whole body of the Spanish article. Finally, we labelled each article with the category to which it belongs. As a result, we obtained a corpus formed by a training set of 2,143 Wikipedia articles written in English and a test set of 469 Wikipedia articles written in Spanish.
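The link-following step can be sketched as follows. The description does not say how the Spanish link was located, so this sketch assumes that the English page marks its interlanguage links as <a> elements carrying an hreflang attribute. The names LanguageLinkFinder and find_language_link are hypothetical.

```python
from html.parser import HTMLParser


class LanguageLinkFinder(HTMLParser):
    """Records the href of the first <a> whose hreflang matches the target.

    Assumption: interlanguage links carry an hreflang attribute.
    """

    def __init__(self, lang):
        super().__init__()
        self.lang = lang
        self.url = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("hreflang") == self.lang and attributes.get("href"):
            self.url = attributes["href"]


def find_language_link(html, lang="es"):
    """Return the URL of the article in the given language, or None."""
    finder = LanguageLinkFinder(lang)
    finder.feed(html)
    finder.close()
    return finder.url


# Inline sample fragment, so the sketch runs without network access.
sample_html = (
    '<ul>'
    '<li><a href="https://fr.wikipedia.org/wiki/Cardiologie" hreflang="fr">Français</a></li>'
    '<li><a href="https://es.wikipedia.org/wiki/Cardiolog%C3%ADa" hreflang="es">Español</a></li>'
    '</ul>'
)
spanish_url = find_language_link(sample_html)
missing_url = find_language_link('<a href="/wiki/Heart">Heart</a>')
```

Returning None when no Spanish link exists mirrors the "if one was available" condition, which is why the Spanish test set (469 articles) is smaller than the English training set (2,143 articles).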

Latest version

  • Version 2

    2017-11-04

    Cite this dataset

    Mouriño García, Marcos; Pérez-Rodríguez, Roberto; Anido-Rifón, Luis Eulogio (2017), “Wikipedia Human Medicine Corpus”, Mendeley Data, v2 http://dx.doi.org/10.17632/sp9mcx5594.2

Previous versions

  • Version 1 (unavailable)

    2016-12-30

Institutions

University of Vigo

Categories

Data Mining, Biological Classification

Licence

CC BY-NC 3.0

The files associated with this dataset are licensed under an Attribution-NonCommercial 3.0 Unported licence.

What does this mean?

You are free to adapt, copy or redistribute the material, provided that you give appropriate attribution and do not use the material for commercial purposes.
