Wikipedia Human Medicine Corpus

Published: 4 November 2017 | Version 2 | DOI: 10.17632/sp9mcx5594.2
Contributors:
Marcos Mouriño García

Description

Wikipedia Human Medicine Corpus is a bilingual (Spanish-English), single-label corpus of Wikipedia articles about human medicine. It comprises 2,143 documents written in English and 469 documents written in Spanish, each classified into one of the following 22 categories: Alternative medicine, Cardiology, Endocrinology, Forensics, Gastroenterology, Human genetics, Geriatrics, Gerontology, Gynecology, Hematology, Nephrology, Neurology, Obstetrics, Oncology, Ophthalmology, Orthopedic surgical procedures, Pathology, Pediatrics, Psychiatry, Rheumatology, Surgery and Urology.

Files

Steps to reproduce

We first selected the category “Human medicine”, listed on the “Health and fitness” Portal:Contents page (https://en.wikipedia.org/wiki/Portal:Contents/Health_and_fitness), which has 22 subcategories: Alternative medicine, Cardiology, Endocrinology, Forensics, Gastroenterology, Human genetics, Geriatrics, Gerontology, Gynecology, Hematology, Nephrology, Neurology, Obstetrics, Oncology, Ophthalmology, Orthopedic surgical procedures, Pathology, Pediatrics, Psychiatry, Rheumatology, Surgery and Urology.

To create the training set of the corpus, we performed the following steps. First, we selected the articles classified under each of the aforementioned categories. Next, we parsed the HTML code of each article to extract the title and the whole body. Finally, we labelled each article with the category to which it belongs.

To create the test set of the corpus, we performed the following steps for each article of the training set. First, we obtained the corresponding Spanish Wikipedia article by using the link provided in the English Wikipedia article, if one was available. Then, we parsed the HTML code to extract the title and the whole body of the Spanish article. Finally, we labelled each Spanish article with the category of its English counterpart.

As a result, we obtained a corpus formed by a training set of 2,143 Wikipedia articles written in English and a test set of 469 Wikipedia articles written in Spanish.
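The parsing and labelling steps above can be sketched roughly as follows. This is a minimal illustration and not the original extraction code: it assumes that the article title sits in an `<h1>` element, that the body text sits in `<p>` elements, and that the link to the Spanish article is an `<a>` element carrying `hreflang="es"`. The names `ArticleParser`, `parse_article` and `SAMPLE_HTML`, as well as the sample markup and its URL, are hypothetical.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <h1> title, the text of each <p>, and the first hreflang="es" link."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []
        self.spanish_url = None
        self._in_title = False
        self._current = None  # text fragments of the paragraph being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._in_title = True
        elif tag == "p":
            self._current = []
        elif tag == "a" and attrs.get("hreflang") == "es" and self.spanish_url is None:
            # Assumption: the Spanish counterpart is marked with hreflang="es".
            self.spanish_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False
        elif tag == "p" and self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.paragraphs.append(text)
            self._current = None

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._current is not None:
            self._current.append(data)


def parse_article(html, category):
    """Return title, body, label and Spanish link (or None) for one article."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {
        "title": "".join(parser.title_parts).strip(),
        "body": "\n".join(parser.paragraphs),
        "label": category,
        "spanish_url": parser.spanish_url,
    }


# Hypothetical markup standing in for a downloaded article page.
SAMPLE_HTML = """
<html><body>
<h1>Cardiology</h1>
<p>Cardiology is a branch of <b>medicine</b>.</p>
<p>It deals with the heart.</p>
<a hreflang="es" href="https://es.example.org/wiki/Cardiologia">Spanish</a>
</body></html>
"""

record = parse_article(SAMPLE_HTML, "Cardiology")
print(record["title"], "|", record["label"], "|", record["spanish_url"])
```

An article for which `spanish_url` is `None` would simply have no counterpart in the test set, which is consistent with the test set (469 articles) being smaller than the training set (2,143 articles).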

Institutions

Universidade de Vigo

Categories

Data Mining, Biological Classification

Licence