Wikipedia Human Medicine Corpus

Name: Wikipedia Human Medicine Corpus
Creator: Marcos Mouriño García
Published: 2017-11-04T17:49:11.041Z
Keywords: Data Mining, Biological Classification

Mouriño García, Marcos; Pérez-Rodríguez, Roberto; Anido-Rifón, Luis Eulogio

doi:10.17632/sp9mcx5594.2

Published: 4 November 2017 | Version 2 | DOI: 10.17632/sp9mcx5594.2

Description

Wikipedia Human Medicine Corpus is a bilingual (Spanish-English), single-label corpus of Wikipedia articles about human medicine. It comprises 2,143 documents written in English and 469 documents written in Spanish, each classified into one of the following 22 categories: Alternative medicine, Cardiology, Endocrinology, Forensics, Gastroenterology, Human genetics, Geriatrics, Gerontology, Gynecology, Hematology, Nephrology, Neurology, Obstetrics, Oncology, Ophthalmology, Orthopedic surgical procedures, Pathology, Pediatrics, Psychiatry, Rheumatology, Surgery and Urology.

Steps to reproduce

We first selected the category “Human medicine”, listed under “Health and fitness” in Portal:Contents (https://en.wikipedia.org/wiki/Portal:Contents/Health_and_fitness), which has 22 subcategories: Alternative medicine, Cardiology, Endocrinology, Forensics, Gastroenterology, Human genetics, Geriatrics, Gerontology, Gynecology, Hematology, Nephrology, Neurology, Obstetrics, Oncology, Ophthalmology, Orthopedic surgical procedures, Pathology, Pediatrics, Psychiatry, Rheumatology, Surgery and Urology.

To create the training set of the corpus, we performed the following steps. First, we selected the articles classified under each of the aforementioned categories. Next, we parsed the HTML code of each article to extract the title and the whole body. Finally, we labelled each article with the category to which it belongs.

To create the test set of the Wikipedia Human Medicine Corpus, we performed the following steps for each article of the training set. First, we obtained the corresponding Spanish Wikipedia article by following the link provided in the English Wikipedia article, if one was available. Then, we parsed the HTML code to extract the title and the whole body of the Spanish article. Finally, we labelled each Spanish article with the category of its English counterpart.

As a result, we obtained a corpus formed by a training set of 2,143 Wikipedia articles written in English and a test set of 469 Wikipedia articles written in Spanish.
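The parsing and labelling steps described above can be illustrated with a minimal sketch. It is not the authors' code, which is not part of this record. It uses only the Python standard library and assumes, purely for illustration, that the article title sits in an `<h1>` element, that the body is the text of the `<p>` elements, and that the link to the Spanish article is an `<a>` element carrying `hreflang="es"`. The names `ArticleParser` and `parse_article` are hypothetical.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <h1> title, the <p> text, and the first hreflang="es" link.

    The markup conventions relied on here are assumptions made for this
    sketch, not a description of how the corpus was actually built.
    """

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self.spanish_url = None
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._in_title = True
        elif tag == "p":
            self._in_paragraph = True
        elif tag == "a" and attrs.get("hreflang") == "es" and self.spanish_url is None:
            # Assumed form of the link to the Spanish version of the article.
            self.spanish_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False
            self.body_parts.append(" ")  # keep consecutive paragraphs apart

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_paragraph:
            self.body_parts.append(data)


def parse_article(html, category):
    """Extract title and body from one article's HTML and attach its label."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {
        "title": "".join(parser.title_parts).strip(),
        # Collapse runs of whitespace left over from the markup.
        "body": " ".join("".join(parser.body_parts).split()),
        "label": category,
        "spanish_url": parser.spanish_url,
    }


# Tiny hand-written page standing in for a downloaded article.
sample = """
<html><body>
<h1 id="firstHeading">Cardiology</h1>
<p>Cardiology is a branch of <a href="/wiki/Medicine">medicine</a>.</p>
<p>It deals with the heart.</p>
<a hreflang="es" href="https://es.wikipedia.org/wiki/Cardiolog%C3%ADa">Español</a>
</body></html>
"""
record = parse_article(sample, "Cardiology")
```

Under the same assumptions, the test set would be obtained by downloading the page at `record["spanish_url"]`, when it is not `None`, and passing it to `parse_article` with the label of the English article.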

Institutions

Universidade de Vigo
