ML-UVigoMED

Published: 30-12-2016 | Version 1 | DOI: 10.17632/xt8n7ybhhn.1
Contributors:
Marcos Mouriño García,
Roberto Pérez-Rodríguez,
Luis Eulogio Anido-Rifón

Description

Multi-Lingual UVigoMED (ML-UVigoMED) is a multilingual single-label corpus of biomedical documents about human medicine extracted from Wikipedia. It comprises 11,126 documents in English and 12,521 documents written in Spanish, French, Slovenian, German, Italian, Galician, Icelandic and Romanian. Each document is classified into one of the following 22 categories: Alternative medicine, Cardiology, Endocrinology, Forensics, Gastroenterology, Human genetics, Geriatrics, Gerontology, Gynecology, Hematology, Nephrology, Neurology, Obstetrics, Oncology, Ophthalmology, Orthopedical surgical procedures, Pathology, Pediatrics, Psychiatry, Rheumatology, Surgery and Urology.

Steps to reproduce

We first selected the category Human medicine of Wikipedia, contained under the portal Health and fitness (https://en.wikipedia.org/wiki/Portal:Health_and_fitness). The Human medicine category is, in turn, divided into 22 categories: Alternative medicine, Cardiology, Endocrinology, Forensics, Gastroenterology, Human genetics, Geriatrics, Gerontology, Gynecology, Hematology, Nephrology, Neurology, Obstetrics, Oncology, Ophthalmology, Orthopedical surgical procedures, Pathology, Pediatrics, Psychiatry, Rheumatology, Surgery and Urology.

To create the training sequence of the corpus, composed of English documents, we selected the Wikipedia articles classified under each of these categories and parsed their HTML code to extract the textual information. To create the test sequence, composed of Spanish, French, Slovenian, German, Italian, Galician, Icelandic and Romanian documents, we followed the interlanguage links of the training sequence articles to each of these languages.

As a result, we obtained a corpus composed of 11,126 English training elements and 12,521 test elements written in several languages, as indicated in the following table:

Sequence | Language  | #documents
---------+-----------+-----------
Training | English   | 11,126
---------+-----------+-----------
Test     | Spanish   | 2,530
         | French    | 2,753
         | Slovenian | 463
         | Italian   | 2,166
         | German    | 3,147
         | Galician  | 701
         | Icelandic | 217
         | Romanian  | 544
         | Total     | 12,521
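The two collection steps described above (extracting text from an article's HTML and following its interlanguage links) could be sketched roughly as follows. This is a minimal illustration, not the authors' original tooling. It assumes the MediaWiki Action API `prop=langlinks` query in its default JSON format, where each link title is stored under the `"*"` key. The names `TextExtractor`, `html_to_text`, `langlinks_url` and `select_langlinks` are hypothetical, and the language codes are the ISO 639-1 codes of the eight test languages.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# ISO 639-1 codes of the eight test languages (assumed mapping:
# Spanish, French, Slovenian, German, Italian, Galician, Icelandic, Romanian).
TEST_LANGUAGES = {"es", "fr", "sl", "de", "it", "gl", "is", "ro"}

API_URL = "https://en.wikipedia.org/w/api.php"


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML fragment as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def langlinks_url(title):
    """Build an API URL asking for the interlanguage links of one article."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def select_langlinks(api_response, wanted=TEST_LANGUAGES):
    """Map language code -> article title for the wanted languages only."""
    links = {}
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        for link in page.get("langlinks", []):
            if link["lang"] in wanted:
                links[link["lang"]] = link["*"]
    return links
```

In such a setup, each test document would inherit the category label of the English article it was reached from, and the decoded JSON response for `langlinks_url(title)` would be passed to `select_langlinks`. Fetching the pages and handling API continuation are left out of the sketch.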