Wikipedia Corpus

Published: 04-11-2017 | Version 2 | DOI: 10.17632/nsm3ftcjf6.2
Contributors:
Marcos Mouriño García,
Roberto Pérez-Rodríguez,
Luis Eulogio Anido-Rifón

Description

Wikipedia Corpus is a bilingual (Spanish-English), single-label corpus composed of 3,019 documents on general topics written in English and 832 documents written in Spanish, classified under three semantically distant categories: Culture and the arts, Geography and places, and Mathematics and logic.

Steps to reproduce

Wikipedia has a set of pages called Portals that serve as “Main Pages” for specific topics or areas. Among the available portals, Portal:Contents provides a navigation system to help users browse the encyclopedia's content. It is organised into 12 categories: General reference, Culture and the arts, Geography and places, Health and fitness, History and events, Mathematics and logic, Natural and physical sciences, People and self, Philosophy and thinking, Religion and belief systems, Society and social sciences, and Technology and applied sciences (https://en.wikipedia.org/wiki/Portal:Contents/Categories). We built the corpus on these categories because they are especially useful when users do not know exactly what they are looking for, or when they want to see everything on a particular subject. Among them, we selected three semantically distant categories: Culture and the arts, Geography and places, and Mathematics and logic.

To create the training set of the corpus, we performed the following steps. First, we selected approximately 1,000 articles for each of the three categories. Next, to extract the relevant information from each article, we parsed the HTML code and selected the title and the whole body of the article. Finally, we labelled each article with the Portal:Contents category to which it belongs.

Each Wikipedia article has a set of links to the equivalent article in the other languages in which it is available. To create the test set of the corpus, we performed the following steps for each article of the training set. First, we obtained the corresponding Spanish Wikipedia article by using the link provided in the English Wikipedia article, if it was available. Next, we parsed the HTML code to extract the title and the whole body of the Spanish article. Finally, we labelled each article with the Portal:Contents category to which it belongs.

As a result, we obtained a corpus formed by a training set that comprises 3,019 Wikipedia articles written in English and a test set composed of 832 Wikipedia articles written in Spanish.
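
The extraction steps above can be sketched in Python as follows. This is a minimal illustration, not the contributors' original tooling, which the description does not specify. It assumes that the rendered article page keeps its title in an h1 element with id "firstHeading", its body in a div with id "mw-content-text", and its Spanish interlanguage link in an anchor with hreflang="es"; these markup details, the names ArticleParser and extract_article, and the sample page are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect depth counting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ArticleParser(HTMLParser):
    """Collects the title, the body text and the Spanish interlanguage link.

    Assumes well-nested HTML; unclosed non-void tags would break the
    depth counter used to detect the end of the body container.
    """

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self.spanish_url = None
        self._in_title = False
        self._body_depth = 0  # > 0 while inside the body container

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and attrs.get("id") == "firstHeading":
            self._in_title = True
        if self._body_depth > 0:
            if tag not in VOID_TAGS:
                self._body_depth += 1
        elif tag == "div" and attrs.get("id") == "mw-content-text":
            self._body_depth = 1
        # Keep only the first Spanish link found (assumed markup).
        if tag == "a" and attrs.get("hreflang") == "es" and self.spanish_url is None:
            self.spanish_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False
        if self._body_depth > 0 and tag not in VOID_TAGS:
            self._body_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        if self._body_depth > 0:
            self.body_parts.append(data)


def extract_article(html, label):
    """Return title, whitespace-normalised body, label and Spanish URL."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {
        "title": "".join(parser.title_parts).strip(),
        "body": " ".join(" ".join(parser.body_parts).split()),
        "label": label,
        "spanish_url": parser.spanish_url,  # None if no Spanish version is linked
    }


# Hypothetical, heavily simplified page used only to exercise the sketch.
SAMPLE_HTML = """
<html><body>
<h1 id="firstHeading">Prime number</h1>
<div id="mw-content-text"><p>A prime number is a <b>natural number</b><br>greater than 1.</p></div>
<ul><li><a class="interlanguage-link-target" hreflang="es" href="https://es.wikipedia.org/wiki/N%C3%BAmero_primo">Español</a></li></ul>
</body></html>
"""

record = extract_article(SAMPLE_HTML, "Mathematics and logic")
print(record["title"], "|", record["spanish_url"])
```

Running the same function on the page behind spanish_url, with the same label, would give the matching test-set record. English articles for which spanish_url is None have no Spanish counterpart, which is consistent with the test set being smaller than the training set.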