A bilingual corpus composed of biomedical abstracts written in English and Spanish, extracted from MEDLINE.
Steps to reproduce
• First, we selected the classification scheme, which consists of the general MeSH terms of the “diseases” group. Note that to create the CL-UVigoMED corpus we used the 2015 MeSH tree structure, in which the “diseases” group contains 26 categories instead of the 23 it contained when OHSUMED was created.
• We downloaded from MEDLINE all the article descriptions (HTML web pages) from 2011 to 2015 that were available in English and Spanish, and we classified them under the 26 available categories.
• For each article description downloaded:
– If the article description contained the information in both English and Spanish, we parsed the HTML code to extract the relevant information in both languages: title, abstract and the categories the article belongs to.
– If the article description contained the information only in English:
∗ We parsed the HTML code of the article description to extract the relevant English information.
∗ We accessed the website of the journal that hosted the article and parsed its HTML code to extract the relevant Spanish information. This step required writing several ad-hoc parsers, one for each journal publisher (Elsevier, SciELO, Ediciones Doyma, etc.).
• We stored in our database the title, abstract and categories of each downloaded document.
The dataset relies only on Spanish papers with an English abstract for efficiency reasons, since the final objective was to build a bilingual corpus in order to verify the feasibility of the proposed approach.
As a result, we obtained a training corpus of 12,832 English biomedical abstracts and a test corpus of 2,184 Spanish abstracts, each classified into one or more of the 26 available categories.
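The per-article branching in the steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used to build the corpus: the function `build_record`, the dictionary keys, the `fetch_journal_html` callable and the `parsers` registry are all hypothetical names, and the MEDLINE page is assumed to have already been parsed into a dictionary.

```python
from typing import Callable, Dict

# All names here are hypothetical; the original parsers are not part of
# this description.
JournalParser = Callable[[str], Dict[str, str]]


def build_record(
    medline: Dict,
    fetch_journal_html: Callable[[str], str],
    parsers: Dict[str, JournalParser],
) -> Dict:
    """Combine the English and Spanish fields of one article description."""
    record = {
        "title_en": medline["title_en"],
        "abstract_en": medline["abstract_en"],
        "categories": list(medline["categories"]),
    }
    if medline.get("abstract_es"):
        # Case 1: the MEDLINE description already carries both languages.
        record["title_es"] = medline["title_es"]
        record["abstract_es"] = medline["abstract_es"]
    else:
        # Case 2: English only, so fetch the journal page and apply the
        # publisher-specific parser to recover the Spanish fields.
        parser = parsers[medline["publisher"]]
        spanish = parser(fetch_journal_html(medline["journal_url"]))
        record["title_es"] = spanish["title"]
        record["abstract_es"] = spanish["abstract"]
    return record
```

Keeping one parser per publisher in a registry mirrors the need for several ad-hoc parsers: supporting another publisher would mean adding one entry rather than changing the branching logic.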
To carry out the single-label experiments, we created a subset of these corpora by removing every document that belongs to more than one category. The resulting training corpus contains 3,356 English documents and the resulting test corpus contains 624 Spanish documents, each classified into exactly one of the 26 available categories.
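A minimal sketch of this filtering step, assuming each document is held as a dictionary with a `categories` list (the field names and the toy records are illustrative, not taken from the corpus):

```python
from typing import Dict, List


def single_label_subset(documents: List[Dict]) -> List[Dict]:
    """Keep only the documents assigned to exactly one category."""
    return [doc for doc in documents if len(doc["categories"]) == 1]


# Toy records with made-up identifiers and category codes.
toy_corpus = [
    {"id": "doc-1", "categories": ["C04"]},
    {"id": "doc-2", "categories": ["C04", "C10"]},
    {"id": "doc-3", "categories": ["C14"]},
]

single_label = single_label_subset(toy_corpus)
print([doc["id"] for doc in single_label])
```

Applying the same filter to the training and test corpora separately keeps the language split unchanged.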