Spanish Biomedical Corpus
Description
Embeddings

This repository contains word embeddings generated from Spanish biomedical text corpora.

Corpus detail

The corpus was gathered from Spanish biomedical texts drawn from several multilingual biomedical sources:

IBECS (Spanish Bibliographical Index in Health Sciences): a corpus that collects scientific journals covering multiple fields in the health sciences. Contains titles and abstracts from 168,198 records in English and Spanish.
SciELO (Scientific Electronic Library Online): a corpus that gathers electronic publications of full-text articles from scientific journals of Latin America, South Africa, and Spain. Contains titles and abstracts from 161,710 records in English and Spanish.
PubMed: a free search engine used to access the MEDLINE database of the NLM (https://www.ncbi.nlm.nih.gov/pubmed/). Contains titles and abstracts from 127,619 records.
MedlinePlus: a corpus of health topics, drugs and supplements, laboratory test information, and medical encyclopedia texts. Contains 7,033 articles in English and Spanish.
UFAL Medical Corpus: a collection of parallel corpora of medical and general-domain texts.

All corpus data files can be found at the following link: http://temu.bsc.es/mespen/

Pre-trained Models

FastText

We used the FastText (Bojanowski et al., 2016) implementation to train our word embeddings on the preprocessed Spanish Biomedical Corpus (FastText-SBC). In addition, we trained a concept embedding model by replacing biomedical concepts in the Spanish Biomedical Corpus with their unique SNOMED-CT Spanish Edition identifiers (SNOMED-SBC). We used the PyMedTermino library (Lamy et al., 2015) for concept indexing, using full-text search and fuzzy search with a threshold.

Train Parameters

dimension = 300
epoch = 10, 20
min_count = 20
neg = 20
t = 6e-5
thread = 7
encoding = 'utf8'
min subword-ngram = 3
max subword-ngram = 6

Links to the embeddings

FastText-SBC, epoch 10
FastText-SBC, epoch 20
SNOMED-SBC
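As an illustration, the training parameters above can be mapped onto the keyword arguments of fasttext.train_unsupervised in the fasttext Python bindings. This is a minimal sketch and not the authors' Train_FastText code: the helper names are hypothetical, and the choice of the skipgram model is an assumption, since the description does not say whether skipgram or cbow was used. The encoding = 'utf8' entry is assumed to describe how the corpus file is read and written, because it is not a train_unsupervised argument.

```python
# Hyperparameters from the "Train Parameters" list, renamed to the keyword
# arguments of fasttext.train_unsupervised (min_count -> minCount, and the
# min/max subword n-gram lengths -> minn/maxn).
TRAIN_PARAMS = {
    "dim": 300,
    "minCount": 20,
    "neg": 20,
    "t": 6e-5,
    "thread": 7,
    "minn": 3,
    "maxn": 6,
}


def build_train_kwargs(epoch, model="skipgram"):
    """Return the keyword arguments for one training run.

    `model` is an assumption: the description does not state whether
    skipgram or cbow was used.
    """
    kwargs = dict(TRAIN_PARAMS)  # copy so the shared defaults stay untouched
    kwargs["epoch"] = epoch
    kwargs["model"] = model
    return kwargs


def train_embeddings(corpus_path, output_path, epoch):
    """Train one model on a preprocessed corpus file and save it as .bin."""
    import fasttext  # lazy import: needs `pip install fasttext`

    model = fasttext.train_unsupervised(
        input=str(corpus_path), **build_train_kwargs(epoch)
    )
    model.save_model(str(output_path))
    return model
```

Under these assumptions, calling train_embeddings("SBC-post.txt", "fasttext_sbc_epoch10.bin", epoch=10) and then again with epoch=20 would produce the two epoch variants; both file names are placeholders.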
Steps to reproduce
Corpus data files

The corpus was gathered from Spanish biomedical texts from the multilingual biomedical sources detailed in the Description: IBECS, SciELO, PubMed, MedlinePlus, and the UFAL Medical Corpus. All corpus data files can be found at the following link: http://temu.bsc.es/mespen/

Raw text files

SBC: raw biomedical text file of the Spanish Biomedical Corpus (SBC).
SBC-post: raw biomedical text file of the SBC after post-processing.

Requirements

FastText: a library for efficient learning of word representations and sentence classification. Command: pip install fasttext
spaCy: a free, open-source library for advanced Natural Language Processing (NLP) in Python. Command: pip install -U spacy

Code

TextCorpus_Preprocessing: preprocesses the SBC text file to obtain the SBC-post text file. Preprocessing consists of removing punctuation, lowercasing the text, trimming whitespace, and removing stopwords.
Train_FastText: generates word embeddings with the FastText implementation.
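The preprocessing steps named above (punctuation removal, lowercasing, trimming, stopword removal) could look roughly like the following. This is a sketch under stated assumptions, not the TextCorpus_Preprocessing code itself: the function names are hypothetical, and the short stopword set is illustrative only. With spaCy installed, a fuller Spanish list is available as spacy.lang.es.stop_words.STOP_WORDS and can be passed in instead.

```python
import string

# Small illustrative stopword set (an assumption, not the authors' list).
DEFAULT_STOPWORDS = frozenset({
    "el", "la", "los", "las", "de", "del", "en", "y", "a",
    "con", "un", "una", "que", "se", "por", "para",
})

# Map ASCII punctuation plus common Spanish marks to spaces, so that
# "full-text" becomes two tokens rather than one fused token.
_PUNCT = string.punctuation + "¿¡«»"
_PUNCT_TABLE = str.maketrans(_PUNCT, " " * len(_PUNCT))


def preprocess_line(line, stopwords=DEFAULT_STOPWORDS):
    """Lowercase, strip punctuation, trim whitespace, and drop stopwords."""
    text = line.lower().translate(_PUNCT_TABLE)
    tokens = text.split()  # split() also trims and collapses whitespace
    return " ".join(tok for tok in tokens if tok not in stopwords)


def preprocess_file(src_path, dst_path, stopwords=DEFAULT_STOPWORDS):
    """Write a cleaned copy of src_path to dst_path, skipping empty lines."""
    with open(src_path, encoding="utf8") as src, \
            open(dst_path, "w", encoding="utf8") as dst:
        for line in src:
            cleaned = preprocess_line(line, stopwords)
            if cleaned:
                dst.write(cleaned + "\n")
```

Replacing punctuation with spaces rather than deleting it is a design choice made here so that hyphenated or slash-separated terms do not merge into a single token.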