CORD-19 SciSpaCy Entity Dataset

Published: 23-10-2020 | Version 2 | DOI: 10.17632/gk9njn3pth.2
Sujit Pal


Dataset of biomedical entities extracted from the CORD-19 dataset (2020-08-28 and 2020-09-28) using NER models (trained against CRAFT, JNLPBA, BC5CDR, and BioNLP) and NERL models (UMLS, MeSH, GO, HPO, and RxNorm) from the SciSpaCy project, provided as structured Parquet files. The dataset may be useful for downstream tasks such as entity linking and relationship extraction. The work was carried out using Dask on the Saturn Cloud platform and was a joint effort between Elsevier Labs and Saturn Cloud. Dataset available at: s3://els-labs-website/cord19-scispacy-entities/


Steps to reproduce

Jupyter Notebooks to reproduce the dataset are available on:, please follow the instructions in the file. The dataset is available as Parquet files (the requester pays network charges for downloads) at: s3://els-saturn-scispacy/cord19-scispacy-entities/
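
Because the bucket is requester-pays, a plain anonymous read is likely to be rejected. The following is a minimal sketch of one way to open the Parquet files lazily with Dask, assuming `dask[dataframe]`, `s3fs`, and `pyarrow` are installed and AWS credentials are configured. The helper names (`requester_pays_options`, `parquet_url`, `load_entities`) and the `subfolder` argument are hypothetical; the folder layout below the bucket prefix is not described here, so no subfolder names are assumed.

```python
from typing import Dict

# Bucket prefix as given in the text above. The layout below this prefix is
# not documented here, so any subfolder passed in is the caller's assumption.
DATASET_ROOT = "s3://els-saturn-scispacy/cord19-scispacy-entities/"


def requester_pays_options() -> Dict[str, bool]:
    """Storage options forwarded to s3fs so the reader accepts the transfer charges."""
    return {"requester_pays": True}


def parquet_url(subfolder: str = "") -> str:
    """Join the dataset root with an optional (hypothetical) subfolder name."""
    root = DATASET_ROOT.rstrip("/")
    sub = subfolder.strip("/")
    return f"{root}/{sub}" if sub else root


def load_entities(subfolder: str = ""):
    """Open the Parquet files as a lazy Dask DataFrame.

    Nothing is downloaded until a computation such as .head() is triggered.
    """
    # Imported here so the two helpers above can be used without Dask installed.
    import dask.dataframe as dd

    return dd.read_parquet(
        parquet_url(subfolder),
        storage_options=requester_pays_options(),
    )
```

Calling `load_entities()` returns a lazy DataFrame; inspecting `.columns` first is a cheap way to discover the schema before running anything that incurs download charges.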