Open Data Created by Elsevier Research and Development teams
Description of this collection
A collection of open datasets created by groups across Elsevier in collaboration with our research partners. In line with FAIR data practices, these data are shared openly to foster research and promote reproducibility.
Our current research in data science spans natural language processing, fact extraction and entity identification. We also have projects studying research itself through the lenses of gender, researcher mobility, FAIR data use, peer review and the impact of sustainable development goals.
A collection of datasets published on Mendeley Data that recognize researchers or research groups who make their research data available for additional research and do so in a way that exemplifies the FAIR data principles – Findable, Accessible, Interoperable, Reusable.
Datasets in this collection have been selected by Elsevier's independent Research Data Management Advisory Board.
Read Elsevier's community blog, Elsevier Connect, to discover interviews with the researchers who published these datasets:
* Prof. Zhiyong Shao, Fudan University, China: https://www.elsevier.com/connect/spotlighting-fair-data-and-the-researchers-behind-it
* Prof. Ricardo Sánchez-Murillo, UNA, Costa Rica: https://www.elsevier.com/connect/we-dont-want-data-sitting-in-our-desk-says-tropical-cyclone-researcher
* Dr. Vanessa Susini, University of Pisa, Italy: https://www.elsevier.com/connect/for-mendeley-data-winner-sharing-fair-data-helps-researchers-learn-from-each-other
Contributors: Daniel Kershaw, Rob Koeling
This is a corpus of 40,001 open access (OA) CC-BY articles from across Elsevier’s journals, representing the first cross-discipline dataset of research articles at this scale released to support NLP and ML research.
This dataset was released to support the development of ML and NLP models targeting science articles from across all research domains. While the release builds on other datasets designed for specific domains and tasks, it will allow for similar datasets to be derived or for the development of models which can be applied and tested across domains.
Contributors: Karin Verspoor, Dat Quoc Nguyen, Saber A. Akhondi, Christian Druckenbrodt, Camilo Thorne et al.
The discovery of new chemical compounds and their synthesis processes is of great importance to the chemical industry. Patent documents contain critical and timely information about newly discovered chemical compounds, providing a rich resource for chemical research in both academia and industry. Chemical patents are often the initial venues where a new chemical compound is disclosed: only a small proportion of chemical compounds are ever published in journals, and these publications can be delayed by up to three years after the patent disclosure. In addition, chemical patent documents usually contain unique information, such as reaction steps, experimental conditions for compound synthesis, and mode of action. These details are crucial for understanding compound prior art, and provide a means for novelty checking and validation. Because of the high volume of chemical patents, approaches that enable automatic information extraction from them are in demand. To support the development of natural language processing methods for large-scale mining of chemical information from patent texts, this corpus provides chemical patent snippets with annotated entities and reaction steps.
Contributors: Bamini Jayabalasingham, Thomas Collins, Lili Kuiper, Jin Zhang, Guillaume Roberge
Data underlying the analyses in chapters 1, 2, 3, and 5 of the report "The researcher journey through a gender lens" (www.elsevier.com/connect/gender-report), which examines the researcher journey through a gender lens. Data on authors, grantees and patent applicants pertain to researchers active during two time periods, across 16 geographies, 26 subject areas, and 11 sub-fields of medicine. These data are provided at an aggregated level.
Contributors: Elena Zudilova-Seinstra, Alberto Zigoni, Wouter Haak
We conducted an analysis to confirm our observation that only a very small percentage of public research data is hosted in Institutional Data Repositories, while the vast majority is published in open domain-specific and generalist data repositories.
For this analysis, we selected 11 institutions, many of which have been our evaluation partners. For each institution, we counted the number of datasets published in its Institutional Data Repository (IDR) and tracked the number of public research datasets hosted in external data repositories via the Data Monitor API. External tracking was based on a corpus of more than 14 million data records matched against each institution's SciVal ID. One institution did not have an IDR.
We found that 10 of the 11 institutions had most of their public research data hosted outside the institution. Here, "research data" means not only datasets but a broader set of outputs that includes, for example, software.
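The per-institution comparison described above can be sketched as follows. All institution names and counts in this snippet are hypothetical placeholders for illustration, not the study's data, and the threshold of 0.5 simply encodes "most data hosted externally":

```python
# Illustrative sketch of the IDR-vs-external comparison described above.
# All names and counts below are hypothetical, not the study's data.

def external_share(idr_count: int, external_count: int) -> float:
    """Fraction of an institution's tracked public datasets hosted
    outside its Institutional Data Repository (IDR)."""
    total = idr_count + external_count
    return external_count / total if total else 0.0

# Hypothetical institutions: (datasets in IDR, datasets tracked externally)
institutions = {
    "Institution A": (120, 2400),
    "Institution B": (300, 5100),
    "Institution C": (900, 650),   # a case where the IDR dominates
}

# Institutions where more than half of tracked data sits outside the IDR
mostly_external = [
    name for name, (idr, ext) in institutions.items()
    if external_share(idr, ext) > 0.5
]
print(mostly_external)
```

With these placeholder counts, two of the three hypothetical institutions would be flagged as hosting most of their data externally, mirroring the shape of the finding above.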
We would be happy to expand the analysis by adding more institutions upon request.
Note: This is version 2 of the previously published dataset. The number of datasets published and tracked in the Monash Institutional Data Repository has been updated based on information provided by the Monash Library. The count for the NTU Institutional Data Repository now includes datasets only; Dataverses were excluded to avoid double counting.
Contributors: Finlay Maclean, Helena Deus, Antony Scerri
This dataset was extracted from Elsevier Pathway Studio, a tool that helps scientists analyze experimental data to answer biologically meaningful questions. The dataset consists of biological relationships between diseases (MERS and SARS), proteins and molecules. The relationships are of various types, including Regulation, Target and Molecular Transport. A mapping of each relationship name to its description can be found on this support page: https://service.elsevier.com/app/answers/detail/a_id/3014/supporthub/pathway/
These relationships were extracted from life sciences and biomedical articles from various publishers. We use taxonomies, curated and maintained by subject matter experts, to extract the right terms from text and map them to the correct identifiers. Subject matter experts also helped us create the rules and information extraction patterns that optimize the extraction of relationships from text.
Finally, the PubMed identifiers of the articles from which the relationships were extracted are also part of the dataset.
The .cypher and .json files can be imported into the Neo4j graph database; the .csv files can be used to import the data into other systems.
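A first inspection of the .csv exports might look like the sketch below. The column names ("source", "relation", "target") and the embedded rows are assumptions made for illustration, since the actual file headers are not described here; check the real files before adapting this:

```python
# Minimal sketch of tallying relationship types from a relationship .csv.
# The column names and the mock rows below are illustrative assumptions,
# not the dataset's actual schema or content.
import csv
import io
from collections import Counter

# Mock stand-in for one of the exported .csv files:
mock_csv = io.StringIO(
    "source,relation,target\n"
    "SARS-CoV,Regulation,IL6\n"
    "MERS-CoV,Target,DPP4\n"
    "SARS-CoV,MolecularTransport,ACE2\n"
    "SARS-CoV,Regulation,TNF\n"
)

# Count how often each relationship type occurs
reader = csv.DictReader(mock_csv)
relation_counts = Counter(row["relation"] for row in reader)
print(relation_counts.most_common())
```

To run this against a real export, replace `mock_csv` with `open("relations.csv", newline="")` (a hypothetical filename) and adjust the column name to match the file's actual header.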
We selected Public health, and societal and psychological impacts datasets indexed by the Mendeley Data Search engine on the 2019-present COVID-19 / Coronavirus pandemic. The aim was to make it easier to find potentially relevant datasets for this specific topic.
We selected Epidemiology & infectious modelling datasets that are indexed by the Mendeley Data Search engine on the 2019-present COVID-19 / Coronavirus pandemic. The aim was to make it easier to find potentially relevant datasets for this specific topic.