MedNorm: A Corpus and Embeddings for Cross-terminology Medical Concept Normalisation
====================================================================================

MedNorm is a corpus of 27,979 textual descriptions simultaneously mapped to both
MedDRA and SNOMED-CT, sourced from five publicly available datasets across
biomedical and social media domains.

The cross-terminology medical concept embeddings are 64-dimensional vectors for
UMLS, MedDRA and SNOMED-CT concepts that capture semantic similarities between
concepts from different medical terminologies.

------------------------------------------------------------------------------------
UTILISED DATASETS
------------------------------------------------------------------------------------
- CADEC:
    Karimi, Sarvnaz, et al. "Cadec: A corpus of adverse drug event annotations."
    Journal of biomedical informatics 55 (2015): 73-81.
    https://doi.org/10.4225/08/570FB102BDAD2

- TwADR-L:
    Limsopatham, Nut, and Nigel Collier. "Normalising medical concepts in social
    media texts by learning semantic representation." Proceedings of the 54th
    Annual Meeting of the Association for Computational Linguistics (Volume 1:
    Long Papers). Vol. 1. 2016.
    https://doi.org/10.5281/zenodo.55013

- TwiMed:
    Alvaro, Nestor, Yusuke Miyao, and Nigel Collier. "TwiMed: Twitter and PubMed
    comparable corpus of drugs, diseases, symptoms, and their relations." JMIR
    public health and surveillance 3.2 (2017): e24.
    https://doi.org/10.2196/publichealth.6396

- SMM4H:
    Sarker, Abeed, et al. "Data and systems for medication-related text
    classification and concept normalization from Twitter: insights from the
    Social Media Mining for Health (SMM4H)-2017 shared task." Journal of the
    American Medical Informatics Association 25.10 (2018): 1274-1283.
    https://doi.org/10.17632/rxwfb3tysd.1

- TAC 2017 (ADR Track):
    Demner-Fushman, Dina, et al. "A dataset of 200 structured product labels
    annotated for adverse drug reactions." Scientific data 5 (2018): 180001.
    https://bionlp.nlm.nih.gov/tac2017adversereactions/

------------------------------------------------------------------------------------
FILES
------------------------------------------------------------------------------------
- mednorm_full.tsv: the corpus in tab-separated format
    columns:
        original_dataset            - Name of the original dataset
        instance_id                 - Unique instance identifier
        phrase                      - Phrase (textual description)
        meddra_code                 - Original MedDRA code
        sct_id                      - Original SNOMED-CT identifier
        umls_cui                    - Original UMLS CUI
        mapped_meddra_codes         - Mapped MedDRA codes (multi-label; multiple candidates)
        mapped_sct_ids              - Mapped SNOMED-CT identifiers (multi-label; multiple candidates)
        single_mapped_meddra_codes  - Single (best) mapped MedDRA code (single-label; after multi-label reduction)
        single_mapped_sct_ids       - Single (best) mapped SNOMED-CT identifier (single-label; after multi-label reduction)

- mednorm_raw_10n_40l_5w_64dim.bin: cross-terminology medical concept embeddings
    (word2vec binary format)

------------------------------------------------------------------------------------
DATA HARMONIZATION PIPELINE (SOURCE CODE)
------------------------------------------------------------------------------------
The source code for the data harmonization pipeline is available here:
https://github.com/mbelousov/MedNorm-corpus
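
As an illustration of how the columns of mednorm_full.tsv might be consumed, the sketch below reads tab-separated rows with Python's standard csv module and turns the two multi-label columns into lists. It rests on assumptions this README does not confirm: that the file begins with a header row carrying the column names listed under FILES, and that multiple candidates in a cell are joined by a comma (the `multi_sep` parameter is hypothetical and should be adjusted to the actual file). The sample rows are made up for the demonstration and are not corpus data.

```python
import csv
import io

# Column names taken from the FILES section of this README.
MULTI_LABEL_COLUMNS = ("mapped_meddra_codes", "mapped_sct_ids")


def read_mednorm(handle, multi_sep=","):
    """Yield one dict per row; multi-label columns are converted to lists.

    ASSUMPTIONS (not stated in the README): the first row is a header and
    multi-label cells are joined with `multi_sep`.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for col in MULTI_LABEL_COLUMNS:
            cell = row.get(col) or ""
            row[col] = [v.strip() for v in cell.split(multi_sep) if v.strip()]
        yield row


# Made-up placeholder rows (NOT real corpus content), only to exercise the reader.
SAMPLE = (
    "original_dataset\tinstance_id\tphrase\tmapped_meddra_codes\tmapped_sct_ids\n"
    "DEMO\tdemo-1\texample phrase\tM1,M2\tS1\n"
)

rows = list(read_mednorm(io.StringIO(SAMPLE)))
print(rows[0]["mapped_meddra_codes"])  # ['M1', 'M2']
```

For the real file, the same function could be called as `read_mednorm(open("mednorm_full.tsv", encoding="utf-8", newline=""))`; the UTF-8 encoding is likewise an assumption.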
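
The embeddings file is described as being in word2vec binary format. The following is a minimal sketch of a reader for that layout (a text header "vocab_size dim", then for each entry a space-terminated token followed by `dim` 32-bit floats), written with the standard library only. It assumes little-endian floats and UTF-8 tokens; how concept identifiers are spelled in the vocabulary is not documented here, so the tokens "A" and "B" below are placeholders. Cosine similarity is used as one common way to compare such vectors, not necessarily the measure used by the authors.

```python
import io
import math
import struct


def read_word2vec_binary(stream):
    """Read a word2vec binary stream into a {token: tuple_of_floats} dict.

    ASSUMPTIONS: little-endian float32 values and UTF-8 encoded tokens.
    """
    header = stream.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        token = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of embeddings file")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous entry
                token.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("truncated vector")
        vectors[token.decode("utf-8")] = struct.unpack(f"<{dim}f", raw)
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Tiny in-memory file with two placeholder tokens and 2-dimensional vectors.
demo = (
    b"2 2\n"
    + b"A " + struct.pack("<2f", 1.0, 0.0) + b"\n"
    + b"B " + struct.pack("<2f", 0.0, 1.0) + b"\n"
)
vectors = read_word2vec_binary(io.BytesIO(demo))
print(cosine(vectors["A"], vectors["B"]))  # 0.0
```

With the real file, the stream would come from `open("mednorm_raw_10n_40l_5w_64dim.bin", "rb")`. If gensim is installed, `KeyedVectors.load_word2vec_format(path, binary=True)` is the usual library route for the same format.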