MedNorm: A Corpus and Embeddings for Cross-terminology Medical Concept Normalisation
====================================================================================

MedNorm is a corpus of 27,979 textual descriptions, each mapped to
both MedDRA and SNOMED-CT, sourced from five publicly available datasets across
the biomedical and social media domains.
The cross-terminology medical concept embeddings are 64-dimensional
vectors for UMLS, MedDRA and SNOMED-CT concepts that capture semantic
similarities between concepts from different medical terminologies.

------------------------------------------------------------------------------------
UTILISED DATASETS
------------------------------------------------------------------------------------
- CADEC: Karimi, Sarvnaz, et al. "Cadec: A corpus of adverse drug event annotations." 
  Journal of biomedical informatics 55 (2015): 73-81.
  https://doi.org/10.4225/08/570FB102BDAD2

- TwADR-L: Limsopatham, Nut, and Nigel Collier. "Normalising medical concepts in social media texts 
  by learning semantic representation." Proceedings of the 54th Annual Meeting of the Association 
  for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2016.
  https://doi.org/10.5281/zenodo.55013

- TwiMed: Alvaro, Nestor, Yusuke Miyao, and Nigel Collier. "TwiMed: Twitter and PubMed comparable
  corpus of drugs, diseases, symptoms, and their relations." JMIR public health and 
  surveillance 3.2 (2017): e24.
  https://doi.org/10.2196/publichealth.6396

- SMM4H: Sarker, Abeed, et al. "Data and systems for medication-related text 
  classification and concept normalization from Twitter: insights from the
  Social Media Mining for Health (SMM4H)-2017 shared task." 
  Journal of the American Medical Informatics Association 25.10 (2018): 1274-1283.
  https://doi.org/10.17632/rxwfb3tysd.1

- TAC 2017 (ADR Track): Demner-Fushman, Dina, et al. "A dataset of 200 structured 
  product labels annotated for adverse drug reactions." 
  Scientific data 5 (2018): 180001.
  https://bionlp.nlm.nih.gov/tac2017adversereactions/

------------------------------------------------------------------------------------
FILES
------------------------------------------------------------------------------------
  - mednorm_full.tsv: the corpus in tab-separated format
      columns:
          original_dataset - Name of the original dataset
          instance_id - Unique instance identifier
          phrase - Phrase (textual description)
          meddra_code - Original MedDRA code
          sct_id - Original SNOMED-CT identifier
          umls_cui - Original UMLS CUI
          mapped_meddra_codes - Mapped MedDRA codes (multi-label; multiple candidates)
          mapped_sct_ids - Mapped SNOMED-CT identifiers (multi-label; multiple candidates)
          single_mapped_meddra_codes - Single (best) mapped MedDRA code (single-label; after multi-label reduction)
          single_mapped_sct_ids - Single (best) mapped SNOMED-CT identifier (single-label; after multi-label reduction)

  - mednorm_raw_10n_40l_5w_64dim.bin: cross-terminology medical concept embeddings (word2vec binary format)
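
The following is a minimal sketch of how the two files listed above might be
read using only the Python standard library. The function names
read_mednorm_tsv and read_word2vec_binary are illustrative and are not part of
MedNorm. The sketch assumes that mednorm_full.tsv is UTF-8 encoded, begins with
a header row holding the column names listed above, and uses no quote escaping,
and that the .bin file follows the usual word2vec binary layout with
little-endian 32-bit floats. The separator used inside the multi-label columns
is not specified here, so those values are kept as raw strings. If gensim is
installed, KeyedVectors.load_word2vec_format(path, binary=True) is the more
common way to load the embeddings.

```python
import csv
import struct


def read_mednorm_tsv(path):
    """Read the corpus into a list of dicts keyed by the header row's column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        # Assumption: fields are not quote-escaped, so quotes are kept literally.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def read_word2vec_binary(path):
    """Parse a word2vec binary file into a dict mapping each key to a tuple of floats."""
    vectors = {}
    with open(path, "rb") as handle:
        # Header line: "<vocabulary size> <dimensions>\n" in ASCII.
        vocab_size, dim = (int(field) for field in handle.readline().split())
        for _ in range(vocab_size):
            # Each key is a byte string terminated by a single space.
            key = bytearray()
            while True:
                byte = handle.read(1)
                if not byte:
                    raise ValueError("unexpected end of file while reading a key")
                if byte == b" ":
                    break
                if byte != b"\n":  # skip the newline left after the previous vector
                    key.extend(byte)
            # The vector follows as `dim` consecutive 4-byte floats.
            raw = handle.read(4 * dim)
            if len(raw) != 4 * dim:
                raise ValueError("unexpected end of file while reading a vector")
            # Assumption: little-endian float32 values.
            vectors[key.decode("utf-8")] = struct.unpack("<%df" % dim, raw)
    return vectors
```

For example, read_mednorm_tsv("mednorm_full.tsv") returns one dict per corpus
row, so row["phrase"] and row["single_mapped_meddra_codes"] can be accessed by
column name. How concepts are keyed inside the embeddings file is not described
here, so inspect the keys of the returned dict before looking up a concept.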

------------------------------------------------------------------------------------
DATA HARMONIZATION PIPELINE (SOURCE CODE)
------------------------------------------------------------------------------------
The source code for the data harmonization pipeline is available at:
https://github.com/mbelousov/MedNorm-corpus