NeoMHCI Training and Evaluation Data

Published: 11 June 2024 | Version 1 | DOI: 10.17632/kmt8tx7gh6.1
Contributor:
Wei Qu

Description

All training and evaluation data used in the NeoMHCI study. All code is freely available at https://github.com/ZhuLab-Fudan/NeoMHCI.
------------------------------------------------------
Train Data:
1. EL_train.zip: Contains all eluted ligand data used for training the five-fold cross-validation model.
- EL2020_A.txt: Corresponds to the EL2020_A dataset mentioned in the article.
- EL2020_B.txt: Corresponds to the EL2020_B dataset mentioned in the article. Combining the data from these two files constitutes the EL2020_C dataset mentioned in the article.
2. NE_train.zip: Contains all neoepitope data used during the fine-tuning process for the immunogenicity prediction task.
- NE2023.txt: Corresponds to the NE2023 dataset mentioned in the article.
- NE2023_list.json: A candidate pool constructed from all wild-type sequences in NE2023.
- IN2023: Corresponds to the IN2023 dataset mentioned in the article, used as a validation set during the fine-tuning process.
------------------------------------------------------
Test Data:
3. EL_test.zip: Contains the test set for ligand presentation prediction, along with the prediction scores of NeoMHCI and other comparison methods.
- IM2020.csv: Corresponds to the IM2020 test set mentioned in the article.
- IS2020.csv: Corresponds to the IS2020 test set mentioned in the article.
4. NE_test.zip: Contains the neoepitope test set for immunogenicity prediction, along with the prediction scores of NeoMHCI and other comparison methods.
- BM2023.csv: Corresponds to the BM2023 test set mentioned in the article.
- PM2018_data.txt: Corresponds to the PM2018 test set mentioned in the article. It includes the following fields:
  - `mutation_id`: the mutation number.
  - `patient_id`: the patient number.
  - `epitope`: whether the mutation is immunogenic.
  - `tpm`: the gene expression level of the mutation.
  - `cell_line`: the renamed multi-allele combination of the patient, with the specific correspondence given in PM2018_allelelist.
  - `pepseq`: the specific sequence of the mutation.
  Each mutation is represented by all 8-11mer slices containing the mutation site, and the highest prediction value among all slices is taken as the prediction score for that mutation.
- PM2018_records.csv: Records the prediction scores of each method for every mutation with TPM>0.
- PM2018_allelelist: Records the multi-allele combinations expressed by each patient in PM2018.
------------------------------------------------------
Common:
- allelelist: Records the specific MHC-I molecule combinations corresponding to the names of the multi-allele combinations (cell line) used in the MA data.
- MHC_pseudo.dat: Records the 34-mer pseudo sequences of MHC-I molecules.
- eval.py: Evaluation script used to compile various metrics from the records of each test set.
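The PM2018 scoring rule described above (every 8-11mer slice that contains the mutation site, with the maximum over slices used as the mutation's score) can be illustrated with a minimal Python sketch. This is not taken from the NeoMHCI code base: the function names `slices_containing` and `mutation_score`, the 0-based `pos` argument, and the `score_fn` callback standing in for a predictor are all assumptions made for illustration, and the way the mutation site is actually encoded in `pepseq` is not specified here.

```python
from typing import Callable, List


def slices_containing(seq: str, pos: int,
                      min_len: int = 8, max_len: int = 11) -> List[str]:
    """Return every window of length min_len..max_len that covers index pos.

    `pos` is a 0-based index into `seq` (an assumption of this sketch).
    """
    if not 0 <= pos < len(seq):
        raise ValueError("pos must be a valid index into seq")
    peptides = []
    for k in range(min_len, max_len + 1):
        first = max(0, pos - k + 1)      # leftmost start that still covers pos
        last = min(pos, len(seq) - k)    # rightmost start that fits in seq
        for start in range(first, last + 1):
            peptides.append(seq[start:start + k])
    return peptides


def mutation_score(seq: str, pos: int,
                   score_fn: Callable[[str], float]) -> float:
    """Score a mutation as the maximum score over all slices covering it.

    `score_fn` is a hypothetical stand-in for any peptide-level predictor.
    """
    peptides = slices_containing(seq, pos)
    if not peptides:
        raise ValueError("sequence too short to yield an 8-11mer slice")
    return max(score_fn(p) for p in peptides)
```

With a 21-residue sequence and the mutation at the centre, the sketch yields 8 + 9 + 10 + 11 = 38 candidate slices; near a sequence end, fewer windows fit and the count is smaller.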

Institutions

Fudan University

Categories

Immunology, Bioinformatics, Cancer, Epitope, Major Histocompatibility Complex
