Reference Dataset for Text Mining Type 2 Diabetes Candidate Genes

Published: 22 January 2024 | Version 2 | DOI: 10.17632/23n5xfjhyt.2
Sushrutha Raj, Vindhya Namdeo, Sushmitha Raj, Alok Srivastava


This disease-gene association dataset contains evidence (reference) sentences that carry disease-gene association information. This information is classified into four classes: Yes, No, Ambiguous and X, corresponding to Positive, Negative, Ambiguous and Not related disease-gene associations, respectively. The data serves as reference data for training text mining-based classifiers of biological literature. Such classifiers can be used to predict the class of published literature, not just for Type 2 diabetes (T2D), and can be extended to a wide range of diseases and their complications. The positively associated genes compiled from these predictions can then be used for in-depth system-level analysis of T2D.
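As a rough illustration of the final step described above, the sketch below maps the four class labels to their meanings and collects the genes that have at least one sentence predicted as Yes. The record fields (`gene`, `sentence`, `label`), the function name `positive_genes`, and the example rows are hypothetical placeholders, not the dataset's actual schema or contents.

```python
from collections import defaultdict

# Class labels and their meanings, as stated in the dataset description.
LABEL_MEANING = {
    "Yes": "Positive",
    "No": "Negative",
    "Ambiguous": "Ambiguous",
    "X": "Not related",
}


def positive_genes(records):
    """Count, per gene, the sentences labeled 'Yes' (positive association).

    `records` is an iterable of dicts with hypothetical keys
    'gene' and 'label'; the real file layout may differ.
    """
    counts = defaultdict(int)
    for rec in records:
        label = rec["label"]
        if label not in LABEL_MEANING:
            raise ValueError(f"unknown label: {label!r}")
        if label == "Yes":
            counts[rec["gene"]] += 1
    return dict(counts)


# Placeholder rows for illustration only; not taken from the dataset.
records = [
    {"gene": "GENE_A", "sentence": "placeholder sentence 1", "label": "Yes"},
    {"gene": "GENE_A", "sentence": "placeholder sentence 2", "label": "Yes"},
    {"gene": "GENE_B", "sentence": "placeholder sentence 3", "label": "No"},
    {"gene": "GENE_C", "sentence": "placeholder sentence 4", "label": "Ambiguous"},
    {"gene": "GENE_D", "sentence": "placeholder sentence 5", "label": "X"},
]

print(positive_genes(records))
```

In practice the labels would come from a trained classifier's predictions rather than hand-written rows, and the per-gene counts could serve as a simple measure of supporting evidence before any system-level analysis.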



LV Prasad Eye Institute, Amity University Haryana


Natural Language Processing, Machine Learning, Text Mining, Deep Learning, Data Validation