Ames Mutagenicity Dataset for Multi-Task Learning
The dataset contains 5536 molecular compounds, each represented by its SMILES string and 1360 molecular descriptors calculated with Mordred. Each compound carries a label (1: mutagenic / 0: non-mutagenic) for each of five Salmonella Typhimurium strains (TA98, TA100, TA102, TA1535, TA1537), plus a general label (Overall) that corresponds to the ground-truth consensus label used for evaluating the final Ames mutagenicity prediction. A third label value, -1 (undefined), also appears in the strain columns. Compounds with an undefined label in one or more strains, but a positive or negative label in the remaining strains, can still be considered in the multi-task modeling process.

The compounds were originally compiled by the Istituto Superiore di Sanità (https://www.iss.it/isstox) and result from an exhaustive pre-processing stage consisting of filtering, sanitization, and canonicalization steps. The dataset is intended as a source for QSAR modeling of Ames mutagenicity.

The .csv file contains the information needed for predictive modeling tasks. The columns "TA98", "TA100", "TA102", "TA1535", and "TA1537" hold the per-strain labels, while the column "Overall" holds the consensus label used to evaluate the final prediction of mutagenicity. The "Partition" column takes three values, "Train", "Internal", and "External", identifying the data partition to which each compound was assigned during our experimentation process. All remaining columns correspond to molecular descriptors calculated using Mordred.
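
The column layout described above can be used roughly as follows. This is a minimal sketch, not the procedure used in the original experiments. The file name "ames_multitask.csv" is a hypothetical placeholder, the helper names split_partition and masked_bce are invented for illustration, and the code assumes that the SMILES column is the only non-numeric column besides "Partition". It selects one partition, separates descriptors from the five strain labels, and builds a mask so that undefined (-1) labels do not contribute to a multi-task loss.

```python
import numpy as np
import pandas as pd

STRAINS = ["TA98", "TA100", "TA102", "TA1535", "TA1537"]
META = STRAINS + ["Overall", "Partition"]


def split_partition(df, partition):
    """Return (X, Y, mask) for one value of the "Partition" column."""
    part = df[df["Partition"] == partition]
    # Assumption: every numeric column that is not a label is a descriptor;
    # the SMILES column is non-numeric and is dropped by select_dtypes.
    X = part.drop(columns=META).select_dtypes(include="number")
    Y = part[STRAINS].to_numpy()
    mask = Y != -1  # True where the strain label is defined (0 or 1)
    return X, Y, mask


def masked_bce(y_true, y_prob, mask, eps=1e-7):
    """Binary cross-entropy averaged over defined labels only."""
    p = np.clip(y_prob, eps, 1 - eps)
    y = np.where(mask, y_true, 0)  # placeholder value for undefined entries
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float((loss * mask).sum() / max(mask.sum(), 1))


# Usage (hypothetical file name):
# df = pd.read_csv("ames_multitask.csv")
# X_train, Y_train, mask_train = split_partition(df, "Train")
```

Masking the loss, rather than dropping rows, keeps compounds that are labeled for only some strains in the training set, which is the situation the -1 label is meant to support.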