Ames Mutagenicity Dataset for Multi-Task Learning

Name: Ames Mutagenicity Dataset for Multi-Task Learning
Creator: María Jimena Martínez
Published: 2022-08-05T16:03:19.931Z
Keywords: Mutagenicity, Quantitative Structure-Activity Relationship, Deep Learning

Martínez, María Jimena; Sabando, María Virginia; Soto, Axel; Roca, Carlos; Requena-Triguero, Carlos; Campillo, Nuria; Páez, Juan; Ponzoni, Ignacio

doi:10.17632/ktc6gbfsbh.2

Published: 5 August 2022 | Version 2 | DOI: 10.17632/ktc6gbfsbh.2

Description

The dataset contains 5536 molecular compounds, each represented by its SMILES string and by 1360 molecular descriptors calculated with Mordred. Each compound also carries a label (1: mutagenic, 0: non-mutagenic) for each of five strains (TA98, TA100, TA102, TA1535, TA1537), plus a general label (Overall) that is the ground-truth consensus label used to evaluate the final Ames mutagenicity prediction. A third label value, -1, marks an undefined result. Compounds with an undefined label in one or more strains, but a positive or negative label in the remaining strains, can still be used in the multi-task modeling process.

The compounds were originally compiled by the Istituto Superiore di Sanità (https://www.iss.it/isstox) and result from an exhaustive pre-processing stage consisting of filtering, sanitization, and canonicalization steps. The dataset is intended as a source for QSAR modeling of Ames mutagenicity: it provides information on the mutagenic potential for several S. Typhimurium strains, together with the Overall label.

The .csv file contains the information needed for predictive modeling tasks. The columns "TA98", "TA100", "TA102", "TA1535" and "TA1537" hold the labels for each strain, while the "Overall" column holds the consensus label used to evaluate the final prediction of mutagenicity. The "Partition" column takes three values, "Train", "Internal" and "External", which identify the data partition to which each compound was assigned during our experimentation process. All remaining columns are molecular descriptors calculated using Mordred.
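The column layout described above can be read with a few lines of standard-library Python. What follows is a minimal sketch under stated assumptions, not the authors' pipeline: the SMILES column name ("SMILES"), the descriptor names (desc_a, desc_b), the helper names load_ames and masked_accuracy, and the sample rows are all hypothetical; only the strain, "Overall" and "Partition" column names and the -1 convention come from the description. It groups rows by partition and builds a per-strain mask so that undefined labels are skipped rather than treated as negatives.

```python
import csv
import io

STRAINS = ["TA98", "TA100", "TA102", "TA1535", "TA1537"]
UNDEFINED = -1  # value the dataset uses for "no result for this strain"


def load_ames(handle, smiles_col="SMILES"):
    """Group rows by the "Partition" column.

    smiles_col is an assumed column name; adjust it to the real header.
    Returns {partition: [record, ...]}.
    """
    reserved = set(STRAINS) | {"Overall", "Partition", smiles_col}
    partitions = {}
    for row in csv.DictReader(handle):
        # int(float(...)) tolerates labels stored as "1" or "1.0".
        labels = [int(float(row[s])) for s in STRAINS]
        record = {
            "smiles": row[smiles_col],
            "labels": labels,
            # True where the strain has a defined (0 or 1) label.
            "mask": [y != UNDEFINED for y in labels],
            "overall": int(float(row["Overall"])),
            # Every non-reserved column is treated as a descriptor (raw string).
            "descriptors": {k: v for k, v in row.items() if k not in reserved},
        }
        partitions.setdefault(row["Partition"], []).append(record)
    return partitions


def masked_accuracy(records, predictions):
    """Accuracy over defined strain labels only; undefined entries are skipped."""
    hits = total = 0
    for rec, pred in zip(records, predictions):
        for y, p, defined in zip(rec["labels"], pred, rec["mask"]):
            if defined:
                hits += int(y == p)
                total += 1
    return hits / total if total else float("nan")


# Made-up rows in the documented layout (NOT real dataset entries).
SAMPLE = """SMILES,TA98,TA100,TA102,TA1535,TA1537,Overall,Partition,desc_a,desc_b
CCO,0,0,-1,0,-1,0,Train,1.0,2.0
c1ccccc1N,1,1,-1,-1,0,1,Train,0.5,3.1
CC(=O)O,0,-1,-1,-1,-1,0,Internal,0.2,1.7
"""

data = load_ames(io.StringIO(SAMPLE))
train = data["Train"]

# Baseline that predicts "non-mutagenic" for every strain.
all_negative = [[0] * len(STRAINS) for _ in train]
print(len(train), masked_accuracy(train, all_negative))
```

To run this against the published file, pass an open file handle instead of the in-memory sample. The same mask can weight a per-strain training loss, so a compound tested on only some strains still contributes to the tasks for which it has a label.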

Institutions

Universidad Nacional del Sur
Universidad Nacional del Centro de la Provincia de Buenos Aires
Consejo Superior de Investigaciones Cientificas
