Cost-Efficient Repurposing of a Monolingual SMILES-Based Chemical Transformer to SELFIES

Published: 27 January 2025 | Version 1 | DOI: 10.17632/27j2zg6f5x.1
Contributors:
Obaid Alhmoudi

Description

This repository supports the manuscript “Cost-Efficient Repurposing of a Monolingual SMILES-Based Chemical Transformer to SELFIES” and provides the data, models, and code needed to reproduce the reported experiments and figures. It includes two core datasets (SMILES_to_SELFIES.csv and Filtered_QM9.csv) for SELFIES-based fine-tuning and QM9 regression, along with a zip archive (selfies_finetuned_model.zip) containing the final ChemBERTa model fine-tuned on SELFIES. Also provided are four Jupyter notebooks (Finetuning and Figures.ipynb; QM9 regression: SELFIES FT model.ipynb; QM9 regression: ChemBERTa-77M-MLM model.ipynb; and QM9 Regression: ChemBERTa-zinc-base-v1 model.ipynb) that show how the analyses, plots, and performance metrics were generated. Each notebook includes code and outputs covering the end-to-end methodology, from data preparation through model evaluation.
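This description does not state how SMILES_to_SELFIES.csv was produced. As a rough illustration of the underlying conversion, the sketch below pairs SMILES strings with their SELFIES encodings. It assumes the third-party selfies package is installed and uses its encoder function; the function name smiles_to_selfies and the optional encoder argument are hypothetical, introduced here only so the conversion step can be swapped out or exercised without the package.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def smiles_to_selfies(
    smiles_list: Iterable[str],
    encoder: Optional[Callable[[str], str]] = None,
) -> List[Tuple[str, Optional[str]]]:
    """Pair each SMILES string with its SELFIES encoding (None if encoding fails)."""
    if encoder is None:
        # Assumption: the third-party package is available (pip install selfies).
        # It is imported lazily so the function can also run with a custom encoder.
        import selfies as sf

        encoder = sf.encoder

    pairs: List[Tuple[str, Optional[str]]] = []
    for smiles in smiles_list:
        try:
            pairs.append((smiles, encoder(smiles)))
        except Exception:
            # The encoder raises on SMILES it cannot represent;
            # keep the row with None so failures stay visible.
            pairs.append((smiles, None))
    return pairs
```

Rows that come back with None can be filtered out before tokenization, which keeps the failure count easy to report alongside the dataset size.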

Steps to reproduce

1. Finetuning and Figures.ipynb
   Demonstrates how the SELFIES-based fine-tuning was conducted and how the manuscript’s figures were generated.
   Link: https://colab.research.google.com/drive/19OKVBugflvfIg_PFfeKclteiU7GypC4r?usp=sharing
2. QM9 regression: SELFIES FT model.ipynb
   Shows how the SELFIES-finetuned ChemBERTa model was applied to predict 12 QM9 properties.
   Link: https://colab.research.google.com/drive/1ECwKUutl-eS3jAhRnAeJC1sJepY2X1qF?usp=sharing
3. QM9 regression: ChemBERTa-77M-MLM model.ipynb
   Outlines the same QM9 regression tasks using the larger ChemBERTa-77M-MLM model.
   Link: https://colab.research.google.com/drive/1LJIa1LPSpnt8xmI1FAE_6D1NH8b1hHO6?usp=sharing
4. QM9 Regression: ChemBERTa-zinc-base-v1 model.ipynb
   Provides a baseline comparison by performing QM9 regression with the original ChemBERTa-zinc-base-v1 model.
   Link: https://colab.research.google.com/drive/1O3nLWSnPaooTW1V7280tBLuwK7SEVVpe?usp=sharing
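The regression metrics are not named in this description, so the notebooks remain the authority on how performance was scored. As one way the three QM9 runs might be compared on equal footing, the sketch below computes a mean absolute error per property from plain Python lists. The choice of MAE, the function name mae_per_property, and the placeholder property names are assumptions made for illustration.

```python
from typing import Dict, Sequence


def mae_per_property(
    y_true: Sequence[Sequence[float]],
    y_pred: Sequence[Sequence[float]],
    property_names: Sequence[str],
) -> Dict[str, float]:
    """Mean absolute error for each target column (rows = molecules)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of rows")
    if not y_true:
        raise ValueError("at least one row is required")

    width = len(property_names)
    totals = [0.0] * width
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != width or len(pred_row) != width:
            raise ValueError("every row must have one value per property")
        for index, (true_value, pred_value) in enumerate(zip(true_row, pred_row)):
            totals[index] += abs(true_value - pred_value)

    count = len(y_true)
    return {name: total / count for name, total in zip(property_names, totals)}


if __name__ == "__main__":
    # Placeholder names and toy numbers, not results from the manuscript.
    names = ["property_1", "property_2"]
    truth = [[1.0, 2.0], [3.0, 4.0]]
    preds = [[1.5, 2.0], [2.0, 6.0]]
    print(mae_per_property(truth, preds, names))
```

Running the same function on the predictions from each of the three regression notebooks would give one error per property per model, which can then be tabulated side by side.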

Institutions

Khalifa University of Science and Technology

Categories

Chemical Engineering, Artificial Intelligence, Transformer LLM
