Data and codes for RoBERTa-base repurposing to SELFIES chemical notation

Published: 7 October 2025 | Version 2 | DOI: 10.17632/3c27p5pzts.2
Contributors:
Obaid Alhmoudi

Description

This repository contains four Jupyter/Colab notebooks that reproduce the analyses reported in the manuscript. The Stage 1 notebook (Domain Adaptation) performs masked language modeling to adapt RoBERTa-base to a corpus of SELFIES strings, aligning the encoder with SELFIES grammar without introducing a new vocabulary. Stage 2 (Finetuning) applies compact supervision on seven QM9 quantum-chemical properties, shaping the embedding space toward structure–property relations. The REFPROP notebook generates thermophysical and transport property grids for 88 fluids from NIST REFPROP, producing the 108k vapor-phase state points used for further evaluation. A fourth notebook (Other Codes) contains supporting scripts for embedding extraction and mean-pooling, chemotype clustering, silhouette analysis, Mantel correlation testing, and multi-output regression with (T,P), RDKit descriptors, MoLFormer-XL-10pct, and SELFIES–QM9 embeddings. These codes enable reproduction of the training pipeline, property extraction, and evaluation protocols described in the study.
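The Mantel correlation test mentioned above compares two distance matrices computed over the same set of items, for example embedding distances against property distances. The sketch below is an illustration only and is not taken from the notebooks: the function names are hypothetical, and it assumes a Pearson correlation over the upper triangles with a one-sided permutation p-value. It uses only the Python standard library.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def permute(matrix, order):
    """Reorder rows and columns of a square matrix with the same permutation."""
    return [[matrix[i][j] for j in order] for i in order]


def mantel_test(dist_a, dist_b, n_permutations=999, seed=0):
    """Return (r, p) for a one-sided Mantel test (assumed variant).

    r is the Pearson correlation between the two upper triangles.
    p is the fraction of row/column permutations of dist_b whose
    correlation is at least as large as the observed one.
    """
    rng = random.Random(seed)
    flat_a = upper_triangle(dist_a)
    observed = pearson(flat_a, upper_triangle(dist_b))
    n = len(dist_b)
    hits = 0
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        r_perm = pearson(flat_a, upper_triangle(permute(dist_b, order)))
        if r_perm >= observed:
            hits += 1
    # Add one to numerator and denominator so p is never exactly zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value


# Toy data: the second matrix is a scaled copy of the first,
# so the two distance structures agree perfectly.
coords = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
dist_a = [[abs(p - q) for q in coords] for p in coords]
dist_b = [[2.0 * d for d in row] for row in dist_a]
r, p = mantel_test(dist_a, dist_b)
print(r, p)
```

Permuting rows and columns together, rather than shuffling individual entries, keeps each permuted matrix a valid distance matrix, which is the point of the Mantel procedure.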

Files

Steps to reproduce

1. Stage 1 (Domain Adaptation): https://colab.research.google.com/drive/1JgugGEok-FZUchAb4zfYYDLBKflLY1bc?usp=sharing
   Implements domain-adaptive masked language modeling of RoBERTa-base on a SELFIES corpus. This stage aligns the encoder with the SELFIES grammar without introducing new vocabulary.
2. Stage 2 (Finetuning): https://colab.research.google.com/drive/1pBo3lR5-mLDywKLvKBIqBCduoswHAI8p?usp=sharing
   Performs STILTs-style finetuning on seven QM9 quantum-chemical properties (polarizability, HOMO, LUMO, electronic extent, zero-point energy, Gibbs free energy, heat capacity), shaping the embedding space toward structure–property relationships.
3. REFPROP Code: REFPROP_code.ipynb
   Generates thermophysical and transport property grids for 88 fluids from NIST REFPROP, producing the 108k vapor-phase state points.
4. All Other Codes: https://colab.research.google.com/drive/10mefuiwn5MEO7cf7_CDhqYkJoUGCxiRy?usp=sharing
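For orientation, the masked language modeling objective used in Stage 1 can be sketched as a token-corruption step. The example below is a minimal, dependency-free illustration and not the notebook's code: the function name `mask_tokens` is hypothetical, and the 15% masking rate with the 80/10/10 split (mask token, random token, unchanged) is the common BERT-style convention, assumed here because the record does not state the settings used. The label value -100 follows the usual convention for positions ignored by the loss.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mask_prob=0.15, seed=None):
    """Corrupt a token sequence for masked language modeling (assumed recipe).

    Returns (inputs, labels). labels holds the original token at each
    selected position and -100 elsewhere, so unselected positions can be
    ignored when computing the loss.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never select special tokens; select others with mask_prob.
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # replace with mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # replace with random token
        # otherwise keep the original token in place
    return inputs, labels


# Illustrative ids only: 0 and 2 stand in for start/end tokens, and
# 50264 / 50265 for the mask id and vocabulary size of the tokenizer.
example_ids = [0] + list(range(100, 120)) + [2]
inputs, labels = mask_tokens(example_ids, mask_id=50264, vocab_size=50265,
                             special_ids={0, 2}, seed=0)
print(inputs)
print(labels)
```

Because the existing tokenizer is reused, only the corpus changes between generic pretraining and this adaptation step, which is consistent with the statement that no new vocabulary is introduced.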

Institutions

  • Khalifa University of Science and Technology

Categories

Chemical Engineering, Cheminformatics, Encoder LLM

Licence