Fei Lin1*, Ziyang Gong2, 4*, Cong Wang3*, Yonglin Tian3, Tengchao Zhang1, Xue Yang2, Gen Luo4, Fei-Yue Wang1, 3
1 Macau University of Science and Technology, 2 Shanghai Jiao Tong University
3 Institute of Automation, Chinese Academy of Sciences, 4 Shanghai AI Laboratory
* Equal contribution
This work investigates the capacity of general Multimodal Large Language Models (MLLMs) to perform structure-level molecular refinement for toxicity repair tasks. We present ToxiMol, the first benchmark designed explicitly for this task, which encompasses 11 toxicity remediation tasks involving a total of 560 toxic molecules. We also provide an evaluation framework (ToxiEval) to assess toxicity reduction, structural validity, drug-likeness, and other relevant properties.
- [2025/06/13] The ToxiMol paper is released on arXiv and will be updated continually!
- [2025/06/09] The ToxiMol dataset is released on Hugging Face.
- Overview
- Dataset Structure
- Evaluation
- Usage
- Acknowledgement
- Star History
- Citation
The ToxiMol benchmark provides:
- A curated dataset of 560 toxic molecules across 11 task types, including functional group preservation, endpoint-specific detoxification, and mechanism-aware edits.
- An expert-informed Mechanism-Aware Prompt Annotation Pipeline, tailored for general-purpose and chemical-aware models.
- The ToxiEval evaluation framework, which offers automated assessment of:
  - Safety Score
  - Quantitative Estimate of Drug-likeness
  - Synthetic Accessibility Score
  - Lipinski's Rule of Five
  - Structural Similarity
We systematically test nearly 30 state-of-the-art MLLMs with diverse architectures and input modalities to assess their ability to perform structure-level molecular toxicity repair.
To construct a representative and challenging benchmark for molecular toxicity repair, we systematically define 11 toxicity repair tasks based on all toxicity prediction tasks under the "Single-instance Prediction Problem" category from the Therapeutics Data Commons (TDC) platform.
The ToxiMol dataset consists of 560 curated toxic molecules covering both binary classification and regression tasks across diverse toxicity mechanisms. The Tox21 dataset retains all of its 12 original sub-tasks, while 10 sub-tasks are randomly selected from the ToxCast dataset. All task names are kept consistent with those in the original datasets.
Dataset | Task Type | Molecules | Description |
---|---|---|---|
AMES | Binary Classification | 50 | Mutagenicity testing via Ames assay |
Carcinogens | Binary Classification | 50 | Carcinogenicity prediction |
ClinTox | Binary Classification | 50 | Clinical toxicity from failed trials |
DILI | Binary Classification | 50 | Drug-induced liver injury |
hERG | Binary Classification | 50 | hERG channel inhibition (cardiotoxicity) |
hERG_Central | Binary Classification | 50 | Large-scale hERG database with cardiac safety profiles |
hERG_Karim | Binary Classification | 50 | Integrated hERG dataset from multiple sources |
LD50_Zhu | Regression (log(LD50) < 2) | 50 | Acute toxicity lethal dose prediction |
Skin Reaction | Binary Classification | 50 | Adverse skin sensitization reactions |
Tox21 | Binary Classification (12 sub-tasks) | 60 | Nuclear receptors & stress response pathways (e.g., ARE, p53, ER, AR) |
ToxCast | Binary Classification (10 sub-tasks) | 50 | Diverse toxicity pathways including mitochondrial dysfunction & neurotoxicity |
Each sample is paired with structural detoxification prompts and comprehensive evaluation metadata. The benchmark covers approximately 30 distinct small-molecule toxicity mechanisms, providing a comprehensive testbed for molecular detoxification methods.
You can also access the dataset on Hugging Face:
https://huggingface.co/datasets/DeepYoke/ToxiMol-benchmark
We propose ToxiEval, a multi-dimensional evaluation protocol consisting of the following metrics:
Metric | Description | Range | Threshold for Success |
---|---|---|---|
Safety Score | Indicates toxicity mitigation, based on TxGemma-Predict classification | 0–1 or binary | = 1 (binary) or > 0.5 (LD50 task) |
Quantitative Estimate of Drug-likeness (QED) | Drug-likeness score in [0, 1]; higher means more drug-like | 0–1 | ≥ 0.5 |
Synthetic Accessibility Score (SAS) | Synthetic feasibility; lower scores are better | 1–10 | ≤ 6 |
Lipinski's Rule of Five (RO5) | Number of Lipinski rule violations (should be minimal) | Integer (≥ 0) | ≤ 1 |
Structural Similarity (SS) | Scaffold similarity (Tanimoto) between original and repaired molecules | 0–1 | ≥ 0.4 |
A candidate molecule is considered successfully detoxified only if it satisfies all five criteria simultaneously.
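For illustration, the check can be expressed in a few lines of RDKit. This is a minimal sketch of the pass/fail logic (assuming the binary-task safety criterion, with the Safety Score and SAS computed upstream), not the exact code in `evaluation/`:

```python
# Minimal sketch of the ToxiEval pass/fail logic using RDKit.
# Thresholds follow the table above; Safety Score and SAS come from upstream steps.
from rdkit import Chem, DataStructs
from rdkit.Chem import QED, Descriptors, Lipinski, AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold

def lipinski_violations(mol):
    """Count violations of Lipinski's Rule of Five."""
    return sum([
        Descriptors.MolWt(mol) > 500,
        Descriptors.MolLogP(mol) > 5,
        Lipinski.NumHDonors(mol) > 5,
        Lipinski.NumHAcceptors(mol) > 10,
    ])

def scaffold_similarity(mol_a, mol_b):
    """Tanimoto similarity between Murcko scaffolds (Morgan fingerprints)."""
    fp_a, fp_b = (
        AllChem.GetMorganFingerprintAsBitVect(
            MurckoScaffold.GetScaffoldForMol(m), 2, nBits=2048)
        for m in (mol_a, mol_b)
    )
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

def passes_toxieval(orig_smiles, repaired_smiles, safety_score, sas_score):
    """All five criteria must hold simultaneously."""
    orig = Chem.MolFromSmiles(orig_smiles)
    repaired = Chem.MolFromSmiles(repaired_smiles)
    if repaired is None:          # structural validity is a precondition
        return False
    return (safety_score == 1     # binary tasks; the LD50 task uses > 0.5 instead
            and QED.qed(repaired) >= 0.5
            and sas_score <= 6
            and lipinski_violations(repaired) <= 1
            and scaffold_similarity(orig, repaired) >= 0.4)
```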
- **Safety Score**: Computed using `txgemma-9b-predict` by default in `evaluation/molecule_utils.py`. Users can choose the 2B/9B/27B variants; see the ablation studies in our paper.
- **SAS Score**: Calculated using `evaluation/sascorer.py` and `evaluation/fpscores.pkl.gz` from RDKit Contrib.
- **TxGemma TDC Prompts**: Task-specific prompts stored in `evaluation/tdc_prompts.json`, sourced from TxGemma's official release.
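For reference, a Safety Score query might look roughly like the following with the Transformers library. This is a sketch under the assumption that the released prompt file keys tasks by TDC name (e.g., `"AMES"`) with a `{Drug SMILES}` placeholder; it is not a verbatim excerpt of `evaluation/molecule_utils.py`:

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/txgemma-9b-predict"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Fill the task template with the repaired molecule's SMILES.
# The "AMES" key and "{Drug SMILES}" placeholder are assumptions based on
# TxGemma's released prompt file; check evaluation/tdc_prompts.json for the
# exact keys.
with open("evaluation/tdc_prompts.json") as f:
    tdc_prompts = json.load(f)
prompt = tdc_prompts["AMES"].replace("{Drug SMILES}", "CC(=O)Nc1ccc(O)cc1")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```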
```bash
# Clone the repository
git clone https://github.com/DeepYoke/ToxiMol.git --recursive
cd ToxiMol

# Install dependencies
pip install -r requirements.txt
```
To run DeepSeek-VL V2, we recommend setting up a dedicated Conda environment following the instructions in the DeepSeek-VL2 GitHub repository. Once the environment is activated, run:
```bash
cd experiments/opensource/DeepSeek

# Install dependencies
pip install -e .
```
The ToxiMol dataset is hosted on Hugging Face:
```python
from datasets import load_dataset

# Load a specific task
dataset = load_dataset("DeepYoke/ToxiMol-benchmark", data_dir="ames", split="train", trust_remote_code=True)
```
Available tasks: `ames`, `carcinogens_lagunin`, `clintox`, `dili`, `herg`, `herg_central`, `herg_karim`, `ld50_zhu`, `skin_reaction`, `tox21`, `toxcast`.
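To pull every task in one pass, you can loop over the task directories; a short sketch using only the documented `load_dataset` call:

```python
from datasets import load_dataset

tasks = ["ames", "carcinogens_lagunin", "clintox", "dili", "herg", "herg_central",
         "herg_karim", "ld50_zhu", "skin_reaction", "tox21", "toxcast"]

# Load each task split and report its size and columns.
for task in tasks:
    ds = load_dataset("DeepYoke/ToxiMol-benchmark", data_dir=task,
                      split="train", trust_remote_code=True)
    print(f"{task}: {len(ds)} molecules, columns: {ds.column_names}")
```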
For closed-source MLLMs, we provide the GPT series as an example. Any GPT model supporting text+image input can be tested (e.g., `gpt-4.1`, `gpt-4o`, `gpt-o3`), provided your API key has access.
```bash
# Run single task
python experiments/gpt/run_toxicity_repair.py \
    --task ames \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY

# Run all tasks
python experiments/gpt/run_toxicity_repair.py \
    --task all \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY

# Limit molecules per task (useful for testing)
python experiments/gpt/run_toxicity_repair.py \
    --task ames \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY \
    --limit 10
```
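Under the hood, each request pairs a rendered molecule image with a repair prompt. A rough sketch of such a call with the official `openai` client is shown below; the file name and prompt wording are illustrative placeholders, not the repository's actual prompts:

```python
import base64
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

# Illustrative inputs: a rendered molecule image plus a repair instruction.
with open("molecule.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "This molecule tested positive in the Ames assay. Propose a "
                     "minimally modified analog that removes the mutagenicity risk. "
                     "Answer with a single SMILES string."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```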
```bash
# InternVL3 (recommended)
python experiments/opensource/run_opensource.py \
    --task ames \
    --model internvl3 \
    --model_path OpenGVLab/InternVL3-8B

# DeepSeek-VL V2
python experiments/opensource/run_opensource.py \
    --task all \
    --model deepseekvl2 \
    --model_path deepseek-ai/deepseek-vl2-small

# LLaVA-OneVision
python experiments/opensource/run_opensource.py \
    --task clintox \
    --model llava-onevision \
    --model_path lmms-lab/llava-onevision-qwen2-7b-ov

# Qwen2.5-VL
python experiments/opensource/run_opensource.py \
    --task herg \
    --model qwen2.5vl \
    --model_path Qwen/Qwen2.5-VL-7B-Instruct
```
Available Tasks: `ames`, `carcinogens_lagunin`, `clintox`, `dili`, `herg`, `herg_central`, `herg_karim`, `ld50_zhu`, `skin_reaction`, `tox21`, `toxcast`, `all`
Supported Models: `internvl3`, `deepseekvl2`, `llava-onevision`, `qwen2.5vl`
After running experiments, evaluate the results using our ToxiEval framework with the `--full` flag:
```bash
python evaluation/run_evaluation.py \
    --results-dir experiments/opensource/results \
    --model InternVL3-8B \
    --full

# Evaluate specific model and task
python evaluation/run_evaluation.py \
    --results-dir experiments/gpt/results \
    --model gpt-4.1 \
    --task ames \
    --full

# Evaluate all tasks for a model
python evaluation/run_evaluation.py \
    --results-dir experiments/gpt/results \
    --model gpt-4.1 \
    --full

# Evaluate open-source model results
python evaluation/run_evaluation.py \
    --results-dir experiments/opensource/results \
    --model llava-one-vision-72b \
    --full
```
Results are organized as follows:
```
experiments/
├── gpt/results/
│   └── gpt-4.1/
│       ├── ames/ames_results.json
│       ├── clintox/clintox_results.json
│       └── overall_summary.json
└── opensource/results/
    └── llava-one-vision-72b/
        ├── ames/ames_results.json
        ├── herg/herg_results.json
        └── overall_summary.json
```
Evaluation results are saved under `experiments/eval_results/`:

```
experiments/eval_results/
└── Qwen2.5-VL-32B-Instruct/
    └── all_tasks/
        ├── all_tasks_evaluation_summary.json
        ├── all_tasks_evaluation_summary.csv
        ├── tox21_subtasks_evaluation_summary.json
        ├── tox21_subtasks_evaluation_summary.csv
        ├── toxcast_subtasks_evaluation_summary.json
        └── toxcast_subtasks_evaluation_summary.csv
```
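To inspect a summary programmatically, something like the following works; the JSON schema is an assumption here, so check the actual files for the exact field names:

```python
import json
from pathlib import Path

# Path follows the tree above; the JSON layout itself is an assumption,
# so inspect the file for the exact keys it contains.
summary_path = Path("experiments/eval_results/Qwen2.5-VL-32B-Instruct/"
                    "all_tasks/all_tasks_evaluation_summary.json")
summary = json.loads(summary_path.read_text())

for key, value in summary.items():
    print(key, value)
```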
Custom Generation Parameters:
```bash
python experiments/opensource/run_opensource.py \
    --task ames \
    --model internvl3 \
    --model_path OpenGVLab/InternVL3-8B \
    --temperature 0.7 \
    --max-tokens 1024
```
Process Specific Molecules:
```bash
python experiments/gpt/run_toxicity_repair.py \
    --task ames \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY \
    --molecule-ids 1 5 10 15
```
Custom Output Directory:
```bash
python evaluation/run_evaluation.py \
    --results-dir experiments/opensource/results \
    --output-dir custom_eval_results \
    --full
```
If the code fails to extract SMILES from model outputs, you can run the extraction script manually:

```bash
python evaluation/extract_smiles.py \
    --results-dir experiments/opensource/results/model_type
```
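If you need to post-process raw outputs yourself, a rough fallback is to scan the response for SMILES-like tokens and keep only those RDKit can parse. This is a simplified sketch, not the logic of `evaluation/extract_smiles.py`:

```python
import re
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence parse errors from non-SMILES tokens

def extract_smiles(text):
    """Return canonicalized substrings that RDKit parses as valid molecules."""
    # Rough SMILES-ish token pattern; tighten as needed.
    candidates = re.findall(r"[A-Za-z0-9@+\-\[\]\(\)=#$/\\%.]{5,}", text)
    valid = []
    for token in candidates:
        mol = Chem.MolFromSmiles(token)
        if mol is not None and mol.GetNumAtoms() > 0:
            valid.append(Chem.MolToSmiles(mol))  # canonical form
    return valid

print(extract_smiles("The repaired molecule is CC(=O)Nc1ccc(O)cc1 (acetanilide-like)."))
```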
If you encounter the error "TypeError: process_vision_info() got an unexpected keyword argument", try:

```bash
pip install qwen-vl-utils==0.0.10
```
We sincerely thank the developers and contributors of the following tools and resources, which made this project possible. This project makes use of several external assets for molecular processing and evaluation. All assets are used in accordance with their respective licenses and terms of use:
**Therapeutics Data Commons (TDC)**: Provides the toxicity datasets that form the foundation of the ToxiMol benchmark.
- License: MIT License
- Official Website: tdcommons.ai
- Official GitHub: github.com/mims-harvard/TDC
- Papers:
- TDC-2: Multimodal foundation for therapeutic science (bioRxiv 2024)
- Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development (arXiv 2021)
- Artificial intelligence foundation for therapeutic science (Nature Chemical Biology 2022)
**TxGemma**: Used for toxicity prediction tasks. Provided by Google via the Hugging Face Transformers library.
- License: Health AI Developer Foundations Terms of Use
- Official GitHub: gemma-cookbook
- Paper: TxGemma: Efficient and Agentic LLMs for Therapeutics
**RDKit**: Used for computing QED, Lipinski's Rule of Five (RO5), molecular similarity, and other molecular operations.
- Version: 2023.09.6
- License: BSD 3-Clause
- Official GitHub: github.com/rdkit
- Website: rdkit.org
**SA Score**: Used to evaluate the synthetic feasibility of generated molecules. Implementation from the RDKit Contrib directory, by Peter Ertl and Greg Landrum.
- License: BSD 3-Clause
- Official GitHub: github.com/rdkit/rdkit/tree/master/Contrib/SA_Score
- Paper: Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions (Journal of Cheminformatics 2009)
If you use this benchmark, please cite:
```bibtex
@misc{lin2025breakingbadmoleculesmllms,
      title={Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?},
      author={Fei Lin and Ziyang Gong and Cong Wang and Yonglin Tian and Tengchao Zhang and Xue Yang and Gen Luo and Fei-Yue Wang},
      year={2025},
      eprint={2506.10912},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.10912},
}
```