Fei Lin1*, Ziyang Gong2, 4*, Cong Wang3*, Yonglin Tian3, Tengchao Zhang1, Xue Yang2, Gen Luo4, Fei-Yue Wang1, 3
1 Macau University of Science and Technology, 2 Shanghai Jiao Tong University
3 Institute of Automation, Chinese Academy of Sciences, 4 Shanghai AI Laboratory
* Equal contribution
This work investigates the capacity of general Multimodal Large Language Models (MLLMs) to perform structure-level molecular refinement for toxicity repair tasks. We present ToxiMol, the first benchmark designed explicitly for this task, which encompasses 11 toxicity remediation tasks involving a total of 560 toxic molecules. We also provide an evaluation framework (ToxiEval) to assess toxicity reduction, structural validity, drug-likeness, and other relevant properties.
- [2025/06/13] The ToxiMol paper is released on arXiv and will be updated continually!
- [2025/06/09] The ToxiMol dataset is released on Hugging Face.
- Overview
- Dataset Structure
- Evaluation
- Usage
- Acknowledgement
- Star History
- Citation
The ToxiMol benchmark provides:
- A curated dataset of 560 toxic molecules across 11 task types, including functional group preservation, endpoint-specific detoxification, and mechanism-aware edits.
- An expert-informed Mechanism-Aware Prompt Annotation Pipeline, tailored for general-purpose and chemical-aware models.
- The ToxiEval evaluation framework, which offers automated assessment of:
  - Safety Score
  - Quantitative Estimate of Drug-likeness
  - Synthetic Accessibility Score
  - Lipinski's Rule of Five
  - Structural Similarity
We systematically test nearly 30 state-of-the-art MLLMs with diverse architectures and input modalities to assess their ability to perform structure-level molecular toxicity repair.
To construct a representative and challenging benchmark for molecular toxicity repair, we systematically define 11 toxicity repair tasks based on all toxicity prediction tasks under the "Single-instance Prediction Problem" category from the Therapeutics Data Commons (TDC) platform.
The ToxiMol dataset consists of 560 curated toxic molecules covering both binary classification and regression tasks across diverse toxicity mechanisms. The Tox21 dataset retains all of its 12 original sub-tasks, while 10 sub-tasks are randomly selected from the ToxCast dataset. All task names are kept consistent with those in the original datasets.
Dataset | Task Type | Molecules | Description |
---|---|---|---|
AMES | Binary Classification | 50 | Mutagenicity testing via Ames assay |
Carcinogens | Binary Classification | 50 | Carcinogenicity prediction |
ClinTox | Binary Classification | 50 | Clinical toxicity from failed trials |
DILI | Binary Classification | 50 | Drug-induced liver injury |
hERG | Binary Classification | 50 | hERG channel inhibition (cardiotoxicity) |
hERG_Central | Binary Classification | 50 | Large-scale hERG database with cardiac safety profiles |
hERG_Karim | Binary Classification | 50 | Integrated hERG dataset from multiple sources |
LD50_Zhu | Regression (log(LD50) < 2) | 50 | Acute toxicity lethal dose prediction |
Skin Reaction | Binary Classification | 50 | Adverse skin sensitization reactions |
Tox21 | Binary Classification (12 sub-tasks) | 60 | Nuclear receptors & stress response pathways (e.g., ARE, p53, ER, AR) |
ToxCast | Binary Classification (10 sub-tasks) | 50 | Diverse toxicity pathways including mitochondrial dysfunction & neurotoxicity |
Each sample is paired with structural detoxification prompts and comprehensive evaluation metadata. The benchmark covers approximately 30 distinct small-molecule toxicity mechanisms, providing a comprehensive testbed for molecular detoxification methods.
You can also access the dataset on Hugging Face:
https://huggingface.co/datasets/DeepYoke/ToxiMol-benchmark
We propose ToxiEval, a multi-dimensional evaluation protocol consisting of the following metrics:
Metric | Description | Range | Threshold for Success |
---|---|---|---|
Safety Score | Indicates toxicity mitigation, based on TxGemma-Predict classification | 0–1 or binary | = 1 (binary) or > 0.5 (LD50 task) |
Quantitative Estimate of Drug-likeness (QED) | Drug-likeness score in [0, 1]; higher means more drug-like | 0–1 | ≥ 0.5 |
Synthetic Accessibility Score (SAS) | Synthetic feasibility; lower scores are better | 1–10 | ≤ 6 |
Lipinski's Rule of Five (RO5) | Number of Lipinski rule violations (should be minimal) | Integer (≥ 0) | ≤ 1 |
Structural Similarity (SS) | Scaffold similarity (Tanimoto) between original and repaired molecules | 0–1 | ≥ 0.4 |
A candidate molecule is considered successfully detoxified only if it satisfies all five criteria simultaneously.
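For illustration, the check can be expressed in a few lines of RDKit. This is a minimal sketch of the pass/fail logic (assuming the binary-task safety criterion, with the Safety Score and SAS computed upstream), not the exact code in `evaluation/`:

```python
# Minimal sketch of the ToxiEval pass/fail logic using RDKit.
# Thresholds follow the table above; Safety Score and SAS come from upstream steps.
from rdkit import Chem, DataStructs
from rdkit.Chem import QED, Descriptors, Lipinski, AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold

def lipinski_violations(mol):
    """Count violations of Lipinski's Rule of Five."""
    return sum([
        Descriptors.MolWt(mol) > 500,
        Descriptors.MolLogP(mol) > 5,
        Lipinski.NumHDonors(mol) > 5,
        Lipinski.NumHAcceptors(mol) > 10,
    ])

def scaffold_similarity(mol_a, mol_b):
    """Tanimoto similarity between Murcko scaffolds (Morgan fingerprints)."""
    fp_a, fp_b = (
        AllChem.GetMorganFingerprintAsBitVect(
            MurckoScaffold.GetScaffoldForMol(m), 2, nBits=2048)
        for m in (mol_a, mol_b)
    )
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

def passes_toxieval(orig_smiles, repaired_smiles, safety_score, sas_score):
    """All five criteria must hold simultaneously."""
    orig = Chem.MolFromSmiles(orig_smiles)
    repaired = Chem.MolFromSmiles(repaired_smiles)
    if repaired is None:          # structural validity is a precondition
        return False
    return (safety_score == 1     # binary tasks; the LD50 task uses > 0.5 instead
            and QED.qed(repaired) >= 0.5
            and sas_score <= 6
            and lipinski_violations(repaired) <= 1
            and scaffold_similarity(orig, repaired) >= 0.4)
```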
- **Safety Score**: Computed using `txgemma-9b-predict` by default in `evaluation/molecule_utils.py`. Users can choose the 2B/9B/27B variants; see the ablation studies in our paper.
- **SAS Score**: Calculated using `evaluation/sascorer.py` and `evaluation/fpscores.pkl.gz` from RDKit Contrib.
- **TxGemma TDC Prompts**: Task-specific prompts stored in `evaluation/tdc_prompts.json`, sourced from TxGemma's official release.
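For reference, a Safety Score query might look roughly like the following with the Transformers library. This is a sketch under the assumption that the released prompt file keys tasks by TDC name (e.g., `"AMES"`) with a `{Drug SMILES}` placeholder; it is not a verbatim excerpt of `evaluation/molecule_utils.py`:

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/txgemma-9b-predict"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Fill the task template with the repaired molecule's SMILES.
# The "AMES" key and "{Drug SMILES}" placeholder are assumptions based on
# TxGemma's released prompt file; check evaluation/tdc_prompts.json for the
# exact keys.
with open("evaluation/tdc_prompts.json") as f:
    tdc_prompts = json.load(f)
prompt = tdc_prompts["AMES"].replace("{Drug SMILES}", "CC(=O)Nc1ccc(O)cc1")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```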
```bash
# Clone the repository
git clone https://github.com/DeepYoke/ToxiMol.git --recursive
cd ToxiMol

# Install dependencies
pip install -r requirements.txt
```
To run DeepSeek-VL V2, we recommend setting up a dedicated Conda environment following the instructions in the DeepSeek-VL2 GitHub repository. Once the environment is activated, run:
```bash
cd experiments/opensource/DeepSeek

# Install dependencies
pip install -e .
```
The ToxiMol dataset is hosted on Hugging Face:
```python
from datasets import load_dataset

# Load a specific task
dataset = load_dataset("DeepYoke/ToxiMol-benchmark", data_dir="ames", split="train", trust_remote_code=True)
```
Available tasks: `ames`, `carcinogens_lagunin`, `clintox`, `dili`, `herg`, `herg_central`, `herg_karim`, `ld50_zhu`, `skin_reaction`, `tox21`, `toxcast`.
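To pull every task in one pass, you can loop over the task directories; a short sketch using only the documented `load_dataset` call:

```python
from datasets import load_dataset

tasks = ["ames", "carcinogens_lagunin", "clintox", "dili", "herg", "herg_central",
         "herg_karim", "ld50_zhu", "skin_reaction", "tox21", "toxcast"]

# Load each task split and report its size and columns.
for task in tasks:
    ds = load_dataset("DeepYoke/ToxiMol-benchmark", data_dir=task,
                      split="train", trust_remote_code=True)
    print(f"{task}: {len(ds)} molecules, columns: {ds.column_names}")
```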
For closed-source MLLMs, we provide the GPT series as an example. Any GPT model supporting text+image input can be tested (e.g., `gpt-4.1`, `gpt-4o`, `gpt-o3`), provided your API key has access.
```bash
# Run single task
python experiments/gpt/run_toxicity_repair.py \
    --task ames \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY

# Run all tasks
python experiments/gpt/run_toxicity_repair.py \
    --task all \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY

# Limit molecules per task (useful for testing)
python experiments/gpt/run_toxicity_repair.py \
    --task ames \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY \
    --limit 10
```
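Under the hood, each request pairs a rendered molecule image with a repair prompt. A rough sketch of such a call with the official `openai` client is shown below; the file name and prompt wording are illustrative placeholders, not the repository's actual prompts:

```python
import base64
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

# Illustrative inputs: a rendered molecule image plus a repair instruction.
with open("molecule.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "This molecule tested positive in the Ames assay. Propose a "
                     "minimally modified analog that removes the mutagenicity risk. "
                     "Answer with a single SMILES string."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```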
```bash
# InternVL3 (recommended)
python experiments/opensource/run_opensource.py \
    --task ames \
    --model internvl3 \
    --model_path OpenGVLab/InternVL3-8B

# DeepSeek-VL V2
python experiments/opensource/run_opensource.py \
    --task all \
    --model deepseekvl2 \
    --model_path deepseek-ai/deepseek-vl2-small

# LLaVA-OneVision
python experiments/opensource/run_opensource.py \
    --task clintox \
    --model llava-onevision \
    --model_path lmms-lab/llava-onevision-qwen2-7b-ov

# Qwen2.5-VL
python experiments/opensource/run_opensource.py \
    --task herg \
    --model qwen2.5vl \
    --model_path Qwen/Qwen2.5-VL-7B-Instruct
```
Available Tasks: `ames`, `carcinogens_lagunin`, `clintox`, `dili`, `herg`, `herg_central`, `herg_karim`, `ld50_zhu`, `skin_reaction`, `tox21`, `toxcast`, `all`
Supported Models: `internvl3`, `deepseekvl2`, `llava-onevision`, `qwen2.5vl`
After running experiments, evaluate the results using our ToxiEval framework with the `--full` flag:
```bash
python evaluation/run_evaluation.py \
    --results-dir experiments/opensource/results \
    --model InternVL3-8B \
    --full

# Evaluate specific model and task
python evaluation/run_evaluation.py \
    --results-dir experiments/gpt/results \
    --model gpt-4.1 \
    --task ames \
    --full

# Evaluate all tasks for a model
python evaluation/run_evaluation.py \
    --results-dir experiments/gpt/results \
    --model gpt-4.1 \
    --full

# Evaluate open-source model results
python evaluation/run_evaluation.py \
    --results-dir experiments/opensource/results \
    --model llava-one-vision-72b \
    --full
```
Results are organized as follows:
```
experiments/
├── gpt/results/
│   └── gpt-4.1/
│       ├── ames/ames_results.json
│       ├── clintox/clintox_results.json
│       └── overall_summary.json
└── opensource/results/
    └── llava-one-vision-72b/
        ├── ames/ames_results.json
        ├── herg/herg_results.json
        └── overall_summary.json
```
Evaluation results are saved under `experiments/eval_results/`:

```
experiments/eval_results/
└── Qwen2.5-VL-32B-Instruct/
    └── all_tasks/
        ├── all_tasks_evaluation_summary.json
        ├── all_tasks_evaluation_summary.csv
        ├── tox21_subtasks_evaluation_summary.json
        ├── tox21_subtasks_evaluation_summary.csv
        ├── toxcast_subtasks_evaluation_summary.json
        └── toxcast_subtasks_evaluation_summary.csv
```
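To inspect a summary programmatically, something like the following works; the JSON schema is an assumption here, so check the actual files for the exact field names:

```python
import json
from pathlib import Path

# Path follows the tree above; the JSON layout itself is an assumption,
# so inspect the file for the exact keys it contains.
summary_path = Path("experiments/eval_results/Qwen2.5-VL-32B-Instruct/"
                    "all_tasks/all_tasks_evaluation_summary.json")
summary = json.loads(summary_path.read_text())

for key, value in summary.items():
    print(key, value)
```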
Custom Generation Parameters:
```bash
python experiments/opensource/run_opensource.py \
    --task ames \
    --model internvl3 \
    --model_path OpenGVLab/InternVL3-8B \
    --temperature 0.7 \
    --max-tokens 1024
```
Process Specific Molecules:
```bash
python experiments/gpt/run_toxicity_repair.py \
    --task ames \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY \
    --molecule-ids 1 5 10 15
```
Custom Output Directory:
```bash
python evaluation/run_evaluation.py \
    --results-dir experiments/opensource/results \
    --output-dir custom_eval_results \
    --full
```
If the code fails to extract SMILES from model outputs, you can run the extraction script manually:

```bash
python evaluation/extract_smiles.py \
    --results-dir experiments/opensource/results/model_type
```
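If you need to post-process raw outputs yourself, a rough fallback is to scan the response for SMILES-like tokens and keep only those RDKit can parse. This is a simplified sketch, not the logic of `evaluation/extract_smiles.py`:

```python
import re
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence parse errors from non-SMILES tokens

def extract_smiles(text):
    """Return canonicalized substrings that RDKit parses as valid molecules."""
    # Rough SMILES-ish token pattern; tighten as needed.
    candidates = re.findall(r"[A-Za-z0-9@+\-\[\]\(\)=#$/\\%.]{5,}", text)
    valid = []
    for token in candidates:
        mol = Chem.MolFromSmiles(token)
        if mol is not None and mol.GetNumAtoms() > 0:
            valid.append(Chem.MolToSmiles(mol))  # canonical form
    return valid

print(extract_smiles("The repaired molecule is CC(=O)Nc1ccc(O)cc1 (acetanilide-like)."))
```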
If you encounter the error "TypeError: process_vision_info() got an unexpected keyword argument", try:

```bash
pip install qwen-vl-utils==0.0.10
```
We sincerely thank the developers and contributors of the following tools and resources, which made this project possible. This project makes use of several external assets for molecular processing and evaluation. All assets are used in accordance with their respective licenses and terms of use:
**Therapeutics Data Commons (TDC)**: Provides the toxicity datasets that form the foundation of the ToxiMol benchmark.
- License: MIT License
- Official Website: tdcommons.ai
- Official GitHub: github.com/mims-harvard/TDC
- Papers:
- TDC-2: Multimodal foundation for therapeutic science (bioRxiv 2024)
- Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development (arXiv 2021)
- Artificial intelligence foundation for therapeutic science (Nature Chemical Biology 2022)
**TxGemma**: Used for toxicity prediction tasks. Provided by Google via the Hugging Face Transformers library.
- License: Health AI Developer Foundations Terms of Use
- Official GitHub: gemma-cookbook
- Paper: TxGemma: Efficient and Agentic LLMs for Therapeutics
**RDKit**: Used for computing QED, Lipinski's Rule of Five (RO5), molecular similarity, and other molecular operations.
- Version: 2023.09.6
- License: BSD 3-Clause
- Official GitHub: github.com/rdkit
- Website: rdkit.org
**SA Score**: Used to evaluate the synthetic feasibility of generated molecules. Implementation from the RDKit Contrib directory, by Peter Ertl and Greg Landrum.
- License: BSD 3-Clause
- Official GitHub: github.com/rdkit/rdkit/tree/master/Contrib/SA_Score
- Paper: Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions (Journal of Cheminformatics 2009)
If you use this benchmark, please cite:
```bibtex
@misc{lin2025breakingbadmoleculesmllms,
      title={Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?},
      author={Fei Lin and Ziyang Gong and Cong Wang and Yonglin Tian and Tengchao Zhang and Xue Yang and Gen Luo and Fei-Yue Wang},
      year={2025},
      eprint={2506.10912},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.10912},
}
```