Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Fei Lin1*, Ziyang Gong2, 4*, Cong Wang3*, Yonglin Tian3, Tengchao Zhang1, Xue Yang2, Gen Luo4, Fei-Yue Wang1, 3

1 Macau University of Science and Technology, 2 Shanghai Jiao Tong University

3 Institute of Automation, Chinese Academy of Sciences, 4 Shanghai AI Laboratory

* Equal contribution



This work investigates the capacity of general Multimodal Large Language Models (MLLMs) to perform structure-level molecular refinement for toxicity repair tasks. We present ToxiMol, the first benchmark designed explicitly for this task, which encompasses 11 toxicity remediation tasks involving a total of 560 toxic molecules. We also provide an evaluation framework (ToxiEval) to assess toxicity reduction, structural validity, drug-likeness, and other relevant properties.






🔥🔥🔥 News

  • 📚 [2025/06/13] The ToxiMol paper has been released on arXiv and will be updated continually!
  • 📊 [2025/06/09] The ToxiMol dataset has been released on Hugging Face.


🧬 Overview

The ToxiMol benchmark provides:

  • 🧪 A curated dataset of 560 toxic molecules across 11 task types, including functional group preservation, endpoint-specific detoxification, and mechanism-aware edits.
  • 🧭 An expert-informed Mechanism-Aware Prompt Annotation Pipeline, tailored for general-purpose and chemical-aware models.

The ToxiEval evaluation framework offers automated assessment of:

  • Safety Score
  • Quantitative Estimate of Drug-likeness
  • Synthetic Accessibility Score
  • Lipinski's Rule of Five
  • Structural Similarity

We systematically test nearly 30 state-of-the-art MLLMs with diverse architectures and input modalities to assess their ability to perform structure-level molecular toxicity repair.

📂 Dataset Structure

To construct a representative and challenging benchmark for molecular toxicity repair, we systematically define 11 toxicity repair tasks based on all toxicity prediction tasks under the "Single-instance Prediction Problem" category from the Therapeutics Data Commons (TDC) platform.

The ToxiMol dataset consists of 560 curated toxic molecules covering both binary classification and regression tasks across diverse toxicity mechanisms. The Tox21 dataset retains all of its 12 original sub-tasks, while 10 sub-tasks are randomly selected from the ToxCast dataset. All task names are kept consistent with those in the original datasets.

| Dataset | Task Type | Molecules | Description |
|---------|-----------|-----------|-------------|
| AMES | Binary Classification | 50 | Mutagenicity testing via Ames assay |
| Carcinogens | Binary Classification | 50 | Carcinogenicity prediction |
| ClinTox | Binary Classification | 50 | Clinical toxicity from failed trials |
| DILI | Binary Classification | 50 | Drug-induced liver injury |
| hERG | Binary Classification | 50 | hERG channel inhibition (cardiotoxicity) |
| hERG_Central | Binary Classification | 50 | Large-scale hERG database with cardiac safety profiles |
| hERG_Karim | Binary Classification | 50 | Integrated hERG dataset from multiple sources |
| LD50_Zhu | Regression (log(LD50) < 2) | 50 | Acute toxicity lethal dose prediction |
| Skin Reaction | Binary Classification | 50 | Adverse skin sensitization reactions |
| Tox21 | Binary Classification (12 sub-tasks) | 60 | Nuclear receptors & stress response pathways (e.g., ARE, p53, ER, AR) |
| ToxCast | Binary Classification (10 sub-tasks) | 50 | Diverse toxicity pathways including mitochondrial dysfunction & neurotoxicity |

Each sample is paired with structural detoxification prompts and comprehensive evaluation metadata. The benchmark covers approximately 30 distinct small-molecule toxicity mechanisms, providing a comprehensive testbed for molecular detoxification methods.

You can also access the dataset on Hugging Face:
👉 https://huggingface.co/datasets/DeepYoke/ToxiMol-benchmark

📊 Evaluation

We propose ToxiEval, a multi-dimensional evaluation protocol consisting of the following metrics:

| Metric | Description | Range | Threshold for Success |
|--------|-------------|-------|-----------------------|
| Safety Score | Indicates toxicity mitigation, based on TxGemma-Predict classification | 0–1 or binary | = 1 (binary) or > 0.5 (LD50 task) |
| Quantitative Estimate of Drug-likeness (QED) | Drug-likeness score in [0, 1]; higher means more drug-like | 0–1 | ≥ 0.5 |
| Synthetic Accessibility Score (SAS) | Synthetic feasibility; lower scores are better | 1–10 | ≤ 6 |
| Lipinski's Rule of Five (RO5) | Number of Lipinski rule violations (should be minimal) | Integer (≥ 0) | ≤ 1 |
| Structural Similarity (SS) | Scaffold similarity (Tanimoto) between original and repaired molecules | 0–1 | ≥ 0.4 |

A candidate molecule is considered successfully detoxified only if it satisfies all five criteria simultaneously.
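
For reference, the RDKit-based criteria can be checked along the lines of the sketch below. This is a minimal illustration, not the released implementation (see evaluation/run_evaluation.py): the Safety Score is produced by TxGemma-Predict and passed in externally, and the fingerprint settings, scaffold choice, and helper names here are assumptions for illustration only.

import os
import sys

from rdkit import Chem, DataStructs, RDConfig
from rdkit.Chem import QED, AllChem, Crippen, Descriptors, Lipinski
from rdkit.Chem.Scaffolds import MurckoScaffold

# The SA score lives in the RDKit Contrib directory (Ertl & Landrum implementation).
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer


def ro5_violations(mol):
    """Count violations of Lipinski's Rule of Five."""
    return sum([
        Descriptors.MolWt(mol) > 500,
        Crippen.MolLogP(mol) > 5,
        Lipinski.NumHDonors(mol) > 5,
        Lipinski.NumHAcceptors(mol) > 10,
    ])


def scaffold_similarity(mol_a, mol_b):
    """Tanimoto similarity between Murcko scaffolds (fingerprint settings are illustrative)."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(MurckoScaffold.GetScaffoldForMol(m), 2, 2048)
           for m in (mol_a, mol_b)]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])


def is_successful_repair(original_smiles, repaired_smiles, safety_score, is_ld50_task=False):
    """Apply the five ToxiEval criteria; safety_score comes from TxGemma-Predict."""
    original = Chem.MolFromSmiles(original_smiles)
    repaired = Chem.MolFromSmiles(repaired_smiles)
    if original is None or repaired is None:  # invalid SMILES cannot pass
        return False
    safety_ok = (safety_score > 0.5) if is_ld50_task else (safety_score == 1)
    return (safety_ok
            and QED.qed(repaired) >= 0.5
            and sascorer.calculateScore(repaired) <= 6
            and ro5_violations(repaired) <= 1
            and scaffold_similarity(original, repaired) >= 0.4)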

Implementation Details


🛠 Usage

🚀 Quick Start

# Clone the repository
git clone https://github.com/DeepYoke/ToxiMol.git --recursive
cd ToxiMol

# Install dependencies
pip install -r requirements.txt

To run DeepSeek-VL2, we recommend setting up a new Conda virtual environment following the instructions in the DeepSeek-VL2 GitHub repository. Once the environment is activated, please execute the following commands:

cd experiments/opensource/DeepSeek
# Install dependencies
pip install -e .

📊 Dataset Access

The ToxiMol dataset is hosted on Hugging Face:

from datasets import load_dataset

# Load a specific task
dataset = load_dataset("DeepYoke/ToxiMol-benchmark", data_dir="ames", split="train", trust_remote_code=True)

Available tasks: ames, carcinogens_lagunin, clintox, dili, herg, herg_central, herg_karim, ld50_zhu, skin_reaction, tox21, toxcast
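
A quick way to inspect what a loaded split contains (the exact column names depend on the hosted dataset, so treat this purely as a sanity check on the split loaded above):

print(len(dataset))      # number of toxic molecules in this task
print(dataset.features)  # available fields (e.g., SMILES strings and prompt metadata)
print(dataset[0])        # first sample as a plain Python dict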

🤖 Running Experiments

Option 1: OpenAI GPT Models

For closed-source MLLMs, we provide the GPT series as an example. Any GPT model that supports text+image input can be tested (e.g., gpt-4.1, gpt-4o, gpt-o3), provided your API key has access to it.

# Run single task
python experiments/gpt/run_toxicity_repair.py \
    --task ames \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY

# Run all tasks
python experiments/gpt/run_toxicity_repair.py \
    --task all \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY

# Limit molecules per task (useful for testing)
python experiments/gpt/run_toxicity_repair.py \
    --task ames \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY \
    --limit 10

Option 2: Open-Source MLLMs

# InternVL3 (recommended)
python experiments/opensource/run_opensource.py \
    --task ames \
    --model internvl3 \
    --model_path OpenGVLab/InternVL3-8B

# DeepSeek-VL V2
python experiments/opensource/run_opensource.py \
    --task all \
    --model deepseekvl2 \
    --model_path deepseek-ai/deepseek-vl2-small

# LLaVA-OneVision
python experiments/opensource/run_opensource.py \
    --task clintox \
    --model llava-onevision \
    --model_path lmms-lab/llava-onevision-qwen2-7b-ov

# Qwen2.5-VL
python experiments/opensource/run_opensource.py \
    --task herg \
    --model qwen2.5vl \
    --model_path Qwen/Qwen2.5-VL-7B-Instruct

Available Tasks: ames, carcinogens_lagunin, clintox, dili, herg, herg_central, herg_karim, ld50_zhu, skin_reaction, tox21, toxcast, all

Supported Models: internvl3, deepseekvl2, llava-onevision, qwen2.5vl

📈 Evaluation

After running experiments, evaluate the results using our ToxiEval framework with the --full parameter:

python evaluation/run_evaluation.py \
    --results-dir experiments/opensource/results \
    --model InternVL3-8B \
    --full

Example Commands

# Evaluate specific model and task
python evaluation/run_evaluation.py \
    --results-dir experiments/gpt/results \
    --model gpt-4.1 \
    --task ames \
    --full

# Evaluate all tasks for a model
python evaluation/run_evaluation.py \
    --results-dir experiments/gpt/results \
    --model gpt-4.1 \
    --full

# Evaluate open-source model results
python evaluation/run_evaluation.py \
    --results-dir experiments/opensource/results \
    --model llava-one-vision-72b \
    --full

πŸ“ Output Structure

Results are organized as follows:

Experiment Results (Raw Model Outputs):

experiments/
├── gpt/results/
│   └── gpt-4.1/
│       ├── ames/ames_results.json
│       ├── clintox/clintox_results.json
│       └── overall_summary.json
└── opensource/results/
    └── llava-one-vision-72b/
        ├── ames/ames_results.json
        ├── herg/herg_results.json
        └── overall_summary.json

Evaluation Results (ToxiEval Framework Outputs):

experiments/eval_results/
└── Qwen2.5-VL-32B-Instruct/
    └── all_tasks/
        ├── all_tasks_evaluation_summary.json
        ├── all_tasks_evaluation_summary.csv
        ├── tox21_subtasks_evaluation_summary.json
        ├── tox21_subtasks_evaluation_summary.csv
        ├── toxcast_subtasks_evaluation_summary.json
        └── toxcast_subtasks_evaluation_summary.csv
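
Both kinds of summaries are plain JSON/CSV files, so they can be inspected directly. A minimal sketch (the exact keys and columns depend on the model and tasks evaluated):

import csv
import json

# Aggregated raw results for one run.
with open("experiments/gpt/results/gpt-4.1/overall_summary.json") as f:
    print(json.load(f))

# ToxiEval per-task summary table.
with open("experiments/eval_results/Qwen2.5-VL-32B-Instruct/all_tasks/all_tasks_evaluation_summary.csv") as f:
    for row in csv.DictReader(f):
        print(row)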

⚡ Advanced Usage

Custom Generation Parameters:

python experiments/opensource/run_opensource.py \
    --task ames \
    --model internvl3 \
    --model_path OpenGVLab/InternVL3-8B \
    --temperature 0.7 \
    --max-tokens 1024

Process Specific Molecules:

python experiments/gpt/run_toxicity_repair.py \
    --task ames \
    --model gpt-4.1 \
    --api-key YOUR_OPENAI_API_KEY \
    --molecule-ids 1 5 10 15

Custom Output Directory:

python evaluation/run_evaluation.py \
    --results-dir experiments/opensource/results \
    --output-dir custom_eval_results \
    --full

👀 Q&As

If the code fails to extract SMILES, please extract them manually:

python evaluation/extract_smiles.py \
    --results-dir experiments/opensource/results/model_type

If you encounter the error "TypeError: process_vision_info() got an unexpected keyword argument", please try:

pip install qwen-vl-utils==0.0.10

🫢🏻 Acknowledgement

We sincerely thank the developers and contributors of the following tools and resources, which made this project possible. This project makes use of several external assets for molecular processing and evaluation. All assets are used in accordance with their respective licenses and terms of use:

TDC

Used for toxicity datasets that form the foundation of the ToxiMol benchmark. Provided by Therapeutics Data Commons.

TxGemma

Used for toxicity prediction tasks. Provided by Google via the Hugging Face Transformers library.

RDKit

Used for computing QED, Lipinski's Rule of Five (RO5), molecular similarity, and other molecular operations.

Synthetic Accessibility Score (SAS)

Used to evaluate the synthetic feasibility of generated molecules. Implementation from RDKit Contrib directory by Peter Ertl and Greg Landrum.


⭐ Star History

Star History Chart


🧑‍🔬 Citation

If you use this benchmark, please cite:

@misc{lin2025breakingbadmoleculesmllms,
      title={Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?}, 
      author={Fei Lin and Ziyang Gong and Cong Wang and Yonglin Tian and Tengchao Zhang and Xue Yang and Gen Luo and Fei-Yue Wang},
      year={2025},
      eprint={2506.10912},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.10912}, 
}
