This repository contains code and resources for predicting the cytotoxicity of both individual chemical compounds and their mixtures using various machine learning approaches. The project leverages different molecular representation techniques and machine learning models to achieve accurate toxicity predictions.
This directory contains code for a machine learning model that predicts cytotoxicity using pre-trained ChemBERTa embeddings as molecular representations.
Key Files:
- 8020_positive-only.ipynb: Jupyter notebook implementing a Random Forest Regression model for cytotoxicity prediction
- formula_descriptors.py: Module for calculating embeddings for chemical mixtures based on individual compound embeddings
- 80_20_level_model_pretrained.pkl: Saved trained model
- README_ChemBERTA-pre-trained.md: Detailed documentation
Data:
- Chemical embeddings generated from ChemBERTa
- Cytotoxicity data for individual compounds and mixtures
- Mixture composition information
Dependencies:
- pandas, numpy, scikit-learn, matplotlib, pickle
Approach:
- Uses pre-trained ChemBERTa embeddings (384-dimensional) to represent chemical compounds
- Calculates mixture embeddings as weighted sums of individual compound embeddings
- Implements a Random Forest Regression model with hyperparameter tuning
- Evaluates performance using R², MAE, and RMSE metrics
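The three evaluation metrics named above can be computed as in the following minimal sketch. It is written in plain Python so that it runs without dependencies and is not taken from the notebook. The function name `regression_metrics` and the example values are made up for illustration; in practice the equivalent functions in `sklearn.metrics` would normally be used.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MAE, MSE, and RMSE for two equal-length lists of floats."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares, used for R².
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = ss_res / n
    return {
        "r2": 1.0 - ss_res / ss_tot if ss_tot else float("nan"),
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),  # RMSE is the square root of MSE
    }


# Made-up observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 2.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

RMSE is in the same units as the measured cytotoxicity, which makes it easier to read alongside MAE than MSE is.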
This directory contains code for a cytotoxicity prediction model using molecular descriptors generated from the R Chemistry Development Kit (RCDK).
Key Files:
- formula_descriptors.py: Module for calculating descriptors for chemical mixtures
- README_RCDK.md: Detailed documentation
Dependencies:
- pandas, numpy, scikit-learn, matplotlib, pickle
Approach:
- Uses RCDK-generated molecular descriptors to represent chemical compounds
- Calculates mixture descriptors as weighted sums of individual compound descriptors
- Implements a Random Forest Regression model with hyperparameter tuning
- Evaluates performance using R², MAE, and MSE metrics
This directory contains R scripts for generating molecular descriptors using the R Chemistry Development Kit (RCDK).
Key Files:
- RCDK.R: R script for calculating molecular descriptors
- kim_single_ingredient.csv: Input data with compound names and SMILES strings
- KIM_rcdk_descriptors.csv: Output file with calculated descriptors
- README_RCDK.md: Detailed documentation
Dependencies:
- rcdk, readr, mlbench, caret, dplyr, pROC
Functionality:
- Imports chemical compound data (names and SMILES representations)
- Calculates molecular descriptors using RCDK
- Cleans and normalizes descriptor data
- Exports processed data for further analysis
This directory contains resources for fine-tuning the ChemBERTa model specifically for liver toxicity prediction.
Key Files:
- chemberta-liver.ipynb: Notebook for fine-tuning ChemBERTa on liver toxicity data
- AID_1224867_datatable_hepg2_24h.csv: PubChem bioassay data for HepG2 cell line toxicity (24h exposure)
- AID_1224879_datatable_hepg2_40h.csv: PubChem bioassay data for HepG2 cell line toxicity (40h exposure)
- README.md: Detailed documentation
Dependencies:
- PyTorch, Transformers, scikit-learn, pandas
Approach:
- Uses DeepChem/ChemBERTa-77M-MLM model
- Fine-tunes for binary classification of liver toxicity
- Processes SMILES strings using ChemBERTa's tokenizer
- Evaluates using accuracy and binary cross-entropy loss
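For reference, accuracy and binary cross-entropy for a binary toxic/non-toxic classifier can be computed as in the sketch below. It uses plain Python rather than PyTorch so that it runs standalone and is not taken from the notebook. The function names, the 0.5 decision threshold, and the example labels and probabilities are assumptions made for illustration.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy; probs are predicted P(toxic) in [0, 1]."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def accuracy(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(labels, probs))
    return hits / len(labels)


# Made-up labels (1 = toxic) and predicted probabilities, for illustration only.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.1]
bce = binary_cross_entropy(labels, probs)
acc = accuracy(labels, probs)
print(f"accuracy={acc:.2f}  bce={bce:.4f}")
```

Accuracy depends on the chosen threshold, while binary cross-entropy also penalizes confident wrong predictions, so the two are usually reported together.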
This directory contains visualization outputs and possibly code related to ChemBERTa models trained on PubChem data.
Contents:
- Visualization figures for various chemical compounds and mixtures
This directory likely contains pre-trained embeddings for chemical compounds, used as inputs for the machine learning models.
The ToxMix project aims to develop accurate predictive models for chemical toxicity, with a particular focus on predicting the cytotoxicity of chemical mixtures. The project employs multiple approaches:
- ChemBERTa Embeddings: Using transformer-based molecular representations from pre-trained ChemBERTa models
- RCDK Descriptors: Using traditional molecular descriptors calculated with the R Chemistry Development Kit
- Fine-tuned ChemBERTa: Specifically fine-tuning ChemBERTa for liver toxicity prediction
A key innovation in this project is the methodology for representing chemical mixtures, which uses a weighted sum approach based on the mole fractions of constituent compounds.
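The weighted-sum representation can be sketched as follows. This is a minimal illustration in plain Python, not the code in formula_descriptors.py. The function name `mixture_vector`, the compound names, and the toy 3-dimensional vectors are hypothetical, and the check that mole fractions sum to 1 is an assumption about the input data.

```python
def mixture_vector(component_vectors, mole_fractions, tol=1e-6):
    """Weighted sum of per-compound vectors, weighted by mole fraction.

    component_vectors: dict mapping compound name -> list of floats
    mole_fractions:    dict mapping compound name -> mole fraction
    """
    if component_vectors.keys() != mole_fractions.keys():
        raise ValueError("each compound needs both a vector and a mole fraction")
    # Assumption: fractions are normalized so that they sum to 1.
    if abs(sum(mole_fractions.values()) - 1.0) > tol:
        raise ValueError("mole fractions must sum to 1")

    dim = len(next(iter(component_vectors.values())))
    mixed = [0.0] * dim
    for name, fraction in mole_fractions.items():
        vec = component_vectors[name]
        if len(vec) != dim:
            raise ValueError(f"vector for {name!r} has the wrong length")
        for i, value in enumerate(vec):
            mixed[i] += fraction * value
    return mixed


# Toy 3-dimensional vectors standing in for embeddings or descriptors.
vectors = {
    "compound_a": [1.0, 0.0, 2.0],
    "compound_b": [0.0, 4.0, 2.0],
}
fractions = {"compound_a": 0.25, "compound_b": 0.75}
mixture = mixture_vector(vectors, fractions)
print(mixture)
```

The mixture vector has the same length as each compound vector, so a model trained on single compounds and one trained on mixtures can share the same input format. With NumPy the same operation reduces to a single matrix-vector product.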
Each subdirectory contains its own README with specific instructions for using the code and models in that directory. In general, the workflow involves:
- Generating molecular representations (either ChemBERTa embeddings or RCDK descriptors)
- Training machine learning models (typically Random Forest Regression)
- Evaluating model performance on test data
- Using trained models to predict toxicity of new compounds or mixtures
The project uses a combination of Python and R libraries:
Python:
- pandas, numpy, scikit-learn, matplotlib, pickle
- PyTorch, Transformers (for ChemBERTa fine-tuning)
R:
- rcdk, readr, mlbench, caret, dplyr, pROC
The project uses several types of data:
- Chemical structures represented as SMILES strings
- Molecular descriptors and embeddings
- Cytotoxicity measurements for individual compounds and mixtures
- Mixture composition information (compound names and mole fractions)
Model performance for both individual compounds and mixtures is documented by the evaluation metrics and visualization plots included in the respective directories.