This is a preliminary code release for the pocket-interacting foundation model "PocketXMol".
Please note that this code is a preview version and has yet to be cleaned up and refactored. It may be hard to read but should be functional for running. We will continue to improve the code and provide more detailed instructions in the future.
This repository is modified from the MolDiff repository (a good starting point for diffusion-based molecular generation). We thank the authors for their work.
To set up the environment on a Linux server, you can use Anaconda to create a new environment pxm from the environment.yml file (for CUDA 11.7) using the following commands (this takes several minutes):
conda env create -f environment.yml
conda activate pxm
If you have a different CUDA version, you may need to modify the versions of the pytorch-related packages in the environment.yml file.
Alternatively, you can install the dependencies manually. For example, for CUDA 12.6, use the following commands:
pip install torch --index-url https://download.pytorch.org/whl/cu126
pip install pytorch-lightning
pip install torch_geometric
pip install torch_scatter torch_sparse torch_cluster -f https://data.pyg.org/whl/torch-2.6.0+cu126.html
pip install biopython==1.83 rdkit==2023.9.3 peptidebuilder==1.1.0
pip install openbabel==3.1.1.1 # or, conda install -c conda-forge openbabel -y
pip install lmdb easydict==1.9 numpy==1.24 pandas==1.5.2
pip install tensorboard
The example data are included in the data/examples directory and are used to demonstrate the usage in the "Sample for provided data" section.
The model weights for sampling are included in the file model_weights.tar.gz available from the Google Drive.
Download and extract it using the command:
tar -zxvf model_weights.tar.gz
After extraction, there will be a directory named data/trained_model which contains the trained model weights for sampling.
For sampling for test sets, the processed test data and trained model weights are included in the file data_test.tar.gz available from the Google Drive.
Download and extract it using the command:
tar -zxvf data_test.tar.gz
After extraction, there will be a directory named data which contains:
- test sets: `test` for benchmark related information; `csd` for CrossDocked2020 set, `geom` for GEOM-Drug set, `moad` for Binding MOAD set, `pepbdb` for PepBDB set, `poseboff` for PoseBusters set, and `protacdb` for PROTAC-DB set.
- trained model weights in the `trained_model` directory for sampling.
- example data files (in `examples` directory) for demonstrating the sampling for user-provided files.
For training, the demonstrative processed training data are in the file data_train_processed_reduced.tar.gz from the Google Drive.
The complete processed training data are too large (>500G) so we provide a reduced subset just to demonstrate the training process. Similarly, download and extract it using the command:
tar -zxvf data_train_processed_reduced.tar.gz
Then there is a directory named data_training containing reduced training sets for demonstrative training.
If you want to train the model with the full training data, please follow the instructions in the process/process_steps.md file to process the raw data for complete training.
We provide interactive Colab notebooks for sampling. You can find the notebooks in the notebooks directory.
Here, we demonstrate some examples of sampling using the provided data in the data/examples directory.
Run the following command:
python scripts/sample_use.py \
--config_task configs/sample/examples/dock_smallmol.yml \
--outdir outputs_examples \
--device cuda:0
NOTE:
- The batch size for sampling is defined in the configuration files. If the batch size is too large for your GPU memory, please reduce `batch_size` in the configuration files or directly set the batch size in the command line (e.g., `--batch_size 100`).
- After sampling, there will be a new directory in the specified `outdir` containing the generated results. The new directory is named as `{exp_name}_{timestamp}` where `exp_name` is created from the names of the configuration file and `timestamp` is the time when the experiment starts. Within it, the `{exp_name}_{timestamp}_SDF` subdirectory contains the generated molecules, the `SDF` subdirectory contains the sampling trajectory (if any), and files `gen_info.csv` and `log.txt` contain the generation information.
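The naming scheme above makes it easy to locate the most recent run of a given experiment. The following is a minimal sketch, not part of the repository: the helper names `parse_run_name` and `latest_run` are hypothetical, and the timestamp layout `YYYYMMDD_HHMMSS` is an assumption inferred from example names such as `base_pxm_20241030_225401`.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed layout of the timestamp suffix (inferred from the example names, not from the code).
_RUN_NAME = re.compile(r"^(?P<exp_name>.+)_(?P<stamp>\d{8}_\d{6})$")


def parse_run_name(dirname):
    """Split '{exp_name}_{timestamp}' into (exp_name, datetime); return None if it does not match."""
    match = _RUN_NAME.match(dirname)
    if match is None:
        return None
    try:
        started = datetime.strptime(match.group("stamp"), "%Y%m%d_%H%M%S")
    except ValueError:
        return None  # digits present but not a valid date/time
    return match.group("exp_name"), started


def latest_run(outdir, exp_name):
    """Return the newest run directory of exp_name inside outdir, or None if there is none."""
    runs = []
    for child in Path(outdir).iterdir():
        parsed = parse_run_name(child.name) if child.is_dir() else None
        if parsed is not None and parsed[0] == exp_name:
            runs.append((parsed[1], child))
    return max(runs, key=lambda run: run[0])[1] if runs else None
```

For example, `latest_run("outputs_examples", "dock_smallmol_pxm")` would return the newest matching directory, where the experiment name used here is only illustrative.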
The configuration files are in configs/sample/examples, including:
- Docking
  - `dock_smallmol.yml`: dock a small molecule to a protein pocket
  - `dock_smallmol_flex.yml`: dock a small molecule to a protein pocket using flexible noise
  - `dock_smallmol_84663.yml`: dock the molecule 84663 to caspase-9
  - `dock_pep.yml`: dock a peptide to a protein pocket
  - `dock_pep_fix_some.yml`: dock, with fixed coordinates of some atoms
  - `dock_pep_know_some.yml`: dock, with constrained coordinates of some atoms
- Small molecule design
  - `sbdd.yml`: design drug-like molecules for a protein pocket
  - `sbdd_simple.yml`: design drug-like molecules for a protein pocket (no refinement rounds)
  - `growing_unfixed_frag.yml`: fragment growing with unfixed fragment pose, i.e., design small molecules containing a specified fragment graph for a protein pocket
  - `growing_fixed_frag.yml`: fragment growing with fixed fragment pose, i.e., design small molecules containing a specified fragment with fixed pose for a protein pocket
- Peptide design
  - `pepdesign.yml`: design peptides for a protein pocket
- Design with customized settings:
  - `pepdesign_hot136E`: this directory considers a specific peptide design case. Based on the PD1-PDL1 complex (PDB ID: 3BIK), we found a hot spot residue 136E on PD1 interacting with PDL1. We aim to design a PDL1-binding peptide considering this interaction. We extract the protein fragment around 136E as the input peptide and the PDL1 chain as the target (in `data/examples/hot136E`). There are several strategies for designing peptides (see Customized setting explanation for configuration explanation):
    - `fixed_Glu_CCOOH`: design a peptide whose 6th residue (counting from 1) is Glu, with the pose of its -CCOOH group fixed as in the input.
    - `fixed_CCOOH`: the designed peptide contains a -CCOOH group whose pose is fixed as in the input. However, the -CCOOH group may not be at the 6th residue and may not belong to Glu (Asp can also contain -CCOOH).
    - `fixed_CCOOH_init0.9`: the setting is the same as `fixed_CCOOH`, but the initial noisy peptide is generated by adding noise to the peptide of the input file, instead of being sampled from the noise prior. The only difference is the parameter `noise/init_step`. Hint: by setting `noise/init_step` $< 1$, the initial noisy coordinates will be sampled from the Gaussian noise with mean equal to the input coordinates instead of the noise space center.
    - `unfixed_Glu`: design peptides with Glu at the 6th residue. No atom coordinates are fixed.
    - `unfixed_CCOOH`: design peptides containing a -CCOOH group (i.e., containing Glu or Asp), but the -CCOOH group can be at any residue index. No atom coordinates are fixed.
    - `unfixed_CCOOH_from_inputs`: the setting is the same as `unfixed_CCOOH`, but the initial pose of the -CCOOH group is sampled based on the input peptide. This aligns with our intuition that the -CCOOH group in the designed peptide should interact with the protein in a similar way as the input one.
More examples are on the way.
The self-confidence scores are in the cfd_traj column of the gen_info.csv file produced during the sampling process. To calculate other confidence scores for the generated molecules, use a command like this:
python scripts/believe_use_pdb.py \
--exp_name pepdesign_pxm_20210101_150132 \
--result_root outputs_use \
--config configs/sample/confidence/tuned_cfd.yml \
--device cuda:0
The parameters:
- `result_root` is the directory containing the sampling experiments (equal to the parameter `outdir` of the sampling command).
- `exp_name` is the name of the sampling experiment directory (looks like `pepdesign_pxm_20210101_150132`). If there is only one experiment with the name starting with the `exp_name`, the appended timestamp can be omitted (i.e., set as `pepdesign_pxm`).
- `config` is the confidence model configuration file. They are in `configs/sample/confidence` including:
  - `tuned_cfd.yml`: the tuned confidence predictor
  - `flex_cfd.yml`: using original model with flexible noise for confidence prediction
After running, the .csv files of confidence scores will be saved at the ranking sub-directory.
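To shortlist generated molecules by the self-confidence score, the `cfd_traj` column of `gen_info.csv` can be read with the standard library. This is a minimal sketch under two assumptions that the text above does not confirm: that a larger `cfd_traj` value means higher confidence, and that the file is a plain comma-separated table with a header row. The function name `top_by_self_confidence` is hypothetical.

```python
import csv
import math


def top_by_self_confidence(gen_info_csv, k=5):
    """Return the k rows of gen_info.csv with the largest cfd_traj values.

    Assumption: larger cfd_traj means more confident. Rows with a missing,
    non-numeric or NaN score are skipped.
    """
    with open(gen_info_csv, newline="") as handle:
        rows = list(csv.DictReader(handle))
    scored = []
    for row in rows:
        try:
            score = float(row["cfd_traj"])
        except (KeyError, TypeError, ValueError):
            continue  # no usable score in this row
        if not math.isnan(score):
            scored.append((score, row))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [row for _, row in scored[:k]]
```

Each returned row is a dictionary keyed by the column names of the CSV file, so the remaining columns stay available for locating the corresponding output files.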
You can refer to these configuration files and adapt to your own data and tasks. Here are some simple explanations of the configuration.
Typically there are five main blocks: sample, data, transforms, task, and noise. The first three keys define the data and sampling parameters, and the last two define the task.
In most cases, you only need to find a task template configuration file and modify the first three blocks.
- `sample`: the sampling parameters, including base random seed, batch size, and the number of generated molecules. The parameter `save_traj_prob` controls how often the generation trajectories are saved.
- `data`: the input data, including
  - `protein_path`: the path to the protein PDB file.
  - `input_ligand`: the information of input ligand.
    - For small molecule, it can be an SDF file path, the SMILES string or `None` (for de novo small molecule design).
    - For peptide, it can be the PDB file path, the sequence string (with prefix `pepseq_`, e.g., `pepseq_DTVFALFW`, for docking) or the sequence length (with prefix `peplen_`, e.g., `peplen_10`, for de novo design).
  - `is_pep`: bool, whether the ligand is a peptide. It is used to create the PDB files for the generated molecules. If not set, it will be automatically determined according to `input_ligand`.
  - `pocket_args`: dict of pocket parameters, including
    - `ref_ligand_path`: path to the reference molecule file (SDF or PDB). This molecule is used to determine the pocket from the complete protein, i.e., the residues within a certain distance to the reference molecule are defined as pocket residues. Mutually exclusive with `pocket_coord`.
    - `pocket_coord`: the coordinate of the pocket. The pocket will be defined as the residues near the coordinate. Mutually exclusive with `ref_ligand_path`. If neither `ref_ligand_path` nor `pocket_coord` is set, it will use `input_ligand` as reference.
    - `radius`: the residues within the radius to the reference ligand or the pocket coordinate are defined as pocket residues. Default is 10.
    - `criterion`: the criterion to define the residue distance, one of ['center_of_mass', 'min']. Default is 'center_of_mass'.
  - `pocmol_args`: user-defined identifiers. Not important.
    - `data_id`
    - `pdbid`
- `transforms` (optional): the extra data processing parameters, including
  - `featurizer_pocket`:
    - `center`: coordinate space center for denoising. It influences sampling atom coordinates from the Gaussian noise at the first step. If not set, it will be automatically defined as the average of pocket atom coordinates (you can also use `featurizer/mol_as_pocket_center` to specify the pocket center). This is useful when you have prior knowledge of the space for generation. For example, for linker design, you can set the center as the midpoint of the two fragments to be linked.
  - `featurizer`:
    - `mol_as_pocket_center`: bool, use the center coordinates of the ligand as the space center. If set to `True`, the parameter `data/pocket_args/input_ligand` should be an SDF/PDB file. (You can also use `featurizer_pocket/center` to specify the pocket center.)
  - `variable_mol_size`: distributions of the number of atoms for small-molecule design tasks. It will automatically add or remove atoms from the input ligand. Remember to set its `not_remove` parameter if you want to exclude some atoms from being removed (see example usage in `growing_fixed_frag.yml` and `growing_unfixed_frag.yml`).
  - `variable_sc_size`: distributions of the number of side-chain atoms for peptide design. The default value should work well.
- `task`: the task and its specific mode.
- `noise`: the task noise parameters.
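The rules for `input_ligand` and `pocket_args` described above can be summarized in a short sketch. It is an illustration of the documented conventions only, not the repository's parsing code: the function names, the returned labels, the file-extension checks, and the treatment of the string "None" are all assumptions made for this example.

```python
def classify_input_ligand(value):
    """Classify an input_ligand value following the prefixes described above.

    Assumptions: file inputs are recognised by their extension, and anything
    that is neither a prefix form nor a file path is treated as SMILES.
    """
    if value is None or str(value) == "None":
        return ("denovo_small_molecule", None)
    text = str(value)
    if text.startswith("pepseq_"):
        return ("peptide_sequence", text[len("pepseq_"):])
    if text.startswith("peplen_"):
        return ("peptide_length", int(text[len("peplen_"):]))
    if text.lower().endswith(".sdf"):
        return ("sdf_file", text)
    if text.lower().endswith(".pdb"):
        return ("pdb_file", text)
    return ("smiles", text)


def resolve_pocket_args(pocket_args=None):
    """Fill in the documented defaults and reject conflicting pocket references."""
    args = dict(pocket_args or {})
    if args.get("ref_ligand_path") is not None and args.get("pocket_coord") is not None:
        raise ValueError("ref_ligand_path and pocket_coord are mutually exclusive")
    args.setdefault("radius", 10)                    # documented default
    args.setdefault("criterion", "center_of_mass")   # documented default
    if args["criterion"] not in ("center_of_mass", "min"):
        raise ValueError("criterion must be 'center_of_mass' or 'min'")
    return args
```

For instance, `classify_input_ligand("peplen_10")` yields a peptide length of 10 for de novo design, while `resolve_pocket_args({"pocket_coord": [1.0, 2.0, 3.0]})` keeps the coordinate and adds the default radius and criterion.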
Here we explain the customized settings in the examples/pepdesign_hot136E directory.
In these settings, we defined a task called custom. (Basically, all the previous common tasks can be expressed through this custom task.)
The basic idea is to (1) define several groups of noise, (2) partition the molecules into several parts, and (3) map the noise groups to the molecule parts.
In their config files, the sample and data blocks are the same as the common tasks. For other blocks:
- `transforms`: similar to the common tasks, but with some additional settings:
  - The `variable_sc_size/applicable_tasks` should contain the task name `custom`.
  - Some side-chain atoms of input peptides will be randomly removed for variable sizes. Set `variable_sc_size/not_remove` to exclude side-chain atoms from being removed. This is a list of atom indices in the input peptide (starting from 0).
- `task`: In `task/transform`, please define:
  - `is_peptide`: whether the task is related to peptide or small molecule. This is the prompt $\mathbf{P}^{\text{pep}}$ in the paper.
  - `partition`: this is where you define how you partition the molecule. It is a list of dictionaries, and each dictionary contains:
    - `name`: the name of the part.
    - `nodes`: the atom indices of the part. The atom indices are 0-based.
  - `fixed`: define which variables are fixed as input, including:
    - `node`: list of molecular parts whose atom types are fixed.
    - `pos`: list of molecular parts whose atom coordinates are fixed.
    - `edge`: list of molecular part pairs whose inner bond types are fixed.
- `noise`:
  - `num_steps`: the number of sampling steps. It is an integer and $100$ should work well.
  - `init_step`: the initial step of noise. It is a scalar in $(0, 1]$ and the default is $1$. During the sampling, the step will decay from `init_step` to $0$ linearly. A larger value means more noise. If it is set as $1$, the initial noisy molecule will be sampled from the noise prior without considering the input molecules. Specifically, the coordinates will be sampled from the Gaussian noise with mean equal to the noise space center. If it is less than $1$ (and the parameter `from_prior` in the noise group is not set as `False` (default)), the initial noisy coordinates will be sampled from the Gaussian noise with mean equal to the input coordinates instead of the noise space center.
  - `prior`: define the noise prior distributions for different noise groups. This is a dictionary, and each key is the noise group name and the value is the noise prior distribution. Tips:
    - For each noise group, define the noise prior distributions for `node` (atom type), `pos` (atom coordinate), and `edge` (bond type). You can refer to the noise prior settings in the training configuration file (`configs/train/train_pxm_reduced.yml`).
    - If there is only `pos` noise, you can set `pos_only` as `True`.
    - Set `from_prior` as `True` (default) to sample the initial noisy coordinates completely from the noise prior. If you want to consider the input coordinates, you can set `from_prior` as `False` so that the initial noisy coordinates are sampled based on the input coordinates rather than from the noise prior (see `unfixed_CCOOH_from_inputs.yml`), even if `noise/init_step` is set as $1$. This is useful when some atom coordinates of the input molecule can provide a good starting point for the generation or their approximate coordinates are known.
  - `level`: define information level strategies for different noise groups. An information level strategy controls the noise scale at each step, i.e., it is a mapping from the step to the information level (a value in $[0,1]$; information level $= 1 -$ noise level). Tips:
    - You can refer to the level settings in the training configuration file (`configs/train/train_pxm_reduced.yml`).
    - Usually the uniform level should work well for de novo generation. If you want to preserve more information of the input file, you can set the `min` level as a larger value.
  - `mapper`: define the mapping from the noise groups to the molecule parts. This is a dictionary, and each key is the noise group name and the value is the molecule part name of the variables `node` (atom type), `pos` (atom coordinate), and `edge` (bond type).
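Two of the ideas above lend themselves to a small worked example: the linear decay of the step from `init_step` to $0$, and the `partition` list of named parts with 0-based atom indices. The sketch below is illustrative only. Whether the sampler includes both endpoints of the schedule, and whether parts are required to be non-overlapping, are assumptions of this example rather than statements about the implementation; the function names are hypothetical.

```python
def linear_steps(init_step=1.0, num_steps=100):
    """Return num_steps + 1 values decaying linearly from init_step to 0.

    Assumption: both endpoints are included; the actual sampler may differ.
    """
    if not 0.0 < init_step <= 1.0:
        raise ValueError("init_step must lie in (0, 1]")
    if num_steps < 1:
        raise ValueError("num_steps must be a positive integer")
    return [init_step * (num_steps - i) / num_steps for i in range(num_steps + 1)]


def check_partition(partition, num_atoms):
    """Validate a partition list and return a mapping from atom index to part name.

    Checks: part names are unique, indices are 0-based and in range, and
    (assumed here) no atom belongs to two parts.
    """
    names = set()
    owner = {}
    for part in partition:
        name = part["name"]
        if name in names:
            raise ValueError(f"duplicate part name: {name}")
        names.add(name)
        for index in part["nodes"]:
            if not 0 <= index < num_atoms:
                raise ValueError(f"atom index {index} out of range in part {name}")
            if index in owner:
                raise ValueError(f"atom {index} is in both {owner[index]} and {name}")
            owner[index] = name
    return owner
```

With `init_step=0.9` the schedule starts below full noise, which matches the case where the initial coordinates are drawn around the input coordinates.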
We provide the configuration files for sampling in the test sets of individual tasks.
NOTE:
- The batch size for sampling is defined in the configuration files. They were verified on an 80G A100 GPU. If the batch size is too large for your GPU memory, please reduce `batch_size` in the configuration files or directly set the batch size in the command line (e.g., `--batch_size 100`).
- Typical running time for individual test sets is around 1 to 6 hours on a single A100 GPU.
- After sampling, there will be a new directory in the specified `outdir` containing the generated results. The new directory is named as `{exp_name}_{timestamp}` where `exp_name` is created from the names of the configuration file and `timestamp` is the time when the experiment starts. Within it, the `SDF` subdirectory contains the generated molecules, and files `gen_info.csv` and `log.txt` contain the generation information.
Sample docking poses for 428 pairs of protein and small-molecule in the PoseBusters set.
python scripts/sample_drug3d.py \
--config_task configs/sample/test/dock_poseboff/base.yml \
--outdir outputs_test/dock_posebusters \
--device cuda:0
The task configuration files are in `configs/sample/test/dock_poseboff`.
Configuration files include:
- `base.yml`: dock using Gaussian noise (default)
- `base_flex.yml`: dock using flexible noise
- `prior_center.yml`: dock with prior knowledge of the molecular center
- `prior_bond_length.yml`: dock with prior knowledge of bond length
- `prior_anchor.yml`: dock with prior knowledge of approximate anchor atom coordinate
- `prior_fix_anchor.yml`: dock with fixed anchor atom coordinate
The self-confidence scores are in the cfd_traj column of the gen_info.csv file produced during the sampling process. To calculate other confidence scores for the generated molecular poses, use the following command:
python scripts/believe.py \
--exp_name base_pxm \
--result_root outputs_test/dock_posebusters \
--config configs/sample/confidence/tuned_cfd.yml \
--device cuda:0
The parameters:
- `result_root` is the directory containing the sampling experiments (equal to the parameter `outdir` of the sampling command).
- `exp_name` is the name of the sampling experiment directory (looks like `base_pxm_20241030_225401`). If there is only one experiment with the name starting with the `exp_name`, the appended timestamp can be omitted (i.e., set as `base_pxm`).
- `config` is the confidence model configuration file. They are in `configs/sample/confidence` including:
  - `tuned_cfd.yml`: the tuned confidence predictor
  - `flex_cfd.yml`: using original model with flexible noise for confidence prediction
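The rule that the timestamp may be omitted when only one experiment starts with the given name can be pictured as a prefix lookup. The sketch below mirrors that documented behaviour; it is not taken from `scripts/believe.py`, and the function name `resolve_experiment` as well as the choice to raise an error on ambiguity are assumptions of this example.

```python
from pathlib import Path


def resolve_experiment(result_root, exp_name):
    """Resolve exp_name (with or without timestamp) to a single experiment directory."""
    root = Path(result_root)
    exact = root / exp_name
    if exact.is_dir():
        return exact  # the full name, timestamp included, was given
    matches = sorted(
        child for child in root.iterdir()
        if child.is_dir() and child.name.startswith(exp_name)
    )
    if len(matches) != 1:
        found = [child.name for child in matches]
        raise ValueError(f"expected exactly one experiment starting with {exp_name!r}, found {found}")
    return matches[0]
```

If several runs share the same prefix, passing the full directory name including the timestamp avoids the ambiguity.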
To get the ranking scores for pose selection, after obtaining the confidence scores, use the following command:
python scripts/rank_pose.py \
--exp_name base_pxm \
--result_root outputs_test/dock_posebusters \
--db poseboff
This produces the ranking.csv file, which contains the self_ranking and tuned_ranking columns as ranking scores.
Sample docking poses for 79 pairs of protein and peptide in the peptide docking test set.
python scripts/sample_pdb.py \
--config_task configs/sample/test/dock_pepbdb/base.yml \
--outdir outputs_test/dock_pepbdb \
--device cuda:0
The task configuration files are in `configs/sample/test/dock_pepbdb`.
Configuration files include:
- `base.yml`: dock using Gaussian noise (default)
- `base_flex.yml`: dock using flexible noise
- `prior_fix_anchor.yml`: dock with fixed anchor atom coordinate
- `prior_fix_first_residue.yml`: dock with fixed first residue atom coordinates
- `prior_fix_terminal_residue.yml`: dock with the atom coordinates of both terminal residues fixed
- `prior_fix_backbone.yml`: dock with fixed backbone atom coordinates
Sample molecular conformations for the 199 molecules in the conformation test set.
python scripts/sample_drug3d.py \
--config_task configs/sample/test/conf_geom/base.yml \
--outdir outputs_test/conf_geom \
--device cuda:0
Sample drug-like molecules for the 100 protein pockets in the SBDD test set.
python scripts/sample_drug3d.py \
--config_task configs/sample/test/sbdd_csd/base.yml \
--outdir outputs_test/sbdd_csd \
--device cuda:0
The task configuration files are in `configs/sample/test/sbdd_csd`.
Configuration files include:
- `base.yml`: sbdd using refine-based sampling strategy (default)
- `ar.yml`: sbdd using an auto-regressive-like sampling strategy
- `simple.yml`: sbdd with only one generation round, not using confidence scores for sampling
- `base_mol_size.yml`: sbdd using refine-based sampling strategy with molecular sizes determined from reference molecules
Generate drug-like molecules with the sizes as the GEOM-Drug validation set.
python scripts/sample_drug3d.py \
--config_task configs/sample/test/denovo_geom/base.yml \
--outdir outputs_test/denovo_geom \
--device cuda:0
The task configuration files are in `configs/sample/test/denovo_geom`.
Configuration files include:
- `base.yml`: molecule generation using refine-based sampling strategy (default)
- `ar.yml`: molecule generation using an auto-regressive-like sampling strategy
- `simple.yml`: molecule generation with only one generation round, not using confidence scores for sampling
Design molecules by linking fragments for the 416 pairs of proteins and fragments in the fragment linking test set.
python scripts/sample_drug3d.py \
--config_task configs/sample/test/linking_moad/known_connect.yml \
--outdir outputs_test/linking_moad \
--device cuda:0
The task configuration files are in `configs/sample/test/linking_moad`.
Configuration files include:
- `known_connect.yml`: fragment linking with known connecting atoms of fragments
- `unknown_connect.yml`: fragment linking with unknown connecting atoms of fragments
Design PROTAC molecules by linking fragments for the 43 fragment pairs in the PROTAC-DB test set.
python scripts/sample_drug3d.py \
--config_task configs/sample/test/linking_protacdb/fixed_fragpos.yml \
--outdir outputs_test/linking_protacdb \
--device cuda:0
The task configuration files are in `configs/sample/test/linking_protacdb`.
Configuration files include (all assume known connecting atoms of fragments):
- `fixed_fragpos.yml`: fragment linking with fixed fragment poses
- `unfixed_lv0.yml` - `unfixed_lv4.yml`: fragment linking with unfixed fragment poses. The input fragment poses were derived by randomly perturbing the true fragment poses with different levels of noise (lv0 = smallest).
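When several configuration files of one task are to be sampled in turn, the commands can be assembled programmatically. The sketch below only builds and prints the command lines using the flags shown in this document (`--config_task`, `--outdir`, `--device`, `--batch_size`); it does not run them. The helper name `build_sampling_command` is hypothetical, and the file names `unfixed_lv1.yml` to `unfixed_lv3.yml` are inferred from the range given above.

```python
import shlex


def build_sampling_command(config_task, outdir, device="cuda:0",
                           batch_size=None, script="scripts/sample_drug3d.py"):
    """Assemble one sampling command as an argument list (nothing is executed)."""
    command = ["python", script,
               "--config_task", config_task,
               "--outdir", outdir,
               "--device", device]
    if batch_size is not None:
        command += ["--batch_size", str(batch_size)]
    return command


protac_dir = "configs/sample/test/linking_protacdb"
# Assumed file names: lv0 through lv4, following the range listed above.
config_names = ["fixed_fragpos.yml"] + [f"unfixed_lv{level}.yml" for level in range(5)]
commands = [
    build_sampling_command(f"{protac_dir}/{name}", "outputs_test/linking_protacdb")
    for name in config_names
]

for command in commands:
    print(shlex.join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` to launch the runs one after another on the same device.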
Design molecules through growing fragments for the 53 pairs of fragment and protein in the fragment growing test set.
python scripts/sample_drug3d.py \
--config_task configs/sample/test/growing_csd/base.yml \
--outdir outputs_test/growing_csd \
--device cuda:0
The task configuration file is configs/sample/test/growing_csd/base.yml.
Design peptides for the 35 protein pockets in the peptide design test set.
python scripts/sample_pdb.py \
--config_task configs/sample/test/pepdesign_pepbdb/base.yml \
--outdir outputs_test/pepdesign_pepbdb \
--device cuda:0
The task configuration file is configs/sample/test/pepdesign_pepbdb/base.yml.
Design peptides for the 35 pairs of backbone structures and protein pockets in the peptide design test set.
python scripts/sample_pdb.py \
--config_task configs/sample/test/pepinv_pepbdb/base.yml \
--outdir outputs_test/pepinv_pepbdb \
--device cuda:0
The task configuration file is configs/sample/test/pepinv_pepbdb/base.yml.
Make sure to download and extract the training data data_train_processed_reduced.tar.gz as described in the Data and model weights section.
Then run the following command to train the model with reduced data:
python scripts/train_pl.py --config configs/train/train_pxm_reduced.yml --num_gpus 1
You can specify the number of GPUs to use by setting the num_gpus parameter.
The training configuration file is defined in configs/train/train_pxm_reduced.yml.
You can change the batch_size parameter in the configuration file to adjust to your GPU memory.
If you want to train the model with the full training data, please follow the instructions in the Raw data and processing steps section to process the raw data for training. Then, modify data.dataset.root and data.dataset.assembly_path in the training configuration file to point to the full training data directory and run the training command as above.
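The dotted names `data.dataset.root` and `data.dataset.assembly_path` refer to nested keys of the training configuration. As a minimal sketch of that edit, the helper below sets such a key on a nested dictionary. Loading and saving the YAML file itself is left out, the helper name `set_by_dotted_key` is hypothetical, and the paths in the usage note are placeholders rather than real locations.

```python
import copy


def set_by_dotted_key(config, dotted_key, value):
    """Return a copy of a nested dict with the key 'a.b.c' set to value.

    Missing intermediate dictionaries are created; the input is not modified.
    """
    updated = copy.deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated
```

For example, starting from a dictionary `config` read from `configs/train/train_pxm_reduced.yml`, calling `set_by_dotted_key(config, "data.dataset.root", "path/to/full_data")` and then the same for `data.dataset.assembly_path` gives the configuration for the full training data.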