This repository contains the codebase and results for GOLF:
- Multiple Sequence Alignment (MSA) generation from protein sequence data.
- Phylogenetic analysis to understand evolutionary relationships.
- Training and fine-tuning protein language models:
  - Evolutionary model of Variant Effect (EVE) ensemble.
  - ESM-1b model.
- Interpreting variant effects using Sparse Autoencoders (SAEs) on ESM2 embeddings.
The overall process is detailed below:
- Sequence Collection: Homologous sequences are collected using `jackhmmer` against the UniRef50 database.
- MSA Creation: After removing duplicate sequences, `MMseqs2` (specifically the `easy-cluster` command) is used to generate a Multiple Sequence Alignment (MSA) with a sequence identity threshold of 95% and coverage of 80%.
- This MSA serves as the primary input for training the EVE models and fine-tuning the ESM-1b model.
- A phylogenetic tree is constructed to analyze the evolutionary context of the sequences, using an MSA clustered at 80% sequence identity.
- The `phylo/` directory houses all relevant scripts and data for this stage.
- Tree Construction: IQ-TREE is used to build the phylogenetic tree. The specific command can be found in `phylo/README.md`.
  - Input: `phylo/MSA_80cluster.fas`
  - Output: `phylo/MSA_80cluster.treefile` (raw tree), `phylo/MSA_80cluster_cleaned.treefile` (processed tree).
- Processing and Visualization: The script `phylo/clean_tree.py` standardizes leaf names in the tree and generates annotation files (e.g., `phylo/dataset_colorstrip.txt`) for enhanced visualization with iTOL (Interactive Tree Of Life).
- The Evolutionary model of Variant Effect (EVE) is utilized to predict variant effects.
- Model Training:
  - The core Variational Autoencoder (VAE) for EVE models is trained using a script analogous to `train_VAE.py`, taking the MSA as input.
  - Evolutionary indices, a key output from EVE models indicating variant impact, are computed from the VAE's latent space using a script like `compute_evol_indices.py`.
- Ensemble Creation: An ensemble of EVE models is created by training multiple models, typically with different random seeds. These models, along with their configurations (`model_params.json`), checkpoints, logs, and computed `evol_indices`, are stored in subdirectories within `EVE Ensemble/` (e.g., `OLF-40_seed100_seed100_theta0.25_ld40_lr0.0001/`).
- Ensemble Performance Assessment: The `Ensemble Analysis/` directory contains scripts for evaluating the EVE ensemble:
  - `ensemble_evol_indices_analysis.py`: Aggregates `evol_indices` from the individual models in the ensemble, then applies a Gaussian Mixture Model (GMM) analysis via `train_GMM_and_compute_EVE_scores.py` to convert evolutionary indices into pathogenicity scores (sketched below), and calculates prediction accuracy against a ground-truth set of mutations.
  - `visualize_ensemble_results.py`: Generates plots illustrating the ensemble's accuracy, its improvement over individual models, and other performance metrics.
  - Analysis results, including plots and summary CSV files, are saved in this directory.
- Relevant data sources for running the aforementioned scripts are found in `EVE Data/`.
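For orientation, here is a minimal sketch of the GMM scoring idea, assuming scikit-learn; the file and column names are placeholders, and `train_GMM_and_compute_EVE_scores.py` remains the authoritative implementation:

```python
# Sketch only: convert aggregated evolutionary indices into EVE-style
# pathogenicity scores with a two-component Gaussian Mixture Model.
import pandas as pd
from sklearn.mixture import GaussianMixture

# Hypothetical aggregate of per-mutation evolutionary indices.
indices = pd.read_csv("ensemble_evol_indices.csv")
X = indices[["evol_index"]].to_numpy()  # column name is an assumption

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Treat the component with the higher mean index as "pathogenic"; the
# score is the posterior probability of that component for each mutation.
pathogenic = int(gmm.means_.flatten().argmax())
indices["EVE_score"] = gmm.predict_proba(X)[:, pathogenic]
indices.to_csv("eve_scores.csv", index=False)
```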
- The ESM-1b protein language model is fine-tuned on the OLF MSA to adapt it for variant effect prediction.
- Fine-tuning Process: The script `fine_tune_esm1b.py` manages this process. It loads the MSA, freezes the initial layers of the pre-trained ESM-1b model, and fine-tunes the subsequent layers (see the sketch below).
- Outputs: The fine-tuned model checkpoints (e.g., `esm1b_finetuned.pt`, `best_model.pt`), training logs, and related plots are stored in the `ESM/ESM1b/` directory.
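A minimal sketch of the layer-freezing scheme, assuming the `fair-esm` API; the exact logic lives in `fine_tune_esm1b.py`:

```python
# Sketch only: freeze the embedding and the first N transformer layers of
# ESM-1b, leaving the remaining layers trainable.
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()

NUM_FROZEN_LAYERS = 30  # mirrors the --num_frozen_layers example in Usage

for param in model.embed_tokens.parameters():
    param.requires_grad = False
for layer in model.layers[:NUM_FROZEN_LAYERS]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the unfrozen parameters would be handed to the optimizer.
trainable_params = [p for p in model.parameters() if p.requires_grad]
```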
- To understand the features learned by large protein models and how they relate to variant effects, a Sparse Autoencoder (SAE) is applied to ESM2 embeddings.
- The `SAE/` directory contains all scripts, configuration files, and detailed instructions for this analysis (see `SAE/README.md`).
- Probing SAE Latents:
  - `probe_sae.py`: Computes SAE activations from mean-pooled ESM2 embeddings (specifically from layer 24) for a set of variants, then trains a linear regression model (Ridge regression) to predict EVE scores from these SAE activations. The weights of this linear model indicate which SAE latent dimensions are most predictive of pathogenicity (see the probe sketch after this list).
- Visualization:
  - `visualize_sae.py`: Identifies the top SAE latent dimensions (both pathogenic- and benign-associated) based on the probe weights. It generates:
    - A PyMOL script (`highlight_units_layer24.pml`) to visualize these latents and their associated residues on the protein structure.
    - A text file (`highlighted_latents_layer24.txt`) summarizing residue-level associations for the top latents.
- Input data for this step typically includes a list of mutated sequences and their corresponding EVE scores (e.g., `SAE/mutated_sequences_with_scores.csv`).
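A minimal sketch of the Ridge probe, assuming scikit-learn and precomputed activation/score arrays (the file names are placeholders; `probe_sae.py` is the authoritative version):

```python
# Sketch only: a linear probe from SAE activations to EVE scores.
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical precomputed inputs:
#   X: (n_variants, n_sae_latents) SAE activations of mean-pooled
#      layer-24 ESM2 embeddings
#   y: (n_variants,) EVE scores
X = np.load("sae_activations_layer24.npy")
y = np.load("eve_scores.npy")

probe = Ridge(alpha=1.0).fit(X, y)

# Large positive weights mark pathogenic-associated latents; large
# negative weights mark benign-associated ones.
top_pathogenic = np.argsort(probe.coef_)[::-1][:10]
top_benign = np.argsort(probe.coef_)[:10]
```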
Key directories and files in this repository:
- `README.md`: This file.
- `fine_tune_esm1b.py`: Script for fine-tuning the ESM-1b model.
- `train_VAE.py`: Script for training the VAE component of EVE models.
- `compute_evol_indices.py`: Script to compute evolutionary indices from trained EVE models.
- `phylo/`: Contains scripts, data, and README for phylogenetic analysis.
  - `clean_tree.py`: Processes phylogenetic trees and generates iTOL annotations.
- `EVE Ensemble/`: Stores trained EVE models from multiple runs/seeds.
  - Each subdirectory contains model parameters, checkpoints, logs, and evolutionary indices.
- `Ensemble Analysis/`: Scripts and results for EVE ensemble performance analysis.
  - `ensemble_evol_indices_analysis.py`: Core script for ensemble evaluation.
  - `visualize_ensemble_results.py`: Generates plots for ensemble performance.
- `ESM/`: Contains fine-tuned ESM model artifacts.
  - `ESM1b/`: Fine-tuned ESM-1b model checkpoints, logs, and plots.
- `SAE/`: Scripts, data, and detailed README for Sparse Autoencoder analysis.
  - `sae.yml`: Conda environment definition for SAE tasks.
  - `probe_sae.py`: Script for training a linear probe on SAE embeddings.
  - `visualize_sae.py`: Script for visualizing predictive SAE latents.
- `utils/`: Helper functions for EVE.
- `data/`: Data used to construct the MSA.
- `examples/`: Example script runs for EVE.
General setup guidelines. For specific modules like the SAE, refer to their dedicated README files (e.g., `SAE/README.md`).
- Core Tools:
  - Ensure `jackhmmer` (from the HMMER suite) and `MMseqs2` are installed and available in your system's PATH.
- Phylogenetic Analysis:
  - IQ-TREE: Required for phylogenetic tree construction. Download from the IQ-TREE website.
  - Python: A Python environment with `pandas` is needed for `phylo/clean_tree.py`.
- EVE Model Training & Analysis:
  - The EVE framework relies on Python with standard scientific libraries (NumPy, Pandas) and PyTorch. Ensure these are installed (an example install command follows this list).
  - Refer to the EVE documentation for specific version requirements, if available.
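One way to install the core dependencies, assuming `pip` (versions are left unpinned here; scikit-learn is included for the GMM analysis):

```bash
pip install numpy pandas torch scikit-learn
```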
- ESM-1b Fine-tuning:
  - Python Environment: Requires PyTorch and the `esm` library by Facebook Research:

    ```bash
    pip install torch fair-esm matplotlib pandas tqdm
    ```

  - Hardware: A GPU is highly recommended for efficient fine-tuning.
- SAE Analysis:
  - A dedicated Conda environment is specified in `SAE/sae.yml`. Create and activate it:

    ```bash
    conda env create -f SAE/sae.yml
    conda activate sae
    ```

  - Follow the instructions in `SAE/README.md` to clone the `InterProt` repository and download the necessary ESM2 and SAE models, placing them in `SAE/models/`.
This section provides guidance on executing the different stages of the analysis pipeline.
- Use `jackhmmer` to search against UniRef50 and gather sequences.
- Process the output to remove duplicates.
- Use `MMseqs2 easy-cluster` (with 95% identity, 80% coverage) to generate the MSA. This MSA (e.g., `your_msa.a3m` or `your_msa.fasta`) will be used in subsequent steps (see the example commands below).
- Prepare a version of the MSA for phylogenetic analysis (e.g., clustered at 80% identity, `phylo/MSA_80cluster.fas`).
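Illustrative commands for the search and clustering steps; the flags shown are standard for these tools, but database paths and the exact options used in this project may differ:

```bash
# Iterative homology search against UniRef50, saving the hit alignment.
jackhmmer --cpu 8 -A hits.sto query.fasta uniref50.fasta

# Cluster the deduplicated sequences at 95% identity / 80% coverage.
mmseqs easy-cluster dedup_sequences.fasta clusterRes tmp \
    --min-seq-id 0.95 -c 0.8
```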
- Navigate to the `phylo/` directory.
- Ensure your MSA for phylogeny (e.g., `MSA_80cluster.fas`) is present.
- Run IQ-TREE using the command specified in `phylo/README.md` (an illustrative invocation is shown after this list).
- Execute `python clean_tree.py` to process the output tree and generate annotation files.
- Upload the cleaned treefile (e.g., `MSA_80cluster_cleaned.treefile`) and annotation files (e.g., `dataset_colorstrip.txt`) to iTOL for visualization.
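A typical IQ-TREE 2 invocation might look like the following; `phylo/README.md` has the exact options used in this project:

```bash
# Illustrative only: model selection via ModelFinder plus ultrafast bootstrap.
iqtree2 -s MSA_80cluster.fas -m MFP -B 1000 -T AUTO
```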
- Training an EVE Model:

  ```bash
  # Train the VAE component
  python train_VAE.py --msa_file path/to/your_msa.a3m \
      --output_dir path/to/your_eve_model_dir \
      ...

  # Compute evolutionary indices
  python compute_evol_indices.py --model_checkpoint path/to/your_eve_model_dir/checkpoints/best_model.pt \
      --msa_file path/to/your_msa.a3m \
      --output_file path/to/your_eve_model_dir/evol_indices/evol_indices.csv \
      ...
  ```

  Repeat with different seeds/configurations for ensemble members, storing outputs in `EVE Ensemble/` (see the loop sketch below).
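For example, ensemble members might be trained in a loop over seeds; the `--seed` flag is an assumption here, so check `train_VAE.py --help` for the actual argument:

```bash
# Hypothetical loop over random seeds for EVE ensemble members.
for seed in 100 200 300; do
    python train_VAE.py --msa_file path/to/your_msa.a3m \
        --output_dir "EVE Ensemble/OLF_seed${seed}" \
        --seed "${seed}"
done
```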
- Ensemble Analysis:
  - Navigate to `Ensemble Analysis/`.
  - Ensure paths to individual model `evol_indices` files (within `EVE Ensemble/`) are correctly referenced or configured within `ensemble_evol_indices_analysis.py`.
  - Provide ground-truth mutation data if required by the script.
  - Run the analysis:

    ```bash
    python ensemble_evol_indices_analysis.py
    ```

  - Generate plots using the output from the analysis:

    ```bash
    python visualize_ensemble_results.py --results_file path/to/ensemble_results.csv
    ```
- Prepare your MSA file (e.g., `your_msa.a3m` or `your_msa.fasta`).
- Run the fine-tuning script:

  ```bash
  python fine_tune_esm1b.py --msa_file path/to/your_msa.a3m_or_fasta \
      --output_dir ESM/ESM1b/my_finetuned_model \
      --model_name facebook/esm1b_t33_650M_UR50S \
      --epochs 5 \
      --batch_size 1 \
      --learning_rate 1e-5 \
      --num_frozen_layers 30 \
      ...
  ```

  The fine-tuned model and logs will be saved in the specified output directory.
- Activate the `sae` conda environment: `conda activate sae`.
- Ensure the ESM2 and SAE models are downloaded and correctly placed in `SAE/models/` as per `SAE/README.md`.
- Navigate to the `SAE/` directory.
- Prepare your input data: `mutated_sequences_with_scores.csv` (containing sequence variants and their EVE scores or other pathogenicity labels).
- Run the SAE probing script:

  ```bash
  python probe_sae.py
  ```

  This will generate files like `results/weights/sae_raw_layer24.csv`.
- Run the visualization script:

  ```bash
  python visualize_sae.py
  ```

  This generates `highlight_units_layer24.pml` for PyMOL and `highlighted_latents_layer24.txt` (the PyMOL script can be run as shown below).
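Assuming PyMOL is installed and the generated `.pml` script loads the structure it references, the highlights can be viewed with:

```bash
pymol highlight_units_layer24.pml
```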