amirgroup-codes/SpecMER

SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding

SpecMER is a speculative decoding framework for accelerated protein sequence generation using k-mer distributions from multiple sequence alignments (MSAs).
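
To make the idea concrete, here is a minimal sketch of how k-mer probabilities could be used to choose among several draft candidates. It is an illustration under stated assumptions, not the repository's implementation: the function names, the log-probability scoring rule, and the floor value for unseen k-mers are all hypothetical. In standard speculative decoding, the selected draft would then be verified by the larger target model, which is not shown here.

```python
import math

def kmer_score(candidate, kmer_probs, k, floor=1e-6):
    """Sum the log-probabilities of every overlapping k-mer in a candidate.

    `floor` is an assumed probability for k-mers never seen in the MSA.
    """
    total = 0.0
    for i in range(len(candidate) - k + 1):
        total += math.log(kmer_probs.get(candidate[i:i + k], floor))
    return total

def pick_best_candidate(candidates, kmer_probs, k):
    """Return the draft candidate with the highest k-mer score."""
    return max(candidates, key=lambda c: kmer_score(c, kmer_probs, k))

# Toy 3-mer probabilities and draft candidates, made up for illustration.
toy_probs = {"MKV": 0.5, "KVA": 0.25, "VAL": 0.25}
drafts = ["MKVAL", "MKVWW", "WWWWW"]
best = pick_best_candidate(drafts, toy_probs, k=3)
```

With these toy values, the candidate whose 3-mers all appear in the distribution scores highest, because every unseen k-mer contributes the heavily negative floor term.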

Key Components

Core Scripts

kmer.py

K-mer analysis and distribution generation from MSA files.

Key Functions:

  • read_msa_file() - Parse MSA files (FASTA/A2M formats)
  • extract_kmers() - Extract k-mers from protein sequences
  • analyze_kmers() - Generate k-mer frequency statistics
  • create_softmax_distribution() - Convert frequencies to probability distributions
  • create_distributions() - Generate normalized and raw k-mer distributions
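
The pipeline above (extract k-mers, count them, convert to probabilities) can be sketched as follows. This is a simplified stand-in, not the code in kmer.py: the function names are hypothetical, the gap handling ("-" and "." are skipped, letters are uppercased) is an assumption about A2M-style input, and applying the softmax to relative frequencies is one possible reading of "convert frequencies to probability distributions".

```python
import math
from collections import Counter

GAP_CHARS = {"-", "."}  # assumed gap symbols in aligned sequences

def extract_kmers_sketch(sequence, k):
    """Return overlapping k-mers, skipping windows that contain a gap."""
    kmers = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k].upper()
        if not any(ch in GAP_CHARS for ch in window):
            kmers.append(window)
    return kmers

def kmer_counts(sequences, k):
    """Count k-mers across all aligned sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(extract_kmers_sketch(seq, k))
    return counts

def softmax_over_frequencies(counts, temperature=1.0):
    """Numerically stable softmax over relative k-mer frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    scaled = {kmer: (n / total) / temperature for kmer, n in counts.items()}
    peak = max(scaled.values())
    exps = {kmer: math.exp(v - peak) for kmer, v in scaled.items()}
    norm = sum(exps.values())
    return {kmer: e / norm for kmer, e in exps.items()}

# Two toy aligned sequences, made up for illustration.
toy_msa = ["MKV-L", "MKVAL"]
counts = kmer_counts(toy_msa, k=3)
probs = softmax_over_frequencies(counts)
```

Subtracting the maximum before exponentiating keeps the softmax stable; a lower temperature would sharpen the distribution toward the most frequent k-mers.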

protein.py

Protein configuration management for different protein targets.

Features:

  • ProteinConfig dataclass for protein-specific parameters
  • get_protein_config() - Retrieve configurations for supported proteins:
    • GFP (context_length=20, max_tokens=219)
    • RBP1 (context_length=10, max_tokens=42)
    • CBS (context_length=50, max_tokens=501)
    • ADRB2 (context_length=40, max_tokens=373)
    • F7YBW8, GB1, Q59976
  • Integration with MLP evaluation models
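
A configuration lookup of this kind might look like the sketch below. The class and function names are hypothetical stand-ins (suffixed with "sketch" to avoid confusion with the real ProteinConfig and get_protein_config()), and the real dataclass may carry more fields. Only the four proteins whose context_length and max_tokens are listed above are included; F7YBW8, GB1, and Q59976 are left out because their max_tokens values are not given here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProteinConfigSketch:
    """Illustrative subset of protein-specific parameters."""
    name: str
    context_length: int  # number of residues supplied as the prompt
    max_tokens: int      # number of residues to generate

# Values copied from the list above.
_CONFIGS = {
    "GFP": ProteinConfigSketch("GFP", context_length=20, max_tokens=219),
    "RBP1": ProteinConfigSketch("RBP1", context_length=10, max_tokens=42),
    "CBS": ProteinConfigSketch("CBS", context_length=50, max_tokens=501),
    "ADRB2": ProteinConfigSketch("ADRB2", context_length=40, max_tokens=373),
}

def get_protein_config_sketch(name):
    """Look up a configuration by protein name (case-insensitive)."""
    try:
        return _CONFIGS[name.upper()]
    except KeyError:
        raise ValueError(f"Unsupported protein: {name!r}") from None

gfp = get_protein_config_sketch("gfp")
```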

run_kmer_specmer.py

Main experimental runner for SpecMER protein generation.

Key Function:

  • run_kmer_specme_experiment() - Execute speculative decoding experiments

Parameters:

  • protein_name - Target protein (GFP, RBP1, CBS, etc.)
  • num_candidates - Number of draft candidates (1, 2, 3, 5)
  • n_draft_tokens - Draft sequence length (5, 10, 15)
  • temperature - Sampling temperature (0.7, 1.0, 1.4)
  • k_values - K-mer sizes for distributions (e.g., [1,3], [1,3,5])
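
If you want to sweep the example values listed above, one way to enumerate the combinations is sketched below. The build_sweep helper is hypothetical, and the sketch only builds parameter dictionaries; it does not call the experiment runner, whose exact signature is not shown in this README.

```python
from itertools import product

# Example values taken from the parameter list above.
NUM_CANDIDATES = (1, 2, 3, 5)
N_DRAFT_TOKENS = (5, 10, 15)
TEMPERATURES = (0.7, 1.0, 1.4)
K_VALUE_SETS = ([1, 3], [1, 3, 5])

def build_sweep(protein_name):
    """Return one parameter dict per combination of the example values."""
    return [
        {
            "protein_name": protein_name,
            "num_candidates": c,
            "n_draft_tokens": n,
            "temperature": t,
            "k_values": list(k),
        }
        for c, n, t, k in product(
            NUM_CANDIDATES, N_DRAFT_TOKENS, TEMPERATURES, K_VALUE_SETS
        )
    ]

sweep = build_sweep("GFP")
```

Each dictionary could then be mapped onto the command-line flags of run_kmer_specmer.py or passed to the runner directly, depending on how the script is invoked.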

specmer_demo.py

Quick demonstration script for SpecMER functionality.

Default Configuration:

  • Protein: GFP
  • K-values: [1,3]
  • Temperature: 1.0
  • Draft tokens: 5
  • Candidates: 3
  • Generates 3 example sequences

Usage

Quick Demo

python specmer_demo.py

Full Experiment

python run_kmer_specmer.py \
    --protein GFP \
    --num_candidates 5 \
    --n_draft_tokens 10 \
    --temperature 1.0 \
    --k_values 1,3,5 \
    --num_experiments 50

Generate K-mer Distributions

python kmer.py \
    --msa_file data/GFP_AEQVI_full_04-29-2022_b08.a2m \
    --output_dir kmers/kmers_GFP \
    --k_values 1,3,5
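
The --k_values flag takes a comma-separated list such as 1,3,5. A minimal sketch of how such a flag can be parsed with argparse is shown below; it assumes nothing about how kmer.py actually parses its arguments, and the helper names parse_k_values and build_parser are hypothetical.

```python
import argparse

def parse_k_values(text):
    """Turn a string like '1,3,5' into [1, 3, 5]; reject empty or non-positive sizes."""
    values = [int(part) for part in text.split(",") if part.strip()]
    if not values or any(v < 1 for v in values):
        raise argparse.ArgumentTypeError(
            f"expected comma-separated positive integers, got {text!r}"
        )
    return values

def build_parser():
    """Build a parser that mirrors the flags used in the command above."""
    parser = argparse.ArgumentParser(description="K-mer distribution sketch")
    parser.add_argument("--msa_file", required=True)
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--k_values", type=parse_k_values, required=True)
    return parser

args = build_parser().parse_args(
    ["--msa_file", "example.a2m", "--output_dir", "out", "--k_values", "1,3,5"]
)
```

Using a custom type function keeps validation next to the flag definition, so malformed input is reported by argparse as a usage error rather than failing later in the pipeline.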

Directory Structure

  • /analysis/ - Analysis and evaluation scripts
  • /data/ - MSA files and protein sequences
  • /kmers/ - Pre-computed k-mer distributions
  • /plotting/ - Visualization scripts
  • /util/ - Utility functions

Requirements

  • PyTorch
  • BioPython
  • NumPy
  • Pandas
  • ProGen2 models
  • CUDA-compatible GPU (recommended)

Supported Proteins

Protein   Length (aa)   Context length   MSA depth
GFP       238           20               396
RBP1      52            10               135,922
CBS       551           50               19,563
ADRB2     413           40               204,722
F7YBW8    93            10               38,613
GB1       55            10               44
Q59976    501           50               105,913

Output

Experiments generate CSV files containing:

  • Generated protein sequences
  • Acceptance/rejection statistics
  • Generation timing metrics
  • K-mer scoring information

About

Official code repo for SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding
