SpecMER is a speculative decoding framework for accelerated protein sequence generation using k-mer distributions from multiple sequence alignments (MSAs).
K-mer analysis and distribution generation from MSA files.
Key Functions:
- read_msa_file() - Parse MSA files (FASTA/A2M formats)
- extract_kmers() - Extract k-mers from protein sequences
- analyze_kmers() - Generate k-mer frequency statistics
- create_softmax_distribution() - Convert frequencies to probability distributions
- create_distributions() - Generate normalized and raw k-mer distributions
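The pipeline these functions describe can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the signatures, the gap handling (windows containing `-` or `.` are skipped and residues are uppercased), and the choice to apply the softmax to relative frequencies are all assumptions made for the example.

```python
import math
from collections import Counter


def extract_kmers(sequence, k):
    """Return overlapping k-mers, skipping windows that contain a gap character."""
    kmers = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if "-" not in kmer and "." not in kmer:  # assumed gap symbols
            kmers.append(kmer.upper())
    return kmers


def analyze_kmers(sequences, k):
    """Count how often each k-mer occurs across all aligned sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(extract_kmers(seq, k))
    return counts


def create_softmax_distribution(counts, temperature=1.0):
    """Softmax over relative frequencies (an assumed formulation)."""
    total = sum(counts.values())
    freqs = {kmer: n / total for kmer, n in counts.items()}
    max_f = max(freqs.values())  # subtract the max for numerical stability
    exps = {kmer: math.exp((f - max_f) / temperature) for kmer, f in freqs.items()}
    z = sum(exps.values())
    return {kmer: e / z for kmer, e in exps.items()}


# Toy alignment, invented purely for illustration.
toy_msa = ["MKTAYIAK", "MKTA-IAK", "MKSAYIAK"]
counts = analyze_kmers(toy_msa, k=3)
dist = create_softmax_distribution(counts)
```

In this toy alignment the 3-mer `IAK` appears in all three sequences, so it receives the largest probability in `dist`.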
Protein configuration management for different protein targets.
Features:
- ProteinConfig dataclass for protein-specific parameters
- get_protein_config() - Retrieve configurations for supported proteins:
  - GFP (context_length=20, max_tokens=219)
  - RBP1 (context_length=10, max_tokens=42)
  - CBS (context_length=50, max_tokens=501)
  - ADRB2 (context_length=40, max_tokens=373)
  - F7YBW8, GB1, Q59976
- Integration with MLP evaluation models
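A configuration lookup of this kind might look like the sketch below. Only the `context_length` and `max_tokens` values listed above are taken from this document; the `name` field, the registry dictionary, the case-insensitive lookup, and the error handling are assumptions, and the real `ProteinConfig` likely carries more fields (for example, paths to MSA files or evaluation models). Proteins without a listed `max_tokens` are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinConfig:
    name: str            # hypothetical field
    context_length: int  # number of residues given to the model as context
    max_tokens: int      # number of residues to generate


# Values copied from the list above; the registry itself is illustrative.
_CONFIGS = {
    "GFP": ProteinConfig("GFP", context_length=20, max_tokens=219),
    "RBP1": ProteinConfig("RBP1", context_length=10, max_tokens=42),
    "CBS": ProteinConfig("CBS", context_length=50, max_tokens=501),
    "ADRB2": ProteinConfig("ADRB2", context_length=40, max_tokens=373),
}


def get_protein_config(protein_name):
    """Return the config for a supported protein, or raise ValueError."""
    try:
        return _CONFIGS[protein_name.upper()]
    except KeyError:
        supported = ", ".join(sorted(_CONFIGS))
        raise ValueError(
            f"Unknown protein {protein_name!r}; supported: {supported}"
        ) from None


gfp = get_protein_config("gfp")
```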
Main experimental runner for SpecMER protein generation.
Key Function:
- run_kmer_specme_experiment() - Execute speculative decoding experiments
Parameters:
- protein_name - Target protein (GFP, RBP1, CBS, etc.)
- num_candidates - Number of draft candidates (1, 2, 3, 5)
- n_draft_tokens - Draft sequence length (5, 10, 15)
- temperature - Sampling temperature (0.7, 1.0, 1.4)
- k_values - K-mer sizes for distributions (e.g., [1,3], [1,3,5])
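To make the roles of `num_candidates`, `n_draft_tokens`, and `k_values` concrete, here is one plausible way a set of draft candidates could be ranked against k-mer distributions before verification by the target model. The function names, the summed log-probability score, and the probability floor for unseen k-mers are assumptions for illustration; the scoring rule used by the experiment runner may differ.

```python
import math


def kmer_log_score(sequence, distributions, floor=1e-8):
    """Sum log-probabilities of every k-mer in `sequence` for each k.

    `distributions` maps k -> {kmer: probability}; unseen k-mers get `floor`.
    """
    score = 0.0
    for k, dist in distributions.items():
        for i in range(len(sequence) - k + 1):
            score += math.log(dist.get(sequence[i:i + k], floor))
    return score


def select_draft(context, candidates, distributions):
    """Pick the draft whose context + draft string scores highest.

    Scoring the concatenation lets k-mers that span the boundary count.
    """
    return max(
        candidates,
        key=lambda draft: kmer_log_score(context + draft, distributions),
    )


# Toy distributions for k_values = [1, 3]; the numbers are invented.
toy_distributions = {
    1: {"A": 0.5, "K": 0.3, "G": 0.2},
    3: {"AKA": 0.6, "KAK": 0.4},
}

# num_candidates = 3 drafts, each with n_draft_tokens = 3.
best = select_draft("AK", ["AKG", "AKA", "GGG"], toy_distributions)
```

Here `best` is `"AKA"`, the only draft whose 3-mers all appear in the toy distribution.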
Quick demonstration script for SpecMER functionality.
Default Configuration:
- Protein: GFP
- K-values: [1,3]
- Temperature: 1.0
- Draft tokens: 5
- Candidates: 3
- Generates 3 example sequences
python specmer_demo.py

python run_kmer_specmer.py \
--protein GFP \
--num_candidates 5 \
--n_draft_tokens 10 \
--temperature 1.0 \
--k_values 1,3,5 \
--num_experiments 50

python kmer.py \
--msa_file data/GFP_AEQVI_full_04-29-2022_b08.a2m \
--output_dir kmers/kmers_GFP \
--k_values 1,3,5

- /analysis/ - Analysis and evaluation scripts
- /data/ - MSA files and protein sequences
- /kmers/ - Pre-computed k-mer distributions
- /plotting/ - Visualization scripts
- /util/ - Utility functions
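For reference, the flags passed to `run_kmer_specmer.py` in the command above could be parsed along these lines. This is a sketch only: the flag names come from that command, but the defaults (borrowed from the demo configuration, plus an arbitrary `--num_experiments` default) and the comma-splitting helper are assumptions, not the script's actual argument handling.

```python
import argparse


def parse_k_values(text):
    """Turn a comma-separated string such as '1,3,5' into [1, 3, 5]."""
    return [int(part) for part in text.split(",") if part.strip()]


def build_parser():
    parser = argparse.ArgumentParser(description="SpecMER experiment runner (sketch)")
    # Defaults below are assumed, not taken from the real script.
    parser.add_argument("--protein", default="GFP")
    parser.add_argument("--num_candidates", type=int, default=3)
    parser.add_argument("--n_draft_tokens", type=int, default=5)
    parser.add_argument("--temperature", type=float, default=1.0)
    parser.add_argument("--k_values", type=parse_k_values, default=[1, 3])
    parser.add_argument("--num_experiments", type=int, default=50)
    return parser


# Parse the same arguments used in the example command.
args = build_parser().parse_args([
    "--protein", "GFP",
    "--num_candidates", "5",
    "--n_draft_tokens", "10",
    "--temperature", "1.0",
    "--k_values", "1,3,5",
    "--num_experiments", "50",
])
```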
- PyTorch
- BioPython
- NumPy
- Pandas
- ProGen2 models
- CUDA-compatible GPU (recommended)
| Protein | Length | Context | MSA Depth |
|---|---|---|---|
| GFP | 238 aa | 20 | 396 |
| RBP1 | 52 aa | 10 | 135,922 |
| CBS | 551 aa | 50 | 19,563 |
| ADRB2 | 413 aa | 40 | 204,722 |
| F7YBW8 | 93 aa | 10 | 38,613 |
| GB1 | 55 aa | 10 | 44 |
| Q59976 | 501 aa | 50 | 105,913 |
Experiments generate CSV files containing:
- Generated protein sequences
- Acceptance/rejection statistics
- Generation timing metrics
- K-mer scoring information
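A result file of this kind could be summarized as in the sketch below. The column names (`sequence`, `accepted_tokens`, `rejected_tokens`, `generation_time_s`) are hypothetical, and the two rows are made-up placeholders rather than real results; check the header of an actual output CSV before adapting this. The standard-library `csv` module is used to keep the example self-contained, though Pandas would work equally well.

```python
import csv
import io

# Made-up placeholder rows; real column names and values will differ.
toy_csv = io.StringIO(
    "sequence,accepted_tokens,rejected_tokens,generation_time_s\n"
    "MKTAYIAKQR,8,2,0.50\n"
    "MKSAYIAKQR,6,4,0.50\n"
)


def summarize(rows):
    """Aggregate the acceptance rate and throughput over all rows."""
    accepted = sum(int(r["accepted_tokens"]) for r in rows)
    rejected = sum(int(r["rejected_tokens"]) for r in rows)
    seconds = sum(float(r["generation_time_s"]) for r in rows)
    tokens = sum(len(r["sequence"]) for r in rows)
    return {
        "acceptance_rate": accepted / (accepted + rejected),
        "tokens_per_second": tokens / seconds,
    }


rows = list(csv.DictReader(toy_csv))
summary = summarize(rows)
```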