News 🔥
- [2025/10] CSV-Decode: Novel approach for efficient LLM inference using geometric upper bounds
- [2025/09] 2-4.95x speedup over autoregressive baseline with provable correctness guarantees
- [2025/09] compatibility with LLM architecture without retraining
TL; DR: we introduce CSV-Decode (Certifiable Sub-Vocabulary Decoding) to address the output layer bottleneck in Large Language Models (LLMs). CSV-Decode uses geometric upper bounds to construct small sub-vocabularies for each decoding step, enabling efficient sparse computation while maintaining dual correctness guarantees. In summary, our CSV-Decode is:
- 🔥 2.67-4.95x speedup over autoregressive baseline across multiple models and tasks
- 99.3% quality retention with <2% fallback rates to full vocabulary computation
- 52% energy reduction compared to baseline inference
- Provably correct: Exact top-k certification and ε-certified softmax approximations
- Training-free: Works with any pre-trained LLM without retraining
- Universal compatibility: Supports all major LLM architectures (Llama, CodeLlama, Qwen, DeepSeek, etc.)
- Advanced optimizations: Sparse GEMV kernels, multi-GPU sharding, and CUDA Graph integration
CSV-Decode addresses the fundamental bottleneck in LLM inference: the expensive output layer computation over large vocabularies. Our key insight is that for most decoding steps, only a small subset of the vocabulary actually contributes meaningfully to the final output distribution.
CSV-Decode leverages geometric reasoning to identify which tokens can be safely omitted from computation:
- Vocabulary Clustering: Offline K-means clustering groups semantically similar tokens
- Cauchy-Schwarz Bounds: Derives tight upper bounds on logits for entire clusters
- Certification Mechanisms: Provides exact top-k certification and ε-certified softmax approximations
- Adaptive Algorithm: Dynamically expands sub-vocabularies based on certification criteria
For any token i in cluster c, we bound the logit using Cauchy-Schwarz inequality:
ℓi(t) = ⟨Wi, ht⟩ + bi ≤ ⟨μc, ht⟩ + Rc∥ht∥2 + max_j∈c bj
Where:
μc: Cluster centroidRc: Cluster radiusht: Hidden state at step t
This enables us to skip entire clusters without computing individual token logits, while maintaining provable correctness guarantees.
Follow the instructions below to prepare for reproducing the results in the paper.
- experimental environment:
sh install.shwill install the necessary packages in the project. - code changes: changes the code
src/util.pyline 31-38 and line 49, to fill in your model paths and data paths.
All the running scripts, including scripts for auto-regress decoding, vanilla speculative decoding, parallel speculative decoding, comparison, ablation studies and case studies. These scripts can be directly executed for reproduction.
sh scripts/run_para_sd.shYou can try CSV-Decode with the following commands:
Basic CSV-Decode evaluation:
python benchmark/eval_csv_decode.py \
--eval_mode csv_decode \
--use_csv_decode \
--csv_num_clusters 2000 \
--csv_epsilon 0.05 \
--draft_model codellama-7b \
--target_model codellama-70bAutomated CSV-Decode evaluation script:
sh scripts/run_csv_decode.sh \
--model codellama-7b \
--target_model codellama-70b \
--num_samples 5 \
--csv_num_clusters 2000 \
--csv_epsilon 0.05CSV-Decode with custom parameters:
python benchmark/eval_csv_decode.py \
--eval_mode csv_decode \
--use_csv_decode \
--csv_num_clusters 3000 \
--csv_epsilon 0.01 \
--csv_top_k 20 \
--embedding_dim 4096 \
--draft_model codellama-7b \
--target_model codellama-70bWe have provided a suggested web interface, which you can use by running the following command.
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 2 applications.py --eval_mode csv_decode --use_csv_decode -n 1 -e applications --draft_model codellama-7b --target_model codellama-70b --max_tokens 1024 --temp 0--use_csv_decode: Enable CSV-Decode acceleration--csv_num_clusters: Number of clusters (default: 0.015 × vocab_size)--csv_epsilon: Softmax approximation tolerance (default: 0.05)--csv_top_k: Top-k for certification (default: 10)--embedding_dim: Embedding dimension (default: 4096)
- Cluster Count: Higher counts provide tighter bounds but increase overhead
- Epsilon: Lower values provide better approximation but may increase fallback rates
- Top-K: Higher values improve certification but may reduce speedup
CSV-Decode has been tested on:
- Llama-2/3 (7B, 70B)
- CodeLlama (7B, 13B, 34B, 70B)
- Qwen2.5 (7B, 14B, 72B)
- DeepSeek-Coder (1.3B, 6.7B, 33B)
- Mistral-7B
- Yi-34B
- AttributeError: 'list' object has no attribute 'get_seq_length'.
In latest transformers (version >= 4.49.0), all the past_key_values are class of DynamicCache instead of tuple. Hence you should change the error line of code from past_key_values[0][0].shape[2] to past_key_values.get_seq_length(). We have fixed some bugs within the code. If you find any bug, feel free to raise an issue.
- Unexpected generations, such as meaningless text.
This issue may be directly caused due to precision overflow. You can add .to(torch.float32) to solve this issue. (Such as Line 187 of src/engine.py)
- CSV-Decode: CUDA Out of Memory.
This issue may occur with large vocabulary sizes. Try reducing --csv_num_clusters or --max_sub_vocab_size in the configuration.
- CSV-Decode: Low Certification Rate.
If certification rates are low, try increasing --csv_num_clusters or adjusting --csv_epsilon to a higher value (e.g., 0.1).
- CSV-Decode: High Fallback Rate.
High fallback rates indicate loose bounds. Check cluster quality or reduce --csv_top_k for better performance.
- CSV-Decode Performance Optimization.
For optimal performance:
- Ensure good vocabulary clustering with appropriate cluster count
- Use optimized sparse GEMV kernels for your hardware
- Process multiple sequences together for better GPU utilization
If you find our work useful your research, please cite our paper:
@article{liu2025csvdecode,
title={CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference},
author={Liu, Dong and Yu, Yanxuan and Lengerich, Ben},
year={2025}
}
This codebase is built upon the PEARL framework. We gratefully acknowledge the PEARL team for providing the foundation that made this CSV-Decode implementation possible.