CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference

News 🔥

[2025/10] CSV-Decode: Novel approach for efficient LLM inference using geometric upper bounds
[2025/09] 2-4.95x speedup over autoregressive baseline with provable correctness guarantees
[2025/09] Compatible with existing LLM architectures without retraining

TL;DR: We introduce CSV-Decode (Certifiable Sub-Vocabulary Decoding) to address the output-layer bottleneck in Large Language Models (LLMs). CSV-Decode uses geometric upper bounds to construct a small sub-vocabulary at each decoding step, enabling efficient sparse computation while maintaining two correctness guarantees: exact top-k certification and ε-certified softmax approximation. In summary, CSV-Decode offers:

🔥 2.67-4.95x speedup over autoregressive baseline across multiple models and tasks

99.3% quality retention with <2% fallback rates to full vocabulary computation

52% energy reduction compared to baseline inference

Provably correct: Exact top-k certification and ε-certified softmax approximations

Training-free: Works with any pre-trained LLM without retraining

Universal compatibility: Supports all major LLM architectures (Llama, CodeLlama, Qwen, DeepSeek, etc.)

Advanced optimizations: Sparse GEMV kernels, multi-GPU sharding, and CUDA Graph integration

Overview of CSV-Decode

CSV-Decode addresses the fundamental bottleneck in LLM inference: the expensive output layer computation over large vocabularies. Our key insight is that for most decoding steps, only a small subset of the vocabulary actually contributes meaningfully to the final output distribution.

Core Innovation: Geometric Upper Bounds

CSV-Decode leverages geometric reasoning to identify which tokens can be safely omitted from computation:

Vocabulary Clustering: Offline K-means clustering groups semantically similar tokens
Cauchy-Schwarz Bounds: Derives tight upper bounds on logits for entire clusters
Certification Mechanisms: Provides exact top-k certification and ε-certified softmax approximations
Adaptive Algorithm: Dynamically expands sub-vocabularies based on certification criteria
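
The adaptive expansion and exact top-k certification described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `certified_top_k`, its arguments, and the toy numbers are assumptions. It takes per-cluster upper bounds as given, scores clusters in descending bound order, and stops once the k-th best exact logit is at least as large as every remaining bound, at which point no unscored token can enter the top-k.

```python
import heapq

def certified_top_k(upper_bounds, cluster_logits, k):
    """Score clusters in descending bound order until the top-k is certified.

    upper_bounds[c]   : upper bound on every logit in cluster c
    cluster_logits(c) : returns {token_id: exact_logit} for cluster c
    Returns (top-k list of (logit, token_id), number of clusters scored).
    """
    order = sorted(range(len(upper_bounds)),
                   key=lambda c: upper_bounds[c], reverse=True)
    top = []      # min-heap of the k best (logit, token_id) pairs seen so far
    visited = 0
    for c in order:
        visited += 1
        for token_id, logit in cluster_logits(c).items():
            if len(top) < k:
                heapq.heappush(top, (logit, token_id))
            elif logit > top[0][0]:
                heapq.heapreplace(top, (logit, token_id))
        # Largest bound among clusters not scored yet (-inf if none remain).
        remaining = (upper_bounds[order[visited]]
                     if visited < len(order) else float("-inf"))
        # Certified: the k-th best exact logit beats every unscored cluster.
        if len(top) == k and top[0][0] >= remaining:
            break
    return sorted(top, reverse=True), visited

# Toy data (hypothetical): four clusters of two tokens each.
logits = {
    0: {10: 5.0, 11: 4.2},
    1: {20: 3.9, 21: 1.0},
    2: {30: 0.5, 31: -2.0},
    3: {40: -1.0, 41: -3.5},
}
bounds = [5.5, 4.5, 1.2, -0.4]  # each value upper-bounds its cluster's logits

top2, visited = certified_top_k(bounds, logits.__getitem__, k=2)
print(top2, visited)
```

In this toy run only two of the four clusters are scored before the top-2 is certified; the other two are skipped because their bounds are below the current second-best logit.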

Mathematical Foundation

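For any token i in cluster c, the Cauchy-Schwarz inequality bounds the logit:

$$\ell_i(t) = \langle W_i, h_t \rangle + b_i \le \langle \mu_c, h_t \rangle + R_c \lVert h_t \rVert_2 + \max_{j \in c} b_j$$

Where:

$W_i$, $b_i$: Output-layer weight row and bias of token $i$
$\mu_c$: Cluster centroid
$R_c$: Cluster radius
$h_t$: Hidden state at step $t$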

This enables us to skip entire clusters without computing individual token logits, while maintaining provable correctness guarantees.
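
The bound can be checked numerically with a small pure-Python sketch. The helper names (`cluster_stats`, `cluster_upper_bound`), the random data, and the fixed cluster assignment are illustrative assumptions; in particular, the radius is taken to be the largest Euclidean distance from the centroid to any weight row in the cluster, which is the condition the Cauchy-Schwarz step needs. A real implementation would use batched tensor operations instead of Python lists.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cluster_stats(W, b, members):
    """Return (centroid, radius, max_bias) for the token ids in `members`."""
    dim = len(W[members[0]])
    mu = [sum(W[i][d] for i in members) / len(members) for d in range(dim)]
    # Assumed radius: max Euclidean distance from the centroid to a member row.
    radius = max(norm([w - m for w, m in zip(W[i], mu)]) for i in members)
    return mu, radius, max(b[i] for i in members)

def cluster_upper_bound(mu, radius, max_bias, h):
    """<mu_c, h_t> + R_c * ||h_t||_2 + max_j b_j."""
    return dot(mu, h) + radius * norm(h) + max_bias

random.seed(0)
dim, vocab = 8, 12
W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
b = [random.gauss(0, 0.1) for _ in range(vocab)]
h = [random.gauss(0, 1) for _ in range(dim)]
clusters = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]  # fixed toy assignment

checks = []
for members in clusters:
    mu, radius, max_bias = cluster_stats(W, b, members)
    bound = cluster_upper_bound(mu, radius, max_bias, h)
    exact_max = max(dot(W[i], h) + b[i] for i in members)
    checks.append((bound, exact_max))
    print(f"bound={bound:.3f}  exact max logit={exact_max:.3f}")
```

The statistics in `cluster_stats` depend only on the weights, so they can be computed once offline; per step, each cluster costs one dot product with the centroid plus one shared norm of the hidden state. The toy clusters here are arbitrary, so the bounds are loose; grouping similar rows together shrinks the radius and tightens them.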

Preparation

Follow the instructions below to prepare for reproducing the results in the paper.

Experimental environment: run sh install.sh to install the packages the project needs.
Code changes: edit lines 31-38 and line 49 of src/util.py to fill in your model paths and data paths.

Reproduction

All run scripts are provided, including scripts for autoregressive decoding, vanilla speculative decoding, parallel speculative decoding, comparison, ablation studies, and case studies. They can be executed directly to reproduce the results, for example:

sh scripts/run_para_sd.sh

Examples

You can try CSV-Decode with the following commands:

Basic CSV-Decode evaluation:

python benchmark/eval_csv_decode.py \
    --eval_mode csv_decode \
    --use_csv_decode \
    --csv_num_clusters 2000 \
    --csv_epsilon 0.05 \
    --draft_model codellama-7b \
    --target_model codellama-70b

Automated CSV-Decode evaluation script:

sh scripts/run_csv_decode.sh \
    --model codellama-7b \
    --target_model codellama-70b \
    --num_samples 5 \
    --csv_num_clusters 2000 \
    --csv_epsilon 0.05

CSV-Decode with custom parameters:

python benchmark/eval_csv_decode.py \
    --eval_mode csv_decode \
    --use_csv_decode \
    --csv_num_clusters 3000 \
    --csv_epsilon 0.01 \
    --csv_top_k 20 \
    --embedding_dim 4096 \
    --draft_model codellama-7b \
    --target_model codellama-70b

With UI

We have provided a suggested web interface, which you can use by running the following command.

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 2 applications.py \
    --eval_mode csv_decode \
    --use_csv_decode \
    -n 1 \
    -e applications \
    --draft_model codellama-7b \
    --target_model codellama-70b \
    --max_tokens 1024 \
    --temp 0

CSV-Decode Configuration

Key Parameters

--use_csv_decode: Enable CSV-Decode acceleration
--csv_num_clusters: Number of clusters (default: 0.015 × vocab_size)
--csv_epsilon: Softmax approximation tolerance (default: 0.05)
--csv_top_k: Top-k for certification (default: 10)
--embedding_dim: Embedding dimension (default: 4096)
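
As a quick arithmetic check of the default cluster count, the sketch below applies the 0.015 × vocab_size rule to two illustrative vocabulary sizes. The helper name `default_num_clusters` is hypothetical, and rounding to the nearest integer is an assumption; the repository may round differently.

```python
def default_num_clusters(vocab_size, ratio=0.015):
    """Default cluster count as a fixed fraction of the vocabulary size."""
    # Rounding to the nearest integer (and at least 1) is an assumption.
    return max(1, round(ratio * vocab_size))

for vocab_size in (32_000, 128_000):
    print(vocab_size, default_num_clusters(vocab_size))
```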

Performance Tuning

Cluster Count: Higher counts provide tighter bounds but increase overhead
Epsilon: Lower values provide better approximation but may increase fallback rates
Top-K: Higher values improve certification but may reduce speedup
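
To illustrate the epsilon trade-off, here is a hedged sketch of one way an ε-certificate for the softmax could work; the exact criterion used by CSV-Decode may differ. It assumes that every unscored token's logit is at most its cluster's upper bound, so the probability mass left outside the sub-vocabulary is at most Z_out / (Z_in + Z_out), where Z_in sums the exponentiated exact logits and Z_out sums cluster size times the exponentiated bound. The function name `clusters_needed` and the toy numbers are illustrative.

```python
import math

def clusters_needed(upper_bounds, cluster_logits, epsilon):
    """Count clusters (in descending bound order) that must be scored before
    the omitted softmax mass is provably at most epsilon."""
    order = sorted(range(len(upper_bounds)),
                   key=lambda c: upper_bounds[c], reverse=True)
    shift = max(upper_bounds)  # subtract before exp() for numerical stability
    z_in = 0.0
    for visited, c in enumerate(order, start=1):
        z_in += sum(math.exp(l - shift) for l in cluster_logits[c])
        # Upper bound on the unnormalised mass of all unscored tokens.
        z_out_ub = sum(len(cluster_logits[r]) * math.exp(upper_bounds[r] - shift)
                       for r in order[visited:])
        if z_out_ub / (z_in + z_out_ub) <= epsilon:
            return visited
    return len(order)

# Toy data (hypothetical): exact logits per cluster and valid upper bounds.
cluster_logits = [[5.0, 4.2], [3.9, 1.0], [0.5, -2.0], [-1.0, -3.5]]
bounds = [5.5, 4.5, 1.2, -0.4]

loose = clusters_needed(bounds, cluster_logits, epsilon=0.05)
tight = clusters_needed(bounds, cluster_logits, epsilon=0.01)
print(loose, tight)
```

With these numbers, tightening epsilon from 0.05 to 0.01 forces one more cluster to be scored before the certificate passes, which is the mechanism behind the extra work at lower epsilon.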

Supported Models

CSV-Decode has been tested on:

Llama-2/3 (7B, 70B)
CodeLlama (7B, 13B, 34B, 70B)
Qwen2.5 (7B, 14B, 72B)
DeepSeek-Coder (1.3B, 6.7B, 33B)
Mistral-7B
Yi-34B

FAQ

AttributeError: 'list' object has no attribute 'get_seq_length'.

In recent versions of transformers (>= 4.49.0), past_key_values is a DynamicCache object rather than a tuple. Change the failing line from past_key_values[0][0].shape[2] to past_key_values.get_seq_length(). We have fixed several such bugs in the code; if you find another, feel free to open an issue.
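
If the same code has to run against both older and newer transformers versions, a small compatibility helper is one option. The sketch below is an assumption, not code from this repository: the name `cache_seq_length` is hypothetical, and the demo uses stand-in objects so that it runs without transformers installed.

```python
from types import SimpleNamespace

def cache_seq_length(past_key_values):
    """Return the cached sequence length for either cache representation."""
    if past_key_values is None:
        return 0
    if hasattr(past_key_values, "get_seq_length"):
        # Cache objects such as DynamicCache expose the length directly.
        return past_key_values.get_seq_length()
    # Legacy format: per-layer (key, value) tensors shaped
    # (batch, num_heads, seq_len, head_dim).
    return past_key_values[0][0].shape[2]

# Stand-ins for the two formats (not real tensors or caches).
legacy_cache = ((SimpleNamespace(shape=(1, 32, 7, 128)), None),)
modern_cache = SimpleNamespace(get_seq_length=lambda: 5)

legacy_len = cache_seq_length(legacy_cache)
modern_len = cache_seq_length(modern_cache)
print(legacy_len, modern_len)
```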

Unexpected generations, such as meaningless text.

This issue is likely caused by numerical overflow in reduced precision. Casting the affected tensor with .to(torch.float32) may resolve it (for example, at line 187 of src/engine.py).

CSV-Decode: CUDA Out of Memory.

This issue may occur with large vocabulary sizes. Try reducing --csv_num_clusters or --max_sub_vocab_size in the configuration.

CSV-Decode: Low Certification Rate.

If certification rates are low, try increasing --csv_num_clusters or adjusting --csv_epsilon to a higher value (e.g., 0.1).

CSV-Decode: High Fallback Rate.

High fallback rates indicate loose bounds. Check cluster quality or reduce --csv_top_k for better performance.

CSV-Decode Performance Optimization.

For optimal performance:

Ensure good vocabulary clustering with appropriate cluster count
Use optimized sparse GEMV kernels for your hardware
Process multiple sequences together for better GPU utilization

Citation

If you find our work useful in your research, please cite our paper:

@article{liu2025csvdecode,
  title={CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference},
  author={Liu, Dong and Yu, Yanxuan and Lengerich, Ben},
  year={2025}
}

Acknowledgments

This codebase is built upon the PEARL framework. We gratefully acknowledge the PEARL team for providing the foundation that made this CSV-Decode implementation possible.
