BitDecoding is a high-performance, GPU-optimized system
designed to accelerate decoding of long-context LLMs with a low-bit KV
cache. It achieves a 3-9x speedup over FlashAttention-v2.
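To make the low-bit KV cache idea concrete, here is a minimal, framework-free sketch of 4-bit asymmetric quantization applied to one group of cached values. This is illustrative only: the function names and group size are hypothetical, and BitDecoding's actual fused Tensor Core kernels work very differently.

```python
def quantize_4bit(group):
    """Asymmetric 4-bit quantization of one group of floats.

    Maps values in [min, max] onto integer levels 0..15, returning
    the quantized codes plus the (scale, zero_point) metadata needed
    to dequantize. Hypothetical helper, not BitDecoding's API.
    """
    lo, hi = min(group), max(group)
    scale = (hi - lo) / 15 or 1.0  # avoid div-by-zero for constant groups
    codes = [round((x - lo) / scale) for x in group]
    return codes, scale, lo

def dequantize_4bit(codes, scale, zero_point):
    """Reconstruct approximate float values from 4-bit codes."""
    return [c * scale + zero_point for c in codes]

# Quantize a mock group of KV-cache entries and check the round trip.
kv = [0.0, 0.5, 1.0, 1.5, -1.0, 2.0, 0.25, -0.5]
codes, scale, zero = quantize_4bit(kv)
restored = dequantize_4bit(codes, scale, zero)
max_err = max(abs(a - b) for a, b in zip(kv, restored))
assert all(0 <= c <= 15 for c in codes)       # codes fit in 4 bits
assert max_err <= scale / 2 + 1e-9            # error bounded by half a step
```

Storing 4-bit codes plus per-group metadata cuts KV-cache memory roughly 4x versus fp16, which is what makes long-context decoding bandwidth-friendly.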

- [2025.11] 🔥 BitDecoding has been accepted to HPCA 2025!
```bash
git clone --recursive https://github.com/DD-DuDa/BitDecoding.git
cd BitDecoding
conda create -n bitdecode python=3.10
conda activate bitdecode
pip install -r requirements.txt
python setup.py install
```
- See `benchmark/bench_single_decode.ipynb`
- (Optional) Play with the libtorch C++ interface:

```bash
# download libtorch first
cd BitDecoding/csrc/bit_decode
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=<libtorch_path> ..
make -j12
```

- For an end-to-end inference example, please see `e2e`
If you find BitDecoding useful or want to use it in your projects, please kindly cite our paper:
```bibtex
@misc{du2025bitdecodingunlockingtensorcores,
      title={BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache},
      author={Dayou Du and Shijie Cao and Jianyi Cheng and Luo Mai and Ting Cao and Mao Yang},
      year={2025},
      eprint={2503.18773},
      archivePrefix={arXiv},
      primaryClass={cs.AR},
      url={https://arxiv.org/abs/2503.18773},
}
```
BitDecoding is inspired by many open-source libraries, including (but not limited to) flash-attention, flute, Atom, omniserve, and KIVI.

