GitHub - HabanaAI/vllm-fork: A high-throughput and memory-efficient inference and serving engine for LLMs

Easy, fast, and cheap LLM serving for everyone

Caution

Starting from v1.23.0, the vLLM fork will reach end-of-life (EOL) and be deprecated in v1.24.0, remaining functional only for legacy use cases until then. At the same time, the vllm-gaudi plugin will be production-ready in v1.23.0 and will become the default by v1.24.0. This plugin integrates Intel Gaudi with vLLM for optimized LLM inference and is intended for future deployments. We strongly suggest preparing a migration path toward the plugin version: https://github.com/vllm-project/vllm-gaudi.

Note

For Intel Gaudi specific setup instructions and examples, please refer Intel® Gaudi® README. For jupyter notebook based quickstart tutorials refer Getting Started with vLLM and Understanding vLLM on Gaudi.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
Speculative decoding
Chunked prefill

Performance benchmark: We include a performance benchmark at the end of our blog post. It compares the performance of vLLM against other LLM serving engines (TensorRT-LLM, SGLang and LMDeploy). The implementation is under nightly-benchmarks folder and you can reproduce this benchmark using our one-click runnable script.

vLLM is flexible and easy to use with:

Seamless integration with popular Hugging Face models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor parallelism and pipeline parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server
Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron
Prefix caching support
Multi-LoRA support

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

Transformer-like LLMs (e.g., Llama)
Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
Embedding Models (e.g., E5-Mistral)
Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models here.

Getting Started

Install vLLM with pip or from source:

pip install vllm

Visit our documentation to learn more.

Contributing

We welcome and value any contributions and collaborations. Please check out Contributing to vLLM for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

Contact Us

For technical questions and feature requests, please use GitHub Issues or Discussions
For discussing with fellow users, please use the vLLM Forum
coordinating contributions and development, please use Slack
For security disclosures, please use GitHub's Security Advisories feature
For collaborations and partnerships, please contact us at [email protected]

Media Kit

If you wish to use vLLM's logo, please refer to our media kit repo

Name		Name	Last commit message	Last commit date
Latest commit History 8,289 Commits
.buildkite		.buildkite
.cd		.cd
.github		.github
.jenkins		.jenkins
benchmarks		benchmarks
cmake		cmake
csrc		csrc
docker		docker
docs		docs
examples		examples
pd_xpyd		pd_xpyd
requirements		requirements
tests		tests
tools		tools
vllm		vllm
.clang-format		.clang-format
.dockerignore		.dockerignore
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yaml		.readthedocs.yaml
.shellcheckrc		.shellcheckrc
.yapfignore		.yapfignore
CMakeLists.txt		CMakeLists.txt
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
DCO		DCO
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
README_GAUDI.md		README_GAUDI.md
RELEASE.md		RELEASE.md
SECURITY.md		SECURITY.md
find_cuda_init.py		find_cuda_init.py
format.sh		format.sh
mkdocs.yaml		mkdocs.yaml
pyproject.toml		pyproject.toml
requirements-hpu.txt		requirements-hpu.txt
setup.py		setup.py
use_existing_torch.py		use_existing_torch.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Easy, fast, and cheap LLM serving for everyone

About

Getting Started

Contributing

Sponsors

Citation

Contact Us

Media Kit

About

Uh oh!

Releases 14

Packages

Uh oh!

Languages

License

HabanaAI/vllm-fork

Folders and files

Latest commit

History

Repository files navigation

Easy, fast, and cheap LLM serving for everyone

About

Getting Started

Contributing

Sponsors

Citation

Contact Us

Media Kit

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 14

Packages 0

Uh oh!

Languages

Packages