| Intel® Gaudi® README | Documentation | Blog | Paper | Twitter/X | User Forum | Developer Slack |
Caution
Starting from v1.23.0, this vLLM fork is deprecated and will reach end-of-life (EOL) in v1.24.0, remaining functional only for legacy use cases until then. At the same time, the vllm-gaudi plugin will be production-ready in v1.23.0 and will become the default by v1.24.0. This plugin integrates Intel Gaudi with vLLM for optimized LLM inference and is intended for future deployments. We strongly recommend preparing a migration path toward the plugin version: https://github.com/vllm-project/vllm-gaudi.
Note
For Intel Gaudi specific setup instructions and examples, please refer to the Intel® Gaudi® README. For Jupyter notebook based quickstart tutorials, refer to Getting Started with vLLM and Understanding vLLM on Gaudi.
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8 (see the sketch after this list)
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
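As a rough illustration of the quantization support listed above, the following minimal sketch loads an AWQ-quantized checkpoint through vLLM's offline LLM API. The model name is only an illustrative placeholder, and whether a given quantization method is available depends on your hardware backend.

from vllm import LLM

# Placeholder AWQ-quantized checkpoint; substitute any AWQ model you have access to.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

# generate() falls back to default sampling parameters when none are given.
print(llm.generate(["The capital of France is"])[0].outputs[0].text)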
Performance benchmark: We include a performance benchmark at the end of our blog post. It compares the performance of vLLM against other LLM serving engines (TensorRT-LLM, SGLang, and LMDeploy). The implementation is under the nightly-benchmarks folder, and you can reproduce this benchmark using our one-click runnable script.
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server (see the sketch after this list)
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPUs, and AWS Neuron
- Prefix caching support
- Multi-LoRA support
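To make the OpenAI-compatible API server item above concrete, here is a minimal sketch that queries a running server with the official openai Python client. The model name is a placeholder assumption; the server's default port is 8000, and any API key string is accepted unless one is configured.

# Start the server in a shell first (model name is an example):
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct
from openai import OpenAI

# No real API key is required by default; any non-empty string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Hello, vLLM!"}],
)
print(response.choices[0].message.content)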
vLLM seamlessly supports most popular open-source models on Hugging Face, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)
- Embedding models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
Find the full list of supported models here.
Install vLLM with pip or from source:
pip install vllm
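After installation, a minimal offline-inference sketch looks like the following; the model name is just an example, and any supported Hugging Face model ID can be substituted.

from vllm import LLM, SamplingParams

# Small example model; replace with any supported Hugging Face model ID.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)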
Visit our documentation to learn more.
We welcome and value any contributions and collaborations. Please check out Contributing to vLLM for how to get involved.
vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!
Cash Donations:
- a16z
- Dropbox
- Sequoia Capital
- Skywork AI
- ZhenFund
Compute Resources:
- AMD
- Anyscale
- AWS
- Crusoe Cloud
- Databricks
- DeepInfra
- Google Cloud
- Intel
- Lambda Lab
- Nebius
- Novita AI
- NVIDIA
- Replicate
- Roblox
- RunPod
- Trainy
- UC Berkeley
- UC San Diego
Slack Sponsor: Anyscale
We also have an official fundraising venue through OpenCollective. We plan to use the fund to support the development, maintenance, and adoption of vLLM.
If you use vLLM for your research, please cite our paper:
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
- For technical questions and feature requests, please use GitHub Issues or Discussions
- For discussing with fellow users, please use the vLLM Forum
- For coordinating contributions and development, please use Slack
- For security disclosures, please use GitHub's Security Advisories feature
- For collaborations and partnerships, please contact us at [email protected]
- If you wish to use vLLM's logo, please refer to our media kit repo