Large Model High-Concurrency Deployment Investigate and Discuss #12113

xueshuai0922 · 2025-01-16T09:34:16Z

Large Model High-Concurrency Deployment Investigate and Discuss

Overview

Date: January 16, 2025, Thursday 17:31.
Document Version: V1.0
Author: Shuai Xue

Prerequisites:

A 70B model with a concurrency requirement of 200.
The referenced benchmark of LLM inference backends reports that a single NVIDIA A100 (80GB) can serve 100 concurrent requests for a 70B model with 4-bit quantization.
Reference link: Benchmarking LLM Inference Backends

Key Points:

VRAM Requirement Analysis:
- With 4-bit quantization (about 0.5 bytes per parameter), loading the weights of a single 70B model requires approximately 35GB VRAM. At FP8 (1 byte per parameter), the weights alone would need about 70GB.
- To allow for intermediate variables and caches during inference, it's recommended to reserve 1.5 times that amount, i.e., about 52.5GB per model.
- KV cache: each concurrent session requires approximately 0.5GB VRAM.
- Total VRAM needed for 200 concurrent sessions: 52.5GB + (0.5GB × 200) = 152.5GB.
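The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch, not a measurement: the function name `estimate_vram_gb` and its parameters are hypothetical, 0.5 bytes per parameter corresponds to 4-bit weights (which is what yields 35GB for 70B parameters), and the 1.5x overhead factor and 0.5GB per session are the figures assumed in this post.

```python
def estimate_vram_gb(params_billion, bytes_per_param, overhead_factor,
                     kv_gb_per_session, sessions):
    """Rough VRAM estimate in GB (hypothetical helper, planning figures only)."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    weights_gb = params_billion * bytes_per_param
    # Headroom for intermediate variables and caches (assumed factor)
    reserved_gb = weights_gb * overhead_factor
    # KV cache grows linearly with the number of concurrent sessions (assumed)
    kv_cache_gb = kv_gb_per_session * sessions
    return weights_gb, reserved_gb, reserved_gb + kv_cache_gb


weights, reserved, total = estimate_vram_gb(
    params_billion=70, bytes_per_param=0.5, overhead_factor=1.5,
    kv_gb_per_session=0.5, sessions=200,
)
print(f"weights={weights}GB reserved={reserved}GB total={total}GB")
```

Changing `bytes_per_param` to 1.0 (FP8) or 2.0 (FP16) shows how quickly the weight footprint alone outgrows a single 80GB card.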
Hardware Configuration Recommendations:
- Minimum of 2 cards (2 × 80GB = 160GB, just above the estimated total); a 3-card server is recommended for headroom.
- GPU selection: NVIDIA H100 (80GB VRAM/card) or A100 (80GB VRAM/card).
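One way to read the "minimum 2, recommended 3" guidance is as a ceiling division of the estimated total VRAM by the usable VRAM per card. The sketch below is illustrative only: `gpus_needed` and `usable_fraction` are hypothetical names, the 152.5GB total is 52.5GB + 0.5GB × 200 from the estimate in this post, and the 0.9 usable fraction is an assumed safety margin. It also treats VRAM as if it were pooled across cards, which is a simplification: running one full model copy per instance duplicates the weights on each card.

```python
import math


def gpus_needed(total_vram_gb, vram_per_gpu_gb, usable_fraction=1.0):
    """Smallest number of cards whose usable VRAM covers the estimate."""
    usable_per_gpu = vram_per_gpu_gb * usable_fraction
    return math.ceil(total_vram_gb / usable_per_gpu)


total_gb = 52.5 + 0.5 * 200  # estimated total from the VRAM analysis

# With every byte of each 80GB card usable
print(gpus_needed(total_gb, 80))
# With an assumed 10% of each card held back as headroom
print(gpus_needed(total_gb, 80, usable_fraction=0.9))
```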
Deployment Solutions:
- Why Choose vLLM for Inference?
  - Community stability.
  - High concurrency and multi-GPU parallelism.
  - Numerous practical references available.
- Specific Solutions:
  - Run multiple vLLM instances across the two or three A100 GPUs, each exposing an API endpoint.
  - Preserve conversation continuity by having Nginx route requests by client IP, so that a given user always reaches the same instance.
- Throughput vs Latency Explanation:
  - Single instance, multiple GPUs: better when the goal is to maximize throughput; uses NCCL for parallel processing across GPUs.
  - Multiple instances + Nginx load balancer: higher concurrency and finer-grained load distribution, with minor, acceptable cross-instance communication latency.
- Other Deployment Options:
  - Kubernetes + LLM.
  - Inference tools like LMDeploy.
- Performance Testing References:
  - NCCL: NVIDIA NCCL Developer Guide
  - vLLM + NCCL: vLLM Answer NVLink NCCL
  - vLLM + Nginx Load Balancer + Docker: Deploying with Nginx
  - vLLM GitHub: vLLM Project
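To make the IP-based routing idea concrete, here is a minimal Python sketch of sticky routing: the same client IP always maps to the same vLLM instance. It is an illustration of the principle, not Nginx's actual hashing algorithm. The backend URLs and the `pick_backend` helper are hypothetical.

```python
import hashlib

# Hypothetical vLLM instances, e.g. one per GPU
BACKENDS = ["http://127.0.0.1:8000", "http://127.0.0.1:8001"]


def pick_backend(client_ip, backends=BACKENDS):
    """Map a client IP to a backend deterministically (sticky routing)."""
    # Use a stable hash; Python's built-in hash() is randomized per process
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Repeated requests from the same IP land on the same instance
for ip in ["10.0.0.5", "10.0.0.6", "10.0.0.5"]:
    print(ip, "->", pick_backend(ip))
```

In Nginx itself, this behavior is typically obtained with the `ip_hash` directive inside an `upstream` block. One caveat of any IP-based scheme is that clients behind the same NAT or proxy share an address and therefore all land on one instance, which can skew the load.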
Comparison of Inference Tools:
- SkyPilot:
  - Pros: Automatic resource scheduling, multi-cloud support, simplified deployment.
  - Cons: Limited community support, primarily cloud-focused.
  - Link: SkyPilot Documentation
- KServe:
  - Pros: Integrated with Kubernetes, supports various frameworks, automated management.
  - Cons: Depends on Kubernetes, higher learning curve.
  - Link: KServe Website
- KubeAI:
  - Pros: Automated resource management, high availability, compatible with cloud platforms.
  - Cons: Unstable community, higher costs, requires configuration and integration work.
  - Link: KubeAI
- Kubernetes:
  - Pros: Powerful scalability, containerized deployment, extensive ecosystem support.
  - Cons: Complex deployment and management, performance tuning requires expertise.
LMDeploy vs vLLM:
- LMDeploy outperformed vLLM in simple tests but has fewer documented community deployments. It remains an alternative inference tool to consider in the future.

xueshuai0922 · 2025-01-16T09:35:09Z

Does anyone have an idea that would solve this problem?

1 reply

xueshuai0922 Jan 16, 2025
Author

Any practical experience with this?
