Add Serving Benchmark Script #29
Open
This PR introduces a new benchmark script, serving_bench.py, to evaluate the engine's performance under a continuous load of incoming requests, simulating a real-world serving scenario.

Note: This PR is purely additive. No core files have been modified.
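To illustrate what "a continuous load of incoming requests" can look like, here is a minimal sketch of an open-loop load generator. It is not the code in serving_bench.py: the names fake_request and run_load are hypothetical, the engine call is replaced by a short sleep, and Poisson arrivals (exponentially distributed gaps) are an assumption that this PR does not confirm.

```python
import asyncio
import random
import time


async def fake_request(i: int) -> float:
    """Hypothetical stand-in for one engine call; returns its latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(0.01)  # placeholder for real generation work
    return time.perf_counter() - start


async def run_load(request_rate: float, num_requests: int, seed: int = 0) -> list[float]:
    """Launch requests at an average of `request_rate` per second without waiting for replies."""
    rng = random.Random(seed)
    tasks = []
    for i in range(num_requests):
        # Start the request immediately; it runs concurrently with later arrivals.
        tasks.append(asyncio.create_task(fake_request(i)))
        # Assumption: exponential gaps, i.e. Poisson arrivals averaging `request_rate` req/s.
        await asyncio.sleep(rng.expovariate(request_rate))
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    latencies = asyncio.run(run_load(request_rate=200.0, num_requests=20))
    print(f"completed {len(latencies)} requests")
```

Because new requests are issued on a schedule rather than after the previous one finishes, the engine sees overlapping work, which is what lets a batching scheduler group requests together.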
Key Features of serving_bench.py
Uses tqdm to display real-time progress and average latency.

Benchmark Results
The following results demonstrate the system's performance under different request rates (a single L20 48GB GPU, Qwen3-0.6B).
The results show that throughput scales effectively with the request rate, which validates the dynamic batching mechanism. As expected, higher throughput is achieved at the cost of increased latency.
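For readers who want to reproduce this kind of comparison, the sketch below shows one common way to derive throughput and mean latency from per-request timestamps. The RequestRecord type and summarize function are hypothetical and not taken from serving_bench.py, and the sample values are made up for illustration; they are not measurements from this PR.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start: float        # seconds, when the request was sent
    end: float          # seconds, when the response finished
    output_tokens: int  # tokens generated for this request


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Compute request throughput, token throughput, and mean end-to-end latency."""
    # Wall-clock span from the first send to the last completion.
    wall = max(r.end for r in records) - min(r.start for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    latencies = [r.end - r.start for r in records]
    return {
        "requests_per_s": len(records) / wall,
        "tokens_per_s": total_tokens / wall,
        "mean_latency_s": sum(latencies) / len(latencies),
    }


# Illustrative values only.
records = [
    RequestRecord(0.0, 2.0, 100),
    RequestRecord(0.5, 3.0, 150),
    RequestRecord(1.0, 4.0, 150),
]
summary = summarize(records)
print(summary)
```

Throughput is measured over the whole run while latency is measured per request, so the two can move in opposite directions: batching more requests together raises tokens per second even as each individual request waits longer.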
How to Use
# Run the benchmark with a specific request rate
python serving_bench.py --request-rate 16 --num-requests 256
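The two flags in the command above could be parsed along these lines. This is only a sketch using the standard argparse module: the parse_args helper, the types, the defaults, and the help strings are assumptions, and the actual script may define its options differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags shown in the usage command (types and defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Serving benchmark (illustrative sketch)")
    parser.add_argument("--request-rate", type=float, default=16.0,
                        help="average number of requests sent per second (assumed meaning)")
    parser.add_argument("--num-requests", type=int, default=256,
                        help="total number of requests to send (assumed meaning)")
    return parser.parse_args(argv)


# argparse converts dashes in flag names to underscores on the namespace.
args = parse_args(["--request-rate", "16", "--num-requests", "256"])
print(args.request_rate, args.num_requests)
```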