[Feat][CLI] enforce-include-usage #19695
Conversation
Summary of Changes
Hello @max-wittig, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request adds a new configuration option to enforce the inclusion of usage statistics in API responses, specifically addressing the current limitation where usage is not reported during streaming. This feature is intended to support use cases where vLLM is used behind a router or load balancer and per-request usage tracking is required for billing, rate limiting, or abuse detection.
Highlights
- New Configuration Option: Introduced a new boolean option `enforce_include_usage` in `ModelConfig` to control whether usage statistics are always included in API responses, even for streaming requests.
- Streaming Usage Inclusion: Modified the streaming response generators (`chat_completion_stream_generator` and `completion_stream_generator`) to include usage statistics in the final chunk if the new `enforce_include_usage` option is enabled, overriding the default behavior for streaming (a sketch of this logic follows below).
- API Server Integration: The new `enforce_include_usage` configuration is read from the model config and passed down to the `OpenAIServingChat` and `OpenAIServingCompletion` instances during API server initialization.
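To make the streaming behavior concrete, here is a minimal sketch of the decision logic described above. It is illustrative only, not the actual vLLM generator code; the function and parameter names are invented for the example.

```python
# Illustrative sketch -- not the actual vLLM implementation.
# Shows how a server-side enforce_include_usage flag can override the
# per-request stream_options.include_usage choice for the final chunk.
from typing import Optional


def should_emit_final_usage(
    enforce_include_usage: bool,
    stream_options_include_usage: Optional[bool],
) -> bool:
    """Return True if the last streamed chunk must carry a usage block."""
    if enforce_include_usage:
        # Server-wide override: report usage even when the client did not
        # opt in via stream_options.include_usage.
        return True
    # Default behavior: only honor the client's explicit opt-in.
    return bool(stream_options_include_usage)
```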
Code Review
This pull request introduces a new configuration option `enforce_include_usage` to consistently include usage statistics in API responses, particularly for streaming scenarios. The changes are well-contained and logically propagate the new flag through the relevant serving classes.
Key areas for improvement:
- Please add a docstring for the new `enforce_include_usage` field in `ModelConfig` to clarify its purpose.
- Consider whether `Optional[bool]` is strictly necessary for `enforce_include_usage` in `ModelConfig`, or if a plain `bool` would be more appropriate.
- The `__init__` method parameter `enforce_include_usage` in `OpenAIServingChat` is missing its type hint.
Additionally, please ensure the PR description (purpose, test plan, test results) is thoroughly filled out. Adding unit tests to verify the behavior of this new flag, especially its interaction with `stream_options.include_usage`, would be beneficial for long-term maintainability.
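As an illustration of the kind of check the review suggests, the sketch below streams a completion without opting into usage and asserts that the final chunk still carries it. This is not part of the PR's test suite; it assumes a locally running server with the enforcement option enabled, and the base URL and model name are placeholders.

```python
# Sketch of a possible end-to-end check (not from this PR).
# Assumes a vLLM OpenAI-compatible server is already running with usage
# enforcement enabled; base_url and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    # stream_options.include_usage is deliberately NOT set here; with
    # enforcement on, usage should still appear in the final chunk.
)

last_chunk = None
for chunk in stream:
    last_chunk = chunk

assert last_chunk is not None and last_chunk.usage is not None
print(last_chunk.usage)
```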
Force-pushed from 3eb0562 to 7befe5c
Local testing blocked by #15985
some light comments.
@aarnphm Thanks for the review! Let me know if I should squash my commits or if any other changes are required!
Force-pushed from 6d3f5bc to 482b261
For some reason I forgot to include these comments. Thanks for the perseverance.
vllm/config.py (Outdated)
enforce_include_usage: bool = False
"""Enforce including usage on every request."""
Actually, this seems to be strictly frontend; we probably don't want it to be in model_config here.
I think it is better to include it in the CLI; see the `enable-prompt-tokens-details` option. Maybe we can also have an `enable-force-include-usage` here.
FWIW, I think we should also add an option at the request level, since some might want certain requests to include usage and others not, given that usage will affect throughput quite a lot.
From our perspective as an operator, we always want to see the usage. I'm not sure how people would use the flag with half usage data. That might be unexpected.
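For context, the OpenAI-style API already has a per-request opt-in via `stream_options`; a minimal sketch of such a request body is shown below (the model name is a placeholder). With the proposed `enable-force-include-usage` server flag, usage would be reported even when this field is absent.

```python
# Existing request-level control: the client opts in per request.
# Under the proposed server-side flag, usage would be included regardless.
request_body = {
    "model": "my-model",  # placeholder
    "messages": [{"role": "user", "content": "Hi"}],
    "stream": True,
    "stream_options": {"include_usage": True},  # per-request opt-in
}
```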
Force-pushed from c8f21a3 to 1877c16
Currently, when streaming, the usage is always null. This prevents enforcing per-user limits and is a bit unexpected, as the usage is always returned when not streaming. This can be used if vLLM sits behind a router, such as vllm-router or litellm, serves many users, and needs to detect abuse, divide costs, etc. Signed-off-by: Max Wittig <[email protected]>
Force-pushed from 1877c16 to 690fd49
Ok, I'm fine with having this as the CLI.
@aarnphm Thank you! Is there a place where I could put some docs for this feature?
No need to; the CLI options are documented on https://docs.vllm.ai/en/latest/cli/index.html.
Purpose
Always returns usage tokens, which can be used by various systems for billing, rate limiting, abuse detection, and cost attribution. In addition, this aligns the behavior with the non-streaming mode, where the usage is always returned.
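As an illustration of that downstream use, a router or proxy in front of vLLM could account per-user tokens from the final streamed chunk once usage is always present. This is only a sketch with invented names (`record_usage`, `user_totals`), not code from this PR.

```python
# Illustrative sketch of per-user accounting in a router/proxy layer.
# record_usage and user_totals are hypothetical names, not vLLM APIs.
from collections import defaultdict

user_totals: dict[str, int] = defaultdict(int)


def record_usage(user_id: str, final_chunk) -> None:
    """Accumulate total tokens for billing, rate limiting, or abuse detection."""
    usage = getattr(final_chunk, "usage", None)
    if usage is not None:
        user_totals[user_id] += usage.total_tokens
```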
Test Plan
Test Result
(Optional) Documentation Update