
[Feat][CLI] enforce-include-usage #19695


Merged

Conversation


@max-wittig max-wittig commented Jun 16, 2025

Currently, when streaming, the usage is always null. This prevents enforcing per-user limits and is somewhat unexpected, since without streaming the usage is always returned.

This is useful when vLLM sits behind a router such as vllm-router or LiteLLM and serves many users, where per-request usage is important for detecting abuse, dividing costs, etc.

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Always return the token usage, which various downstream systems can use for:

  • Detecting abuse by users
  • Billing, or dividing costs internally within a company

In addition, this aligns the behavior with non-streaming mode, where the usage is always returned.

Test Plan

  1. Set up this commit and run vLLM.
  2. Send the following request and note that usage is returned in the final segment:
curl --request POST \
  --url https://yourhost.example.com/llm/v1/completions \
  --header 'apikey: {{token}}' \
  --header 'content-type: application/json' \
  --data '{
  "model": "qwen3-30b-a3b",
  "max_tokens": 100,
  "presence_penalty": 0,
  "frequency_penalty": 0,
  "temperature": 0.1,
  "prompt": "def write_hello():",
  "stream": true
}'

data: {"id":"cmpl-ef83ad46-8ca3-49dc-8371-790f281f60a1#8733163","object":"text_completion","created":1750142471,"model":"qwen3-30b-a3b","choices":[],"usage":{"prompt_tokens":4,"total_tokens":104,"completion_tokens":100}}

Test Result

(Optional) Documentation Update


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @max-wittig, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a new configuration option to enforce the inclusion of usage statistics in API responses, specifically addressing the current limitation where usage is not reported during streaming. This feature is intended to support use cases where vLLM is used behind a router or load balancer and per-request usage tracking is required for billing, rate limiting, or abuse detection.

Highlights

  • New Configuration Option: Introduced a new boolean option enforce_include_usage in ModelConfig to control whether usage statistics are always included in API responses, even for streaming requests.
  • Streaming Usage Inclusion: Modified the streaming response generators (chat_completion_stream_generator and completion_stream_generator) to include usage statistics in the final chunk if the new enforce_include_usage option is enabled, overriding the default behavior for streaming.
  • API Server Integration: The new enforce_include_usage configuration is read from the model config and passed down to the OpenAIServingChat and OpenAIServingCompletion instances during API server initialization.
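As a rough illustration of the second highlight, the inclusion decision made by the streaming generators can be thought of along these lines. This is a sketch based on the PR description and review discussion, not the actual vLLM code; the helper name, the stream_options handling, and the final-chunk shape shown in the comments are assumptions.

# Illustrative sketch only; not the actual vLLM generator code.
# Assumed names: enforce_include_usage (the new server-side flag) and
# stream_options.include_usage (the existing per-request opt-in).
def should_include_usage(stream_options, enforce_include_usage: bool) -> bool:
    requested = bool(stream_options and stream_options.include_usage)
    # The server-side flag forces usage on even when the client did not ask.
    return enforce_include_usage or requested

# Inside the stream generator, the final chunk would then carry a populated
# usage object (and an empty choices list) whenever the check is true, e.g.:
#
#   if should_include_usage(request.stream_options, enforce_include_usage):
#       yield final_chunk(choices=[], usage=UsageInfo(
#           prompt_tokens=num_prompt_tokens,
#           completion_tokens=num_generated_tokens,
#           total_tokens=num_prompt_tokens + num_generated_tokens))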


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Jun 16, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new configuration option enforce_include_usage to consistently include usage statistics in API responses, particularly for streaming scenarios. The changes are well-contained and logically propagate the new flag through the relevant serving classes.

Key areas for improvement:

  • Please add a docstring for the new enforce_include_usage field in ModelConfig to clarify its purpose.
  • Consider if Optional[bool] is strictly necessary for enforce_include_usage in ModelConfig or if bool would be more appropriate.
  • The __init__ method parameter enforce_include_usage in OpenAIServingChat is missing its type hint.

Additionally, please ensure the PR description (purpose, test plan, test results) is thoroughly filled out. Adding unit tests to verify the behavior of this new flag, especially its interaction with stream_options.include_usage, would be beneficial for long-term maintainability.
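Purely to illustrate those review points (docstring, plain bool, explicit type hint), a hypothetical shape could look like the snippet below. The class names mirror the ones discussed, but this is not the code in the PR.

# Hypothetical sketch of the review suggestions above; not the PR's actual code.

class ModelConfig:
    enforce_include_usage: bool = False  # plain bool rather than Optional[bool]
    """If True, always include usage statistics in responses, including the
    final chunk of streaming responses, regardless of stream_options."""


class OpenAIServingChat:
    def __init__(self, *, enforce_include_usage: bool = False) -> None:
        # Explicit type hint on the constructor parameter, as requested.
        self.enforce_include_usage = enforce_include_usage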

@max-wittig max-wittig force-pushed the feat/add-enforce-include-usage-option branch 9 times, most recently from 3eb0562 to 7befe5c (June 17, 2025 08:37)
@max-wittig
Contributor Author

Local testing blocked by #15985

@max-wittig max-wittig marked this pull request as ready for review June 17, 2025 11:52
@max-wittig max-wittig requested a review from aarnphm as a code owner June 17, 2025 11:52
@mergify mergify bot added the qwen (Related to Qwen models) label Jun 18, 2025
Collaborator

@aarnphm aarnphm left a comment


some light comments.

@aarnphm aarnphm added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jun 19, 2025
@max-wittig
Contributor Author

@aarnphm Thanks for the review! Let me know if I should squash my commits or if any other changes are required!

@max-wittig max-wittig force-pushed the feat/add-enforce-include-usage-option branch 5 times, most recently from 6d3f5bc to 482b261 (June 20, 2025 08:08)
Collaborator

@aarnphm aarnphm left a comment


For some reason I forgot to include these comments. Thanks for the perseverance.

vllm/config.py Outdated
Comment on lines 422 to 423
enforce_include_usage: bool = False
"""Enforce including usage on every request."""
Collaborator


Actually this seems to be strictly frontend; we probably don't want it to be in model_config here.

Collaborator


I think it is better to include it in the CLI; see the enable-prompt-tokens-details option.

Maybe we can also have an enable-force-include-usage here.
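For illustration, a frontend flag in that style could be declared roughly as follows. This is an argparse-style sketch, not vLLM's actual CLI wiring, and the flag name simply follows the suggestion above.

# Rough argparse-style sketch of the suggested frontend flag; this is not
# vLLM's actual CLI wiring, and the flag name follows the review suggestion.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable-force-include-usage",
    action="store_true",
    default=False,
    help="Always include usage in responses, including the final chunk of "
         "streaming responses, even when the request does not set "
         "stream_options.include_usage.",
)

args = parser.parse_args(["--enable-force-include-usage"])
print(args.enable_force_include_usage)  # True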

Collaborator


FWIW, I think we should also add an option at the request level, where some might want certain requests to have usage and some requests without, given that usage will affect throughput quite a lot.

Contributor Author


From our perspective as an operator, we always want to see the usage. I'm not sure how people would use the flag with usage data for only some requests; that might be unexpected.
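Worth noting for context: per-request opt-in already exists in the OpenAI-compatible API through stream_options.include_usage, independent of the flag added here. A minimal sketch, assuming an OpenAI-compatible client with a placeholder endpoint, key, and model:

# Sketch of the existing per-request opt-in (independent of this PR's flag).
# Endpoint, key, and model are placeholders; requires a client version that
# supports stream_options on the completions endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="qwen3-30b-a3b",
    prompt="def write_hello():",
    max_tokens=32,
    stream=True,
    stream_options={"include_usage": True},  # usage for this request only
)

for chunk in stream:
    if chunk.usage is not None:
        print(chunk.usage)  # populated only in the final chunk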

@aarnphm aarnphm changed the title from "feat: add enforce_include_usage option" to "[Feat] Add enforce_include_usage option" Jun 22, 2025
@max-wittig max-wittig force-pushed the feat/add-enforce-include-usage-option branch from c8f21a3 to 1877c16 (June 24, 2025 10:35)
Currently, when streaming, the usage is
always null. This prevents enforcing
per-user limits and is a bit unexpected,
as without streaming the usage is always
returned.

This can be used if vllm is between a
router, such as vllm-router or litellm,
and serves many users, and is important
to detect abuse, divide costs, etc.

Signed-off-by: Max Wittig <[email protected]>
@max-wittig max-wittig force-pushed the feat/add-enforce-include-usage-option branch from 1877c16 to 690fd49 (June 24, 2025 11:06)
Collaborator

@aarnphm aarnphm left a comment


Ok, I'm fine with having this as the CLI.

@aarnphm aarnphm changed the title from "[Feat] Add enforce_include_usage option" to "[Feat][CLI] enforce-include-usage" Jun 25, 2025
@aarnphm aarnphm merged commit f59fc60 into vllm-project:main Jun 25, 2025
70 checks passed
@max-wittig
Contributor Author

@aarnphm Thank you! Is there a place where I could put some docs for this feature?

@aarnphm
Collaborator

aarnphm commented Jun 25, 2025

No need to. On https://docs.vllm.ai/en/latest/cli/index.html we point to --help, and you already include the help string for it.

gmarinho2 pushed a commit to gmarinho2/vllm that referenced this pull request Jun 26, 2025
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
wwl2755-google pushed a commit to wwl2755-google/vllm that referenced this pull request Jul 1, 2025
Labels
frontend, qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed)