I want to disable thinking for thinking models in load testing #265
-
How do I do this when load testing through guidellm?
-
You can specify extra request parameters using the `extra_body` backend arg:

```bash
--backend-args='{"extra_body":{"chat_template_kwargs":{"enable_thinking":false}}}'
```
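For context, a minimal sketch of where that flag sits in a full invocation, assuming the CLI flags mirror the `GUIDELLM_*` variables used later in this thread; the target URL, model, and data spec below are placeholders, not values confirmed here:

```bash
# Sketch only: every value is a placeholder except --backend-args,
# which carries the extra_body payload from the answer above.
guidellm benchmark \
  --target "http://localhost:8000" \
  --model "Qwen/Qwen3-30B-A3B" \
  --rate-type concurrent --rate 1 \
  --max-seconds 60 \
  --data "prompt_tokens=512,output_tokens=128" \
  --backend-args='{"extra_body":{"chat_template_kwargs":{"enable_thinking":false}}}'
```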
-
@sjmonson Thanks for the answer! I want to run the test through Docker. Can you tell me the correct way to pass these values through environment variables? Would it be right to do it this way?

```bash
sudo docker run \
  --rm -it \
  -v "./data/guidellm:/results:rw" \
  -e GUIDELLM_TARGET=http://localhost:8000 \
  -e GUIDELLM_MODEL=Qwen/Qwen3-30B-A3B \
  -e GUIDELLM_PROCESSOR=Qwen/Qwen3-30B-A3B \
  -e GUIDELLM_RANDOM_SEED=2025 \
  -e GUIDELLM_RATE_TYPE=concurrent -e GUIDELLM_RATE=1,3,5,8 \
  -e GUIDELLM_MAX_SECONDS=300 \
  -e GUIDELLM_DATA="prompt_tokens=4096,output_tokens=512" \
  -e GUIDELLM__PREFERRED_ROUTE="chat_completions" \
  -e GUIDELLM__REQUEST_HTTP2=0 \
  ghcr.io/vllm-project/guidellm:latest \
  -- --backend-args='{"extra_body":{"chat_template_kwargs":{"enable_thinking":false}}}'
```
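As a sanity check (not from this thread): before debugging the guidellm side, you can confirm the server honors `chat_template_kwargs` by sending one request directly to the vLLM OpenAI-compatible endpoint, assuming the same target and model as above:

```bash
# Assumes the vLLM server from this thread is reachable at localhost:8000.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-30B-A3B",
        "messages": [{"role": "user", "content": "Say hi."}],
        "max_tokens": 64,
        "chat_template_kwargs": {"enable_thinking": false}
      }'
# With thinking disabled, the returned content should not include a
# <think>...</think> block.
```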
-
The above did not work. I tried passing it like this:

Thank you very much!
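One guess, purely an assumption since the failing command and output were not shared: with Docker's default bridge network, `localhost` inside the container refers to the container itself, so `GUIDELLM_TARGET=http://localhost:8000` never reaches a server running on the host. A sketch of that fix:

```bash
# Hypothetical fix, assuming the vLLM server runs on the Docker host:
# --network host makes localhost inside the container resolve to the host.
sudo docker run --rm -it --network host \
  -v "./data/guidellm:/results:rw" \
  -e GUIDELLM_TARGET=http://localhost:8000 \
  -e GUIDELLM_MODEL=Qwen/Qwen3-30B-A3B \
  ghcr.io/vllm-project/guidellm:latest \
  -- --backend-args='{"extra_body":{"chat_template_kwargs":{"enable_thinking":false}}}'
```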