
Conversation

windreamer
Collaborator

Motivation

LMDeploy’s TurboMind backend is the fastest inference stack in the ecosystem, yet it still lacks Guided Decoding – a feature that is already available in the PyTorch backend and heavily requested by the community.
This PR closes the gap by bringing token-level, C++ native Guided Decoding to TurboMind while keeping the API 100% compatible with the existing PyTorch backend.
The implementation is built on xGrammar (Apache-2.0), a high-performance C++ library that compiles JSON / Choice / Regex grammars into token FSMs and applies them with negligible overhead.

Modification

  1. Build-system

    • Add xgrammar as a header-only dependency via CMake FetchContent (CUDA & Python bindings disabled).
    • Export xgrammar::tokenizer_info and xgrammar::grammar_compiler symbols under lmdeploy::xgrammar.
  2. Core C++ changes

    • DynamicDecodeLayer pipeline extended with two new layers:
      • GuidedDecodeMaskLayer: in setup() compiles (or reuses) the grammar and builds a per-request token bitmask; in forward() launches a lightweight CUDA kernel that masks disallowed logits to -INF.
      • GuidedDecodeUpdateLayer: in forward() calls matcher->AcceptToken(output_id) to advance the FSM.
    • Grammar-compiler cache (LRU, keyed by schema hash) shared across all sessions to avoid re-compilation. A conceptual sketch of this per-step flow is given right after this list.
  3. Python frontend

    • Re-use existing guided_decoding utilities from PyTorch backend; no new API surface.
    • turbo.TurboMindEngine now accepts the same response_format= / guided_json= / guided_choice= arguments (see the usage sketch after the list).
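
For reference, here is a minimal conceptual sketch of what the two layers and the compiler cache do per decoding step. It is written in Python against xgrammar's Python API (TokenizerInfo, GrammarCompiler, GrammarMatcher, allocate_token_bitmask, apply_token_bitmask_inplace), not the C++ code in this PR; the model name and cache size are arbitrary placeholders.

```python
from functools import lru_cache

import torch
import xgrammar as xgr
from transformers import AutoTokenizer

# Any HF tokenizer works here; the model name is just a placeholder.
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat")
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer)
compiler = xgr.GrammarCompiler(tokenizer_info)


@lru_cache(maxsize=128)  # analogous to the LRU compiler cache keyed by schema hash
def compile_schema(schema: str) -> xgr.CompiledGrammar:
    return compiler.compile_json_schema(schema)


def decode_step(matcher: xgr.GrammarMatcher, logits: torch.Tensor) -> int:
    """One decoding step: mask disallowed tokens, pick a token, advance the FSM."""
    # Mask phase (what GuidedDecodeMaskLayer does): build the bitmask for the
    # current FSM state, then set disallowed logits to -inf in place.
    bitmask = xgr.allocate_token_bitmask(1, tokenizer_info.vocab_size)
    matcher.fill_next_token_bitmask(bitmask)
    xgr.apply_token_bitmask_inplace(logits, bitmask)
    token_id = int(torch.argmax(logits, dim=-1))
    # Update phase (what GuidedDecodeUpdateLayer does): advance the FSM.
    matcher.accept_token(token_id)
    return token_id


matcher = xgr.GrammarMatcher(compile_schema('{"type": "object"}'))
logits = torch.randn(1, tokenizer_info.vocab_size)
next_token = decode_step(matcher, logits)
```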

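A hypothetical usage sketch with the TurboMind backend: the response_format dict follows the shape already used by the PyTorch backend's structured-output API, and the model name and schema are placeholders.

```python
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

# Placeholder JSON schema for the guided output.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

pipe = pipeline("internlm/internlm2_5-7b-chat",
                backend_config=TurbomindEngineConfig())
resp = pipe(["Extract the person from: Alice is 30 years old."],
            gen_config=GenerationConfig(response_format=dict(
                type="json_schema",
                json_schema=dict(name="person", schema=schema))))
print(resp[0].text)
```
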
Checklist

  • Pre-commit hooks (clang-format, flake8, mypy) passed.
  • Documentation updated

@windreamer windreamer changed the title from "Guided decoding with xgrammar" to "[WIP] Guided decoding with xgrammar" on Sep 12, 2025
@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch 3 times, most recently from 8b3e766 to 8fd6d05 on September 12, 2025 at 09:44
@shell-nlp
Contributor

good job!

@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch 25 times, most recently from 0362250 to 8bcbfff on September 22, 2025 at 12:41
@windreamer
Collaborator Author

The outdated outlines dependency in the PyTorch engine has also been replaced with xgrammar. This lets us:

  • Drop the numpy<2 restriction
  • Drop the buggy pyairports package from the dependencies

@windreamer windreamer requested a review from grimoire September 28, 2025 09:06
@windreamer windreamer linked an issue Sep 28, 2025 that may be closed by this pull request
torchvision>=0.18.1,<0.23.0
transformers
uvicorn
xgrammar
Collaborator

Does xgrammar support NPU?

Collaborator Author

No, but XGrammar in the PyTorch engine currently uses CPU kernels rather than GPU ones, and it also seems to work on arm64 (untested).

@lvhan028 lvhan028 added the enhancement New feature or request label Sep 29, 2025
@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch from 1a55fc2 to 3a9b7ab on September 30, 2025 at 08:24
@grimoire
Copy link
Collaborator

I don't know much about guided decoding, but I think there are bugs in the PyTorch implementation (on the main branch).

The matcher is maintained in instances of RegexLogitsProcessor or JSONLogitsProcessor, and _get_guided_logits_processor only caches 32 instances. Different requests with the same guide get the same processor, and old processors are evicted once more than 32 guided requests come in.
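
To illustrate the failure mode, here is a stand-alone sketch with stand-in classes (not the actual lmdeploy code): an LRU-cached factory hands the same stateful processor object to different requests that share a guide, so one request's matcher state leaks into another's.

```python
from functools import lru_cache


class JSONLogitsProcessor:
    """Stand-in for the real processor, which owns a grammar matcher."""

    def __init__(self, schema: str):
        self.schema = schema
        self.accepted: list[int] = []  # stateful: grows as tokens are accepted

    def accept(self, token_id: int) -> None:
        self.accepted.append(token_id)


@lru_cache(maxsize=32)  # mirrors the 32-entry cache in _get_guided_logits_processor
def get_processor(schema: str) -> JSONLogitsProcessor:
    return JSONLogitsProcessor(schema)


req_a = get_processor('{"type": "object"}')
req_b = get_processor('{"type": "object"}')  # same guide -> the very same object
req_a.accept(42)
assert req_b.accepted == [42]  # request B observes request A's matcher state
```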

@windreamer windreamer marked this pull request as draft September 30, 2025 10:43
@windreamer
Copy link
Collaborator Author

I don't know much about guided decoding, but I think there are bugs in the PyTorch implementation (on the main branch).

The matcher is maintained in instances of RegexLogitsProcessor or JSONLogitsProcessor, and _get_guided_logits_processor only caches 32 instances. Different requests with the same guide get the same processor, and old processors are evicted once more than 32 guided requests come in.

You are right! It is a bit tough...

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Will the turbomind backend support guided_decoding?
5 participants