
Conversation

windreamer
Collaborator

Motivation

LMDeploy’s TurboMind backend is the fastest inference stack in the ecosystem, yet it still lacks Guided Decoding – a feature that is already available in the PyTorch backend and heavily requested by the community.
This PR closes the gap by bringing token-level, C++ native Guided Decoding to TurboMind while keeping the API 100% compatible with the existing PyTorch backend.
The implementation is built on xGrammar (Apache-2.0), a high-performance C++ library that compiles JSON / Choice / Regex grammars into token FSMs and applies them with negligible overhead.

Modification

  1. Build-system

    • Add xgrammar as a header-only dependency via CMake FetchContent (CUDA & Python bindings disabled).
    • Export xgrammar::tokenizer_info and xgrammar::grammar_compiler symbols under lmdeploy::xgrammar.
  2. Core C++ changes

    • DynamicDecodeLayer pipeline extended with two new layers:
      • GuidedDecodeMaskLayer: in setup() compiles (or reuses) the grammar and builds a per-request token bitmask; in forward() launches a lightweight CUDA kernel that masks disallowed logits to -INF.
      • GuidedDecodeUpdateLayer: in forward() calls matcher->AcceptToken(output_id) to advance the FSM.
    • Grammar-compiler cache (LRU, keyed by schema hash) shared across all sessions to avoid re-compilation. A conceptual sketch of this per-step flow is given right after this list.
  3. Python frontend

    • Re-use existing guided_decoding utilities from PyTorch backend; no new API surface.
    • turbo.TurboMindEngine now accepts the same response_format= / guided_json= / guided_choice= arguments (see the usage sketch after the list).
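
For reference, here is a minimal conceptual sketch of what the two layers and the compiler cache do per decoding step. It is written in Python against xgrammar's Python API (TokenizerInfo, GrammarCompiler, GrammarMatcher, allocate_token_bitmask, apply_token_bitmask_inplace), not the C++ code in this PR; the model name and cache size are arbitrary placeholders.

```python
from functools import lru_cache

import torch
import xgrammar as xgr
from transformers import AutoTokenizer

# Any HF tokenizer works here; the model name is just a placeholder.
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat")
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer)
compiler = xgr.GrammarCompiler(tokenizer_info)


@lru_cache(maxsize=128)  # analogous to the LRU compiler cache keyed by schema hash
def compile_schema(schema: str) -> xgr.CompiledGrammar:
    return compiler.compile_json_schema(schema)


def decode_step(matcher: xgr.GrammarMatcher, logits: torch.Tensor) -> int:
    """One decoding step: mask disallowed tokens, pick a token, advance the FSM."""
    # Mask phase (what GuidedDecodeMaskLayer does): build the bitmask for the
    # current FSM state, then set disallowed logits to -inf in place.
    bitmask = xgr.allocate_token_bitmask(1, tokenizer_info.vocab_size)
    matcher.fill_next_token_bitmask(bitmask)
    xgr.apply_token_bitmask_inplace(logits, bitmask)
    token_id = int(torch.argmax(logits, dim=-1))
    # Update phase (what GuidedDecodeUpdateLayer does): advance the FSM.
    matcher.accept_token(token_id)
    return token_id


matcher = xgr.GrammarMatcher(compile_schema('{"type": "object"}'))
logits = torch.randn(1, tokenizer_info.vocab_size)
next_token = decode_step(matcher, logits)
```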

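A hypothetical usage sketch with the TurboMind backend: the response_format dict follows the shape already used by the PyTorch backend's structured-output API, and the model name and schema are placeholders.

```python
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

# Placeholder JSON schema for the guided output.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

pipe = pipeline("internlm/internlm2_5-7b-chat",
                backend_config=TurbomindEngineConfig())
resp = pipe(["Extract the person from: Alice is 30 years old."],
            gen_config=GenerationConfig(response_format=dict(
                type="json_schema",
                json_schema=dict(name="person", schema=schema))))
print(resp[0].text)
```
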
Checklist

  • Pre-commit hooks (clang-format, flake8, mypy) passed.
  • Documentation updated

@windreamer windreamer changed the title from "Guided decoding with xgrammar" to "[WIP] Guided decoding with xgrammar" on Sep 12, 2025
@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch 3 times, most recently from 8b3e766 to 8fd6d05 on September 12, 2025 at 09:44
@shell-nlp
Contributor

good job!

@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch 25 times, most recently from 0362250 to 8bcbfff on September 22, 2025 at 12:41
@windreamer
Collaborator Author

The outdated outlines dependency in the PyTorch engine has also been replaced with xgrammar. This lets us:

  • Drop the numpy<2 restriction
  • Drop the buggy pyairports package from the dependencies

@windreamer windreamer requested a review from grimoire September 28, 2025 09:06
@windreamer windreamer linked an issue Sep 28, 2025 that may be closed by this pull request
torchvision>=0.18.1,<0.23.0
transformers
uvicorn
xgrammar
Collaborator

Does xgrammar support NPU?

Collaborator Author

No, but XGrammar in the PyTorch engine currently uses CPU kernels rather than GPU ones, and it also seems to work on arm64 (untested).

@lvhan028 lvhan028 added the enhancement New feature or request label Sep 29, 2025
@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch from 1a55fc2 to 3a9b7ab on September 30, 2025 at 08:24
@grimoire
Copy link
Collaborator

I don't know much about guided decoding, but I think there are bugs in the PyTorch implementation (on the main branch).

The matcher is maintained in instances of RegexLogitsProcessor or JSONLogitsProcessor, and _get_guided_logits_processor only caches 32 instances. Different requests with the same guide get the same processor, and old processors are evicted once more than 32 guided requests come in.
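
To illustrate the failure mode, here is a stand-alone sketch with stand-in classes (not the actual lmdeploy code): an LRU-cached factory hands the same stateful processor object to different requests that share a guide, so one request's matcher state leaks into another's.

```python
from functools import lru_cache


class JSONLogitsProcessor:
    """Stand-in for the real processor, which owns a grammar matcher."""

    def __init__(self, schema: str):
        self.schema = schema
        self.accepted: list[int] = []  # stateful: grows as tokens are accepted

    def accept(self, token_id: int) -> None:
        self.accepted.append(token_id)


@lru_cache(maxsize=32)  # mirrors the 32-entry cache in _get_guided_logits_processor
def get_processor(schema: str) -> JSONLogitsProcessor:
    return JSONLogitsProcessor(schema)


req_a = get_processor('{"type": "object"}')
req_b = get_processor('{"type": "object"}')  # same guide -> the very same object
req_a.accept(42)
assert req_b.accepted == [42]  # request B observes request A's matcher state
```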

@windreamer windreamer marked this pull request as draft September 30, 2025 10:43
@windreamer
Copy link
Collaborator Author

I don't know much about guided decoding, but I think there are bugs in the PyTorch implementation (on the main branch).

The matcher is maintained in instances of RegexLogitsProcessor or JSONLogitsProcessor, and _get_guided_logits_processor only caches 32 instances. Different requests with the same guide get the same processor, and old processors are evicted once more than 32 guided requests come in.

You are right! It is a bit tough...

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Will the turbomind backend support guided_decoding?
5 participants