
Conversation


@ceferisbarov ceferisbarov commented Aug 12, 2025

This PR adds support for structured generation (SG) to lm-evaluation-harness. @Saibo-creator and I have been working in this direction for a few weeks, and we already use our fork for internal experiments.

Motivation

SG is an established feature of most major LLM engines, and it is widely used both in agentic systems and in various reasoning tasks. There are already benchmarks that are intended to be used with SG (see JSONSchemaBench, StructEval). It is also possible (and an interesting area of exploration) to run other benchmarks with SG. We have made certain changes to lm-evaluation-harness so that such tasks can be created easily.

A new model: HFStructuredLM

This PR adds a new model, HFStructuredLM, which supports SG via the XGrammar engine. It accepts three grammar formats: JSON Schema, regex, and GBNF. Two task setups are possible:

  • A separate grammar for each record: Add a separate grammar column to the dataset.
  • A common grammar for the entire task: create a grammar file.
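To illustrate the per-record setup, here is a minimal sketch (the column name `grammar` and the helper below are hypothetical, not the PR's actual implementation), using a regex grammar for simplicity:

```python
import re

# Hypothetical sketch: each dataset row carries its own grammar
# (here a regex) in a dedicated "grammar" column.
records = [
    {"question": "2 + 2 = ?", "grammar": r"\d+"},
    {"question": "Is the sky blue?", "grammar": r"(yes|no)"},
]

def matches_grammar(prediction: str, grammar: str) -> bool:
    """Check a prediction against a regex grammar (regex case only)."""
    return re.fullmatch(grammar, prediction) is not None

print(matches_grammar("4", records[0]["grammar"]))      # True
print(matches_grammar("maybe", records[1]["grammar"]))  # False
```

In the shared-grammar setup, the same grammar string would instead be read once from a file and applied to every record in the task.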

Changes to the API

We also add three new metrics:

  • json_validity: Given a JSON string, returns 1 if the JSON is valid, 0 if not.
  • grammar_compliance: Given a grammar string (JSON Schema, regex, or GBNF), returns 1 if the prediction matches the grammar, 0 if not.
  • json_answer_match: Given a JSON string and a target field, performs exact match on that field.
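For intuition, here is a hedged sketch of what the first and third metrics compute, using only the standard library (this is an approximation, not the PR's code; the `field` parameter name is assumed):

```python
import json

def json_validity(prediction: str) -> int:
    """Return 1 if the string parses as JSON, 0 otherwise."""
    try:
        json.loads(prediction)
        return 1
    except json.JSONDecodeError:
        return 0

def json_answer_match(prediction: str, target: str, field: str = "answer") -> int:
    """Parse the prediction as JSON and exact-match the given field."""
    try:
        value = json.loads(prediction).get(field)
    except (json.JSONDecodeError, AttributeError):
        return 0  # invalid JSON, or valid JSON that is not an object
    return 1 if str(value) == str(target) else 0

print(json_validity('{"answer": 42}'))            # 1
print(json_answer_match('{"answer": 42}', "42"))  # 1
print(json_answer_match('{"answer": 41}', "42"))  # 0
```

grammar_compliance additionally needs a schema/regex/GBNF validator (the PR uses jsonschema for the JSON case), so it is not shown here.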

Sample tasks

We have extended the existing GSM8K task as a working sample of SG with JSON output. The following GitHub Gist contains two files:

https://gist.github.com/ceferisbarov/88b85b0423486807f5b633ab4bf9baaa

You can place them in lm_eval/tasks/gsm8k or another task folder and test SG.

Dependencies

We have added two new dependencies: jsonschema and xgrammar. We can also make these an optional extra.

Plans

This is ongoing work and we appreciate any feedback. Our backlog so far:

  • json_answer_match metric is an ad hoc solution. We need a more generalizable alternative. We can extend exact_match metric by adding new parameters.
  • grammar_compliance metric also contains ad hoc solutions, especially for JSON. We can make grammar_compliance an extensible metric that users import into their task's metrics.py file and modify.
  • Add documentation & tutorials.
  • Add other SG engines (we added only XGrammar for now).
  • Update jsonschemabench task to use default metrics instead of custom ones.
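One possible direction for the first backlog item is to generalize json_answer_match with a dotted field path instead of a single target field. A hedged sketch (all names hypothetical, not part of the PR):

```python
import json
from typing import Any

def extract_field(obj: Any, path: str) -> Any:
    """Walk a dotted path (e.g. "result.answer") through nested dicts."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

def answer_match(prediction: str, target: str, path: str = "answer") -> int:
    """Exact match on an arbitrarily nested field of a JSON prediction."""
    try:
        parsed = json.loads(prediction)
    except json.JSONDecodeError:
        return 0
    return 1 if str(extract_field(parsed, path)) == str(target) else 0

print(answer_match('{"result": {"answer": 7}}', "7", path="result.answer"))  # 1
```

The same idea could be exposed as new parameters on the existing exact_match metric, as the backlog suggests.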

CLAassistant commented Aug 12, 2025

CLA assistant check
All committers have signed the CLA.

@stakodiak

I would be interested to see a comparison of changes in scores

@ceferisbarov

@stakodiak

Here is a simple experiment:
Model: Qwen3-8B, full precision
Dataset: GSM8K, test set

No structured generation (original setup):

|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.8855|±  |0.0088|
|         |       |strict-match    |     8|exact_match|↑  |0.8840|±  |0.0088|

Structured generation:

|         Tasks         |Version|Filter|n-shot|      Metric      |   |Value |   |Stderr|
|-----------------------|------:|------|-----:|------------------|---|-----:|---|-----:|
|gsm8k-enforced-sampling|      3|none  |     0|grammar_compliance|↑  |0.9454|±  |0.0062|
|                       |       |none  |     0|json_answer_match |↑  |0.6558|±  |0.0131|
|                       |       |none  |     0|json_validity     |↑  |0.9454|±  |0.0062|

We used the following JSON schema:

{
  "properties": {
    "reasoning": {
      "title": "Reasoning",
      "type": "string"
    },
    "answer": {
      "title": "Answer",
      "type": "integer"
    }
  },
  "required": [
    "reasoning",
    "answer"
  ],
  "title": "Response",
  "type": "object"
}

exact_match and json_answer_match are directly comparable. grammar_compliance and json_validity fall below 1 because of the max_length parameter: once the output reaches the length limit, generation stops and the JSON remains incomplete. The json_answer_match result can probably be improved with a better schema.
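The truncation failure mode is easy to reproduce: a grammar-constrained output cut off at a length limit is a valid prefix of a JSON document, but not valid JSON on its own. A minimal illustration:

```python
import json

complete = '{"reasoning": "2+2=4", "answer": 4}'
truncated = complete[:20]  # cut off mid-generation, as max_length would

def is_valid_json(s: str) -> bool:
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json(complete))   # True
print(is_valid_json(truncated))  # False
```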

@ceferisbarov

@baberabb, @StellaAthena Hi! Do you have any feedback on this? We already use this fork internally and we can incorporate any changes you suggest.
