
Conversation


@ceferisbarov ceferisbarov commented Aug 12, 2025

This PR adds support for structured generation (SG) to lm-evaluation-harness. @Saibo-creator and I have been working in this direction for a few weeks, and we already use our fork for internal experiments.

Motivation

SG is an established feature of most major LLM engines, and it is widely used both in agentic systems and in various reasoning tasks. There are already benchmarks that are intended to be used with SG (see JSONSchemaBench, StructEval). It is also possible (and an interesting area of exploration) to run other benchmarks with SG. We have made certain changes to lm-evaluation-harness so that such tasks can be created easily.

A new model: HFStructuredLM

This PR adds a new model, HFStructuredLM, which supports SG via the XGrammar engine. It accepts three grammar formats: JSON Schema, regex, and GBNF. Two task setups are possible:

  • A separate grammar for each record: Add a separate grammar column to the dataset.
  • A common grammar for the entire task: create a grammar file.
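To illustrate the per-record setup, here is a minimal sketch (the column name `grammar` and the helper below are hypothetical, not the PR's actual implementation), using a regex grammar for simplicity:

```python
import re

# Hypothetical sketch: each dataset row carries its own grammar
# (here a regex) in a dedicated "grammar" column.
records = [
    {"question": "2 + 2 = ?", "grammar": r"\d+"},
    {"question": "Is the sky blue?", "grammar": r"(yes|no)"},
]

def matches_grammar(prediction: str, grammar: str) -> bool:
    """Check a prediction against a regex grammar (regex case only)."""
    return re.fullmatch(grammar, prediction) is not None

print(matches_grammar("4", records[0]["grammar"]))      # True
print(matches_grammar("maybe", records[1]["grammar"]))  # False
```

In the shared-grammar setup, the same grammar string would instead be read once from a file and applied to every record in the task.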

Changes to the API

We also add three new metrics:

  • json_validity: Given a JSON string, returns 1 if the JSON is valid, 0 if not.
  • grammar_compliance: Given a grammar string (JSON Schema, regex, or GBNF), returns 1 if the prediction matches the grammar, 0 if not.
  • json_answer_match: Given a JSON string and a target field, performs exact match on that field.
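For intuition, here is a hedged sketch of what the first and third metrics compute, using only the standard library (this is an approximation, not the PR's code; the `field` parameter name is assumed):

```python
import json

def json_validity(prediction: str) -> int:
    """Return 1 if the string parses as JSON, 0 otherwise."""
    try:
        json.loads(prediction)
        return 1
    except json.JSONDecodeError:
        return 0

def json_answer_match(prediction: str, target: str, field: str = "answer") -> int:
    """Parse the prediction as JSON and exact-match the given field."""
    try:
        value = json.loads(prediction).get(field)
    except (json.JSONDecodeError, AttributeError):
        return 0  # invalid JSON, or valid JSON that is not an object
    return 1 if str(value) == str(target) else 0

print(json_validity('{"answer": 42}'))            # 1
print(json_answer_match('{"answer": 42}', "42"))  # 1
print(json_answer_match('{"answer": 41}', "42"))  # 0
```

grammar_compliance additionally needs a schema/regex/GBNF validator (the PR uses jsonschema for the JSON case), so it is not shown here.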

Sample tasks

We have extended the existing GSM8K task as a working sample of SG with JSON output. The following GitHub Gist contains two files:

https://gist.github.com/ceferisbarov/88b85b0423486807f5b633ab4bf9baaa

You can place them in lm_eval/tasks/gsm8k or another task folder and test SG.

Dependencies

We have added two new dependencies: jsonschema and xgrammar. We can also make these an optional extra.

Plans

This is ongoing work and we appreciate any feedback. Our backlog so far:

  • json_answer_match metric is an ad hoc solution. We need a more generalizable alternative. We can extend exact_match metric by adding new parameters.
  • grammar_compliance metric also contains ad hoc solutions, especially for JSON. We can make grammar_compliance an extensible metric that users import into their task's metrics.py file and modify.
  • Add documentation & tutorials.
  • Add other SG engines (we added only XGrammar for now).
  • Update jsonschemabench task to use default metrics instead of custom ones.
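One possible direction for the first backlog item is to generalize json_answer_match with a dotted field path instead of a single target field. A hedged sketch (all names hypothetical, not part of the PR):

```python
import json
from typing import Any

def extract_field(obj: Any, path: str) -> Any:
    """Walk a dotted path (e.g. "result.answer") through nested dicts."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

def answer_match(prediction: str, target: str, path: str = "answer") -> int:
    """Exact match on an arbitrarily nested field of a JSON prediction."""
    try:
        parsed = json.loads(prediction)
    except json.JSONDecodeError:
        return 0
    return 1 if str(extract_field(parsed, path)) == str(target) else 0

print(answer_match('{"result": {"answer": 7}}', "7", path="result.answer"))  # 1
```

The same idea could be exposed as new parameters on the existing exact_match metric, as the backlog suggests.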

CLAassistant commented Aug 12, 2025

CLA assistant check
All committers have signed the CLA.

@stakodiak

I would be interested to see a comparison of changes in scores

@ceferisbarov

@stakodiak

Here is a simple experiment:
Model: Qwen3-8B, full precision
Dataset: GSM8K, test set

No structured generation (original setup):

|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.8855|±  |0.0088|
|         |       |strict-match    |     8|exact_match|↑  |0.8840|±  |0.0088|

Structured generation:

|         Tasks         |Version|Filter|n-shot|      Metric      |   |Value |   |Stderr|
|-----------------------|------:|------|-----:|------------------|---|-----:|---|-----:|
|gsm8k-enforced-sampling|      3|none  |     0|grammar_compliance|↑  |0.9454|±  |0.0062|
|                       |       |none  |     0|json_answer_match |↑  |0.6558|±  |0.0131|
|                       |       |none  |     0|json_validity     |↑  |0.9454|±  |0.0062|

We used the following JSON schema:

{
  "properties": {
    "reasoning": {
      "title": "Reasoning",
      "type": "string"
    },
    "answer": {
      "title": "Answer",
      "type": "integer"
    }
  },
  "required": [
    "reasoning",
    "answer"
  ],
  "title": "Response",
  "type": "object"
}

exact_match and json_answer_match are directly comparable. grammar_compliance and json_validity fall below 1 because of the max_length parameter: once the output reaches the length limit, generation stops and the JSON remains incomplete. The json_answer_match result can probably be improved with a better schema.
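The truncation failure mode is easy to reproduce: a grammar-constrained output cut off at a length limit is a valid prefix of a JSON document, but not valid JSON on its own. A minimal illustration:

```python
import json

complete = '{"reasoning": "2+2=4", "answer": 4}'
truncated = complete[:20]  # cut off mid-generation, as max_length would

def is_valid_json(s: str) -> bool:
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json(complete))   # True
print(is_valid_json(truncated))  # False
```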

@ceferisbarov

@baberabb, @StellaAthena Hi! Do you have any feedback on this? We already use this fork internally and we can incorporate any changes you suggest.
