Adding support for Structured Generation with XGrammar #3232
base: main
Conversation
I would be interested to see a comparison of changes in scores.
Here is a simple experiment.
No structured generation (original set-up):
Structured generation:
We used the following JSON schema:
@baberabb, @StellaAthena Hi! Do you have any feedback on this? We already use this fork internally and we can incorporate any changes you suggest.
This PR adds support for structured generation (SG) to lm-evaluation-harness. @Saibo-creator and I have been working in this direction for a few weeks, and we already use our fork for internal experiments.
Motivation
SG is an established feature of most major LLM engines, and it is widely used in both agentic systems and various reasoning tasks. There are already benchmarks intended to be used with SG (see JSONSchemaBench, StructEval). It is also possible (and an interesting area of exploration) to run other benchmarks with SG. We have made certain changes to lm-evaluation-harness so that such tasks can be created easily.
A new model: HFStructuredLM
This PR adds a new model, HFStructuredLM, that supports SG with the XGrammar engine. It accepts three grammar formats: JSON schema, regex, and GBNF. Two task setups are possible; one of them adds a grammar column to the dataset so that each sample can carry its own grammar.
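As a rough illustration of the per-sample setup (the field names here are hypothetical, not the PR's actual schema), a grammar column pairs each example with its own grammar, here a regex:

```python
import re

# Hypothetical task rows, each carrying its own regex grammar.
rows = [
    {"question": "What is 2 + 3?", "grammar": r"\d+"},
    {"question": "Is the sky blue? Answer yes or no.", "grammar": r"yes|no"},
]

predictions = ["5", "maybe"]

# Per-sample compliance: 1 if the prediction matches that row's grammar.
scores = [
    1 if re.fullmatch(row["grammar"], pred) else 0
    for row, pred in zip(rows, predictions)
]
print(scores)  # [1, 0]
```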
Changes to the API
We also add three new metrics:
- json_validity: given a JSON string, returns 1 if the JSON is valid, 0 if not.
- grammar_compliance: given a grammar (JSON schema, regex, or GBNF) and a prediction, returns 1 if the prediction matches the grammar, 0 if not.
- json_answer_match: given a JSON string and the target field, performs an exact match.
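For instance, json_validity might be sketched like this (a minimal stand-in using only the standard library, not the exact implementation in this PR):

```python
import json


def json_validity(prediction: str) -> int:
    """Return 1 if the prediction parses as valid JSON, 0 otherwise."""
    try:
        json.loads(prediction)
        return 1
    except json.JSONDecodeError:
        return 0


print(json_validity('{"answer": 42}'))  # 1
print(json_validity('{"answer": 42'))   # 0 (unbalanced brace)
```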
Sample tasks
We have extended the existing GSM8K task as a working sample of SG with JSON output. The following GitHub Gist contains two files:
https://gist.github.com/ceferisbarov/88b85b0423486807f5b633ab4bf9baaa
You can place them in lm_eval/tasks/gsm8k or another task folder and test SG.
Dependencies
We have added two new dependencies, jsonschema and xgrammar. We can also add these as an extra.
Plans
This is ongoing work and we appreciate any feedback. Our backlog so far:
- The json_answer_match metric is an ad hoc solution. We need a more generalizable alternative; we can extend the exact_match metric by adding new parameters.
- The grammar_compliance metric also contains ad hoc solutions, especially for JSON. We can make grammar_compliance an extensible metric that users import into their task's metrics.py file and modify.
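For the first backlog item, one possible direction (a sketch; the parameter names are hypothetical, not a committed API) is an exact_match-style metric that first extracts the target field from the JSON prediction:

```python
import json


def json_answer_match(prediction: str, target: str, field: str = "answer") -> int:
    """Return 1 if `field` of the JSON prediction exactly matches `target`."""
    try:
        # AttributeError covers valid JSON that is not an object (e.g. a list).
        value = json.loads(prediction).get(field)
    except (json.JSONDecodeError, AttributeError):
        return 0
    return 1 if str(value) == target else 0


print(json_answer_match('{"answer": 42, "reasoning": "..."}', "42"))  # 1
print(json_answer_match('{"answer": 41}', "42"))                      # 0
```

Folding this into exact_match as extra parameters (e.g. a field-extraction step before comparison) would avoid maintaining a separate ad hoc metric.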