
Support for Models With Pre-Finetuned LoRA Adapters in GRPO: Add use_peft_as_reference Flag #3196


Open
wants to merge 4 commits into main

Conversation

@LoganVegnaSHOP commented Mar 31, 2025

What does this PR do?

This PR introduces a new flag, use_peft_as_reference, to the GRPO configuration and trainer. When the GRPO trainer is used with PEFT models (e.g., LoRA or quantized models), it creates the reference model by disabling the adapter by default. This behavior is undesirable when the model has already been fine-tuned with LoRA weights: the reference model should mirror the full model, adapter included, to avoid unwanted divergence.
With use_peft_as_reference set to True, the reference model is instead created from the full PEFT model (via create_reference_model), retaining the fine-tuned adapter weights (and any quantization) in the reference. This keeps the policy and reference models closely matched, which translates into more stable KL divergence during training.
Additionally, the PR adds tests to verify that when the flag is enabled, the ref_model is properly set.
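The intended behaviour can be sketched with toy stand-ins (ToyPeftModel and build_reference are hypothetical illustrations, not the PR's actual code; the real trainer operates on PEFT models and uses create_reference_model):

```python
import copy

class ToyPeftModel:
    """Minimal stand-in for a PEFT-wrapped model: a base weight plus a
    LoRA delta that can be toggled on or off (for illustration only)."""
    def __init__(self, base, lora):
        self.base = base
        self.lora = lora
        self.adapter_enabled = True

    def forward(self):
        return self.base + (self.lora if self.adapter_enabled else 0.0)

def build_reference(policy, use_peft_as_reference):
    """Sketch of the reference-model choice the flag controls."""
    ref = copy.deepcopy(policy)
    if not use_peft_as_reference:
        # Default GRPO behaviour: the reference is the policy with the
        # adapter disabled, i.e. effectively the original base model.
        ref.adapter_enabled = False
    # With the flag set, the reference mirrors the full PEFT model,
    # fine-tuned adapter weights included.
    return ref
```

With a pre-finetuned adapter, the flag keeps the reference output aligned with the policy's starting point instead of reverting it to the base model.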

Fixes #3194

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qgallouedec (Member) commented:

Thanks @LoganVegnaSHOP. I wonder in what situation one would want this flag to be set to False?

@LoganVegnaSHOP (Author) commented Mar 31, 2025

> Thanks @LoganVegnaSHOP. I wonder in what situation one would want this flag to be set to False?

If you are initializing brand-new LoRA adapters, it would be more efficient to set the flag to False, since the reference model would then be slightly smaller by excluding the LoRA weights. @qgallouedec
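Toy numbers (assumptions, not measurements) illustrating the point: with freshly initialized adapters there is nothing fine-tuned to preserve, so excluding the adapter weights yields a slightly smaller reference model.

```python
# Hypothetical parameter counts for illustration only.
base_params = 7_000_000_000   # assumed base-model parameter count
lora_params = 4_000_000       # assumed LoRA adapter parameter count

# use_peft_as_reference=True: the reference mirrors the full PEFT model.
ref_with_adapter = base_params + lora_params
# use_peft_as_reference=False (default): adapter excluded from the reference.
ref_without_adapter = base_params

savings = ref_with_adapter - ref_without_adapter  # parameters saved
savings_bytes = savings * 2                       # assuming bf16 (2 bytes/param)
```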

Successfully merging this pull request may close these issues:

  • GRPO with pre-finetuned LoRA model as reference (#3194)