
Support for Models With Pre-Finetuned LoRA Adapters in GRPO: Add use_peft_as_reference Flag #3196


Open
wants to merge 4 commits into main

Conversation

@LoganVegnaSHOP commented Mar 31, 2025

What does this PR do?

This PR introduces a new flag, use_peft_as_reference, to the GRPO configuration and trainer. When the GRPO trainer is used with PEFT models (e.g., LoRA or quantized models), it creates the reference model by disabling the adapter by default. This behavior is undesirable when the model has already been fine-tuned with LoRA weights: the reference model should mirror the full model, adapter included, to avoid unwanted divergence.
With use_peft_as_reference set to True, the reference model is instead created from the full PEFT model (via create_reference_model), retaining the fine-tuned adapter weights (and any quantization) in the reference. This keeps the policy and reference models closely matched, which translates into more stable KL divergence during training.
Additionally, the PR adds tests to verify that when the flag is enabled, the ref_model is properly set.
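The intended behaviour can be sketched with toy stand-ins (ToyPeftModel and build_reference are hypothetical illustrations, not the PR's actual code; the real trainer operates on PEFT models and uses create_reference_model):

```python
import copy

class ToyPeftModel:
    """Minimal stand-in for a PEFT-wrapped model: a base weight plus a
    LoRA delta that can be toggled on or off (for illustration only)."""
    def __init__(self, base, lora):
        self.base = base
        self.lora = lora
        self.adapter_enabled = True

    def forward(self):
        return self.base + (self.lora if self.adapter_enabled else 0.0)

def build_reference(policy, use_peft_as_reference):
    """Sketch of the reference-model choice the flag controls."""
    ref = copy.deepcopy(policy)
    if not use_peft_as_reference:
        # Default GRPO behaviour: the reference is the policy with the
        # adapter disabled, i.e. effectively the original base model.
        ref.adapter_enabled = False
    # With the flag set, the reference mirrors the full PEFT model,
    # fine-tuned adapter weights included.
    return ref
```

With a pre-finetuned adapter, the flag keeps the reference output aligned with the policy's starting point instead of reverting it to the base model.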

Fixes #3194

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qgallouedec (Member) commented:

Thanks @LoganVegnaSHOP. I wonder in what situation one would want this flag to be set to False?

@LoganVegnaSHOP (Author) commented Mar 31, 2025

> Thanks @LoganVegnaSHOP. I wonder in what situation one would want this flag to be set to False?

If you are initializing brand-new LoRA adapters, it would be more efficient to set the flag to False, since the reference model would then be slightly smaller by excluding the LoRA weights. @qgallouedec
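Toy numbers (assumptions, not measurements) illustrating the point: with freshly initialized adapters there is nothing fine-tuned to preserve, so excluding the adapter weights yields a slightly smaller reference model.

```python
# Hypothetical parameter counts for illustration only.
base_params = 7_000_000_000   # assumed base-model parameter count
lora_params = 4_000_000       # assumed LoRA adapter parameter count

# use_peft_as_reference=True: the reference mirrors the full PEFT model.
ref_with_adapter = base_params + lora_params
# use_peft_as_reference=False (default): adapter excluded from the reference.
ref_without_adapter = base_params

savings = ref_with_adapter - ref_without_adapter  # parameters saved
savings_bytes = savings * 2                       # assuming bf16 (2 bytes/param)
```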

Successfully merging this pull request may close these issues:

  • GRPO with pre-finetuned LoRA model as reference (#3194)