Add TRL GRPO Reasoning with Advanced Reward notebook #319
base: main
Conversation
This notebook demonstrates how to use TRL (Transformers Reinforcement Learning) with GRPO (Group Relative Policy Optimization) for reasoning tasks with advanced reward mechanisms.
- Added notebook with proper lowercase filename
- Updated _toctree.yml and index.md
- Added proper author attribution
- Cleaned non-informative outputs

Contributed by: Behrooz Azarkhalili
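For readers landing on this PR, a minimal sketch of the kind of setup the notebook builds on. The model name, dataset handling, and reward function below are illustrative assumptions, not the notebook's exact code:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Illustrative choices (assumptions, not the notebook's exact code):
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")  # GRPOTrainer expects a "prompt" column

def brevity_reward(completions, **kwargs):
    # Toy reward: GRPO reward functions receive the generated completions
    # (plus dataset columns as keyword arguments) and return one float each.
    return [-len(c) / 100.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # small model, demo only
    reward_funcs=brevity_reward,           # one callable, or a list of them
    args=GRPOConfig(output_dir="grpo-reasoning-demo"),
    train_dataset=dataset,
)
trainer.train()
```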
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
- Remove torch and accelerate from installation (dependencies of TRL)
- Remove pad token check (handled automatically)
- Restore num_generations to default value (8)
- Remove remove_unused_columns parameter (false by default)
- Remove processing_class parameter (loaded automatically)
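In code terms, the trimmed-down setup from that commit looks roughly like this (a sketch; the output path is a placeholder):

```python
# torch and accelerate come in as dependencies of TRL, so one line covers the install:
# !pip install trl datasets

from trl import GRPOConfig

# Per the commit above: num_generations already defaults to 8,
# remove_unused_columns to False, and the tokenizer (processing_class)
# is loaded automatically, so none of them need to be spelled out.
config = GRPOConfig(output_dir="grpo-output")  # placeholder path
```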
Thanks for the addition! 😄
We already have a pretty similar example, "Post training an LLM for reasoning with GRPO in TRL".
The idea of the repo is to have end-to-end recipes with extended explanations, so I'd suggest:
- Extending the explanations throughout the recipe.
- Linking the previous example and making a clear distinction between the two, explained at the beginning. Otherwise, it could confuse a reader looking for a GRPO example.

The recipes can be opened in Colab and possibly run there, so it'd also be nice to keep that in mind, for example when doing `os.environ["CUDA_VISIBLE_DEVICES"] = "1"`, since in Colab there is only one GPU (see the device-setup sketch below).
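One possible device-agnostic setup along those lines (a sketch, not the notebook's final code):

```python
import os

# Set the visible device *before* importing torch, and only as a default:
# Colab exposes exactly one GPU (index 0), so hardcoding "1" there hides it.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```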
```diff
@@ -7,6 +7,7 @@ applications and solving various machine learning tasks using open-source tools

 Check out the recently added notebooks:

+- [TRL GRPO Reasoning with Advanced Reward](trl_grpo_reasoning_advanced_reward)
```
You can remove the last entry, since we aim to keep only the last 5 here.
Hi @sergiopaniego, I just added the notes you mentioned. I hope the extension and the differences between the two versions make sense now! 😊
…O recipe
- Add direct link to existing HuggingFace GRPO cookbook example
- Fix CUDA device setting for Colab compatibility (auto-detect instead of hardcoded)
- Add comprehensive explanations throughout all recipe sections
- Enhance with detailed comparison table showing differences from basic example
- Improve GPU setup with memory information and fallback instructions
- Add detailed LoRA configuration explanations and parameter analysis
- Expand dataset preparation with GSM8K background and format details
- Detail multi-reward system design for mathematical reasoning approach
- Optimize training configuration with Colab-specific memory settings
- Enhance testing and evaluation with detailed response analysis
- Make notebook fully end-to-end recipe focused for cookbook standards
- Address all reviewer feedback comprehensively for cookbook contribution
Some possible ideas for improving this section:
- I'd reduce this section, since it contains too much text. Instead, you can distribute the ideas where they're more suitable, for example explaining the reward functions in the section where you introduce them.
- I'd remove the sections and subsections and only keep the title. If you want to add some relevant information, consider using bold style.
- The comparison against the other example contains some problems. For example, the other example has two reward functions, but here you say it has only one (see the sketch below). I'd suggest reviewing that.
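For reference, a hedged sketch of a two-function reward setup like the one in the existing example. The `<think>`/`<answer>` tags and the `answer` column name are assumptions:

```python
import re

def format_reward(completions, **kwargs):
    # 1.0 when the completion follows the assumed <think>...</think><answer>...</answer> template
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def accuracy_reward(completions, answer, **kwargs):
    # 1.0 when the extracted answer matches the ground-truth column (column name assumed)
    rewards = []
    for completion, truth in zip(completions, answer):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        rewards.append(1.0 if match and match.group(1).strip() == str(truth).strip() else 0.0)
    return rewards

# GRPOTrainer accepts a list, so both rewards contribute to the group-relative advantage:
# trainer = GRPOTrainer(..., reward_funcs=[format_reward, accuracy_reward], ...)
```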
Code blocks include a lot of code without explanation. I'd suggest dividing them into meaningful subblocks and adding some explanation. Let's think about the target audience (a learner) :)
It'd be nice if we could reduce these blocks a little, since they contain a lot of detail. Are all the hyperparams needed? (A trimmed configuration is sketched below.)
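One way to answer that: keep only the settings that deviate from GRPOConfig defaults on a Colab-sized GPU. A trimmed sketch (values illustrative, not the notebook's):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-reasoning",
    per_device_train_batch_size=8,   # on a single GPU, must be divisible by num_generations (default 8)
    gradient_accumulation_steps=2,   # effective batch of 16 on one GPU
    learning_rate=1e-5,
    max_completion_length=256,       # cap generation length to control memory
    bf16=True,                       # if the GPU supports bfloat16
    logging_steps=10,
)
# Everything not listed stays at its default.
```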
We should explain the decisions made throughout the notebook. Why do we need a callback? Always think of a possible reader/learner. (A minimal example is sketched below.)
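To make that concrete, a minimal sketch of the kind of callback the notebook could explain; the `reward` log key is an assumption about what TRL logs:

```python
from transformers import TrainerCallback

class RewardLoggingCallback(TrainerCallback):
    """Print the mean reward at each logging step so a learner can watch progress."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "reward" in logs:  # key name assumed
            print(f"step {state.global_step}: mean reward = {logs['reward']:.3f}")

# trainer.add_callback(RewardLoggingCallback())  # attach to the GRPOTrainer
```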
Could you also resolve the conflicts with main? 😄
Summary
This notebook demonstrates GRPO (Group Relative Policy Optimization) fine-tuning for mathematical reasoning using the TRL library with advanced reward mechanisms.
Key Features
Requirements Checklist
@merveenoyan @stevhliu
Contributed by: Behrooz Azarkhalili