Reward modeling in torchtune RFC
Core issues
Proposal
Proposal: Close these gaps using the tools that are already present in torchtune.
Custom loss
We need a special loss for reward modeling (it comes from the Bradley-Terry model):

$$\mathcal{L}_{\text{BT}} = -\log \sigma\left(r_1 - r_2\right)$$

where $r_1$ and $r_2$ are the chosen and rejected rewards, respectively.
How to implement in torchtune: basically just `-F.logsigmoid(rewards_chosen - rewards_rejected).mean()`. Probably one more file in the `rlhf` directory related to reward modeling. It is important to make it flexible enough to allow training with different objectives.
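For example, a minimal sketch of what such a loss could look like as a standalone module (the class name and its placement are assumptions, not an existing torchtune API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BradleyTerryRewardLoss(nn.Module):
    """Pairwise Bradley-Terry loss for reward model training.

    Hypothetical module; the name and signature are not part of the
    current torchtune API.
    """

    def forward(
        self, rewards_chosen: torch.Tensor, rewards_rejected: torch.Tensor
    ) -> torch.Tensor:
        # -log sigmoid(r_chosen - r_rejected), averaged over the batch
        return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```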
Reward centering
In many scenarios, it’s preferable to ensure that a reward model’s output is mean zero. This is often done by first calculating the model’s average score and then subtracting it.
https://arxiv.org/abs/2312.09244 introduces an auxiliary loss function designed to directly learn a centered reward model. This auxiliary loss minimizes the squared sum of the rewards, encouraging the model to naturally produce mean-zero outputs:

$$\mathcal{L}_{\text{centering}} = \left(r_1 + r_2\right)^2$$

This component is added to the main loss with some weighting coefficient $\omega$.
How to implement in torchtune: basically just add this component to the loss: `centering_coefficient * torch.mean((rewards_chosen + rewards_rejected) ** 2)`.
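Sketched as a single objective (the function and argument names, with `centering_coefficient` standing in for $\omega$, are assumptions):

```python
import torch
import torch.nn.functional as F


def reward_loss_with_centering(
    rewards_chosen: torch.Tensor,
    rewards_rejected: torch.Tensor,
    centering_coefficient: float = 0.01,  # omega; the value is illustrative
) -> torch.Tensor:
    # Bradley-Terry term
    bt_loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
    # Auxiliary centering term: penalize the squared sum of paired rewards
    # so the model's outputs center around zero.
    centering_loss = torch.mean((rewards_chosen + rewards_rejected) ** 2)
    return bt_loss + centering_coefficient * centering_loss
```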
Adding a margin to the loss
It might be effective to calculate a margin and add it to the BT loss (similarly to how it's done in the Llama papers); basically, it just requires an extra column in the dataset and a simple calculation.
How to implement in torchtune: we might need a custom reward modeling dataset in torchtune, with a `margin` column and without a prompt. Then, simply subtract the `margin` from the reward difference in the common loss.
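A sketch of that subtraction, assuming the `margin` tensor is read from the extra dataset column (the function name is an assumption):

```python
import torch
import torch.nn.functional as F


def margin_reward_loss(
    rewards_chosen: torch.Tensor,
    rewards_rejected: torch.Tensor,
    margin: torch.Tensor,
) -> torch.Tensor:
    # Llama-2-style margin: subtract the per-pair margin from the reward gap
    # before the log-sigmoid, pushing chosen and rejected rewards further apart.
    return -F.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()
```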
Generation recipe utilization
This is the interesting one. We need to make it possible to do two things for reward modeling:
A `PreferenceTransform`? Given a dataset with prompt, response1, and response2, transform it into chosen, rejected, and margin, where the margin is calculated through the
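A hedged sketch of such a transform; `score_fn` (e.g. an existing reward model or a judge) is a hypothetical callable that assigns a scalar score to a (prompt, response) pair, and the class is not an existing torchtune component:

```python
from typing import Any, Callable, Mapping


class PreferenceTransform:
    """Map {prompt, response1, response2} samples to {chosen, rejected, margin}.

    Hypothetical transform; `score_fn` is assumed to return a scalar score
    for a (prompt, response) pair, e.g. from an existing reward model.
    """

    def __init__(self, score_fn: Callable[[str, str], float]):
        self.score_fn = score_fn

    def __call__(self, sample: Mapping[str, Any]) -> dict[str, Any]:
        s1 = self.score_fn(sample["prompt"], sample["response1"])
        s2 = self.score_fn(sample["prompt"], sample["response2"])
        chosen, rejected = (
            (sample["response1"], sample["response2"])
            if s1 >= s2
            else (sample["response2"], sample["response1"])
        )
        # One possible definition of the margin: the absolute score gap.
        return {"chosen": chosen, "rejected": rejected, "margin": abs(s1 - s2)}
```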
Binary -> Embeddings representation of preferences
There is a way to infer multidimensional human preferences through some tricky PCA to identify orthogonal basis vectors, each capturing a distinct human preference direction: https://arxiv.org/pdf/2502.13131
How to implement in torchtune: it will require a separate recipe, a PCA step, and a separate dataset.
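A rough sketch of the PCA step, assuming we already have per-pair embeddings (e.g. last-hidden-state features for the chosen and rejected responses); the feature choice and everything else here are assumptions, not the paper's exact procedure:

```python
import torch


def preference_directions(
    chosen_emb: torch.Tensor,    # [num_pairs, hidden_dim]
    rejected_emb: torch.Tensor,  # [num_pairs, hidden_dim]
    num_directions: int = 8,
) -> torch.Tensor:
    # Difference vectors roughly encode "what made chosen better than rejected".
    diffs = chosen_emb - rejected_emb
    # PCA via low-rank SVD (pca_lowrank centers the data by default); the
    # columns of V are orthogonal basis vectors, each hopefully capturing a
    # distinct preference direction.
    _, _, v = torch.pca_lowrank(diffs, q=num_directions)
    return v  # [hidden_dim, num_directions]
```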
General thoughts
Except for the last idea, we might eliminate the requirement for a separate recipe, but we need to create a new dataset type inheriting from PreferenceDataset (maybe some extra abstraction here?), a loss, and a transform.
Within this, the only thing that users might want to touch in configs to enable reward modeling is the loss section; basically, it might look like the snippet below.
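For illustration only; the `_component_` path and field names are hypothetical, not existing torchtune components:

```yaml
loss:
  _component_: torchtune.rlhf.loss.RewardModelingLoss  # hypothetical component
  centering_coefficient: 0.01   # omega for the reward-centering term
  use_margin: True              # subtract the dataset margin inside the BT loss
```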
We have strong evidence that cross-prompt modeling works better than same-prompt modeling, so I assume we need to introduce it directly in torchtune, while some features might be delegated to users.
We might also want to introduce packing because of the potential size of reward modeling datasets, but it is still non-trivial for preference datasets.