
Commit eda3e7c

committed
Upgraded based on review
1 parent 0789a31 commit eda3e7c

File tree

1 file changed (+9, -5 lines)


notebooks/en/fine_tuning_vlm_grpo_trl.ipynb

Lines changed: 9 additions & 5 deletions
@@ -82,7 +82,7 @@
8282
"\n",
8383
"We leverage [lmms-lab/multimodal-open-r1-8k-verified](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) for this recipe. This dataset contains 8k multimodal RL training examples focused on math reasoning. This data was created using GPT4o and includes `image`, `problem`, `solution`, `original question` and `original answer` for each sample. It was created in [this project](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal).\n",
8484
"\n",
85-
"For our particular case where we want the model to learn to reason using images, we use as input `image` and `problem` and as output `solution` columns.\n",
85+
"For our particular case where we want the model to learn to reason using images, we use `image` and `problem` as input and `solution` as output.\n",
8686
"\n",
8787
"For this educational resource, we'll only use 5% of the dataset and divide it into train and test sets to make it faster to train. In a real training, we'd use the full dataset.\n",
8888
"\n",
@@ -172,7 +172,7 @@
172172
"\n",
173173
"We convert the dataset samples into conversation samples, including the system prompt and one image and problem description per sample, since this is how the GRPO trainer expects them.\n",
174174
"\n",
175-
"We also set `padding_side=\"left\"` to ensure that generated completions during trainig are concatenated directly after the prompt, which is essential for GRPO to correctly compare token-level probabilities between preferred and rejected responses."
175+
"We also set `padding_side=\"left\"` to ensure that generated completions during training are concatenated directly after the prompt, which is essential for GRPO to correctly compare token-level probabilities between preferred and rejected responses."
176176
],
177177
"metadata": {
178178
"id": "6isapXWue91d"
@@ -495,7 +495,7 @@
495495
"\n",
496496
"\n",
497497
"\n",
498-
"For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward model evaluates whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between `<think> </think>` tags. You can find more details [here](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py). We can simply define and implement these reward functions as generic Python functions.\n",
498+
"For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward model that evaluates whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between `<think> </think>` tags. You can find more details [here](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py). We can simply define and implement these reward functions as generic Python functions.\n",
499499
"\n",
500500
"In this case, we will utilize the following reward functions, directly extracted from the Open R1 [implementation](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py):\n",
501501
"\n",
@@ -596,7 +596,9 @@
596596
"\n",
597597
"Next, let's configure the training parameters for GRPO. We recommend experimenting with the `max_completion_length`, `num_generations`, and `max_prompt_length` parameters.\n",
598598
"\n",
599-
"It'd be interesting to play with the `max_completion_length`, `num_generations`, and `max_prompt_length` params in order to find the best traininig combination."
599+
"It'd be interesting to play with the `max_completion_length`, `num_generations`, and `max_prompt_length` params in order to find the best training combination.\n",
600+
"\n",
601+
"The parameter selection has been adjusted to fit within the hardware limitations of a Google Colab session. To observe the full potential of reward improvements, especially in the second objective function, and to further improve the model's reasoning capabilities in a real-world scenario, a more ambitious setup would be required. This would involve larger models, an increased number of generations, and a high-quality, diverse dataset."
600602
],
601603
"metadata": {
602604
"id": "qW_3r8T1EtNg"
@@ -876,7 +878,9 @@
876878
{
877879
"cell_type": "markdown",
878880
"source": [
879-
"Let's save the results in our account 💾"
881+
"We can review the training metrics directly in TensorBoard on the [model page]((https://huggingface.co/sergiopaniego/Qwen2.5-VL-3B-Instruct-Thinking/tensorboard). While the loss curve might look a bit off, the reward results tell a clearer story: the model steadily improves, increasing the amount of reward it receives over time.\n",
882+
"\n",
883+
"Now, let's save the results in our account 💾"
880884
],
881885
"metadata": {
882886
"id": "z7_y1x7E1JY9"
