|
82 | 82 | "\n", |
83 | 83 | "We leverage [lmms-lab/multimodal-open-r1-8k-verified](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) for this recipe. This dataset contains 8k multimodal RL training examples focused on math reasoning. This data was created using GPT4o and includes `image`, `problem`, `solution`, `original question` and `original answer` for each sample. It was created in [this project](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal).\n", |
84 | 84 | "\n", |
85 | | - "For our particular case where we want the model to learn to reason using images, we use as input `image` and `problem` and as output `solution` columns.\n", |
| 85 | + "For our particular case where we want the model to learn to reason using images, we use `image` and `problem` as input and `solution` as output.\n", |
86 | 86 | "\n", |
87 | 87 | "For this educational resource, we'll only use 5% of the dataset and divide it into train and test sets to make it faster to train. In a real training, we'd use the full dataset.\n", |
88 | 88 | "\n", |
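A minimal sketch of how that subset and split could be loaded with the 🤗 `datasets` library; the exact fraction, test size, and seed below are illustrative, not necessarily the notebook's values:

```python
from datasets import load_dataset

# Take only the first 5% of the training split to keep the run lightweight
dataset = load_dataset("lmms-lab/multimodal-open-r1-8k-verified", split="train[:5%]")

# Carve a small held-out test set out of that subset
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, test_dataset = splits["train"], splits["test"]
```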
|
172 | 172 | "\n", |
173 | 173 | "We convert the dataset samples into conversation samples, including the system prompt and one image and problem description per sample, since this is how the GRPO trainer expects them.\n", |
174 | 174 | "\n", |
175 | | - "We also set `padding_side=\"left\"` to ensure that generated completions during trainig are concatenated directly after the prompt, which is essential for GRPO to correctly compare token-level probabilities between preferred and rejected responses." |
| 175 | + "We also set `padding_side=\"left\"` to ensure that generated completions during training are concatenated directly after the prompt, which is essential for GRPO to correctly compare token-level probabilities between preferred and rejected responses." |
176 | 176 | ], |
177 | 177 | "metadata": { |
178 | 178 | "id": "6isapXWue91d" |
|
495 | 495 | "\n", |
496 | 496 | "\n", |
497 | 497 | "\n", |
498 | | - "For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward model evaluates whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between `<think> </think>` tags. You can find more details [here](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py). We can simply define and implement these reward functions as generic Python functions.\n", |
| 498 | + "For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward model that evaluates whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between `<think> </think>` tags. You can find more details [here](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py). We can simply define and implement these reward functions as generic Python functions.\n", |
499 | 499 | "\n", |
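To make "generic Python functions" concrete, a simplified, hypothetical format-style reward might look like the sketch below; the actual Open R1 implementations used in this recipe are introduced next:

```python
import re

def format_reward(completions, **kwargs):
    """Reward 1.0 when the completion wraps its reasoning in <think> </think> tags, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    return [
        1.0 if re.search(r"<think>.*?</think>", content, re.DOTALL) else 0.0
        for content in contents
    ]
```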
500 | 500 | "In this case, we will utilize the following reward functions, directly extracted from the Open R1 [implementation](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py):\n", |
501 | 501 | "\n", |
|
596 | 596 | "\n", |
597 | 597 | "Next, let's configure the training parameters for GRPO. We recommend experimenting with the `max_completion_length`, `num_generations`, and `max_prompt_length` parameters.\n", |
598 | 598 | "\n", |
599 | | - "It'd be interesting to play with the `max_completion_length`, `num_generations`, and `max_prompt_length` params in order to find the best traininig combination." |
| 599 | + "It'd be interesting to play with the `max_completion_length`, `num_generations`, and `max_prompt_length` params in order to find the best training combination.\n", |
| 600 | + "\n", |
| 601 | + "The parameter selection has been adjusted to fit within the hardware limitations of a Google Colab session. To observe the full potential of reward improvements, especially in the second objective function, and to further improve the model's reasoning capabilities in a real-world scenario, a more ambitious setup would be required. This would involve larger models, an increased number of generations, and a high-quality, diverse dataset." |
600 | 602 | ], |
601 | 603 | "metadata": { |
602 | 604 | "id": "qW_3r8T1EtNg" |
|
876 | 878 | { |
877 | 879 | "cell_type": "markdown", |
878 | 880 | "source": [ |
879 | | - "Let's save the results in our account 💾" |
| 881 | + "We can review the training metrics directly in TensorBoard on the [model page]((https://huggingface.co/sergiopaniego/Qwen2.5-VL-3B-Instruct-Thinking/tensorboard). While the loss curve might look a bit off, the reward results tell a clearer story: the model steadily improves, increasing the amount of reward it receives over time.\n", |
| 882 | + "\n", |
| 883 | + "Now, let's save the results in our account 💾" |
880 | 884 | ], |
881 | 885 | "metadata": { |
882 | 886 | "id": "z7_y1x7E1JY9" |
|