|
82 | 82 | "\n", |
83 | 83 | "We leverage [lmms-lab/multimodal-open-r1-8k-verified](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) for this recipe. This dataset contains 8k multimodal RL training examples focused on math reasoning. This data was created using GPT4o and includes `image`, `problem`, `solution`, `original question` and `original answer` for each sample. It was created in [this project](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal).\n", |
84 | 84 | "\n", |
85 | | - "For our particular case where we want the model to learn to reason using images, we use as input `image` and `problem` and as output `solution` columns.\n", |
| 85 | + "For our particular case where we want the model to learn to reason using images, we use `image` and `problem` as input and `solution` as output.\n", |
86 | 86 | "\n", |
87 | 87 | "For this educational resource, we'll only use 5% of the dataset and divide it into train and test sets to make it faster to train. In a real training, we'd use the full dataset.\n", |
88 | 88 | "\n", |
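A minimal sketch of how that subset and split could be loaded with the 🤗 `datasets` library; the exact fraction, test size, and seed below are illustrative, not necessarily the notebook's values:

```python
from datasets import load_dataset

# Take only the first 5% of the training split to keep the run lightweight
dataset = load_dataset("lmms-lab/multimodal-open-r1-8k-verified", split="train[:5%]")

# Carve a small held-out test set out of that subset
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, test_dataset = splits["train"], splits["test"]
```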
|
172 | 172 | "\n", |
173 | 173 | "We convert the dataset samples into conversation samples, including the system prompt and one image and problem description per sample, since this is how the GRPO trainer expects them.\n", |
174 | 174 | "\n", |
175 | | - "We also set `padding_side=\"left\"` to ensure that generated completions during trainig are concatenated directly after the prompt, which is essential for GRPO to correctly compare token-level probabilities between preferred and rejected responses." |
| 175 | + "We also set `padding_side=\"left\"` to ensure that generated completions during training are concatenated directly after the prompt, which is essential for GRPO to correctly compare token-level probabilities between preferred and rejected responses." |
176 | 176 | ], |
177 | 177 | "metadata": { |
178 | 178 | "id": "6isapXWue91d" |
|
495 | 495 | "\n", |
496 | 496 | "\n", |
497 | 497 | "\n", |
498 | | - "For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward model evaluates whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between `<think> </think>` tags. You can find more details [here](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py). We can simply define and implement these reward functions as generic Python functions.\n", |
| 498 | + "For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward model that evaluates whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between `<think> </think>` tags. You can find more details [here](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py). We can simply define and implement these reward functions as generic Python functions.\n", |
499 | 499 | "\n", |
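To make "generic Python functions" concrete, a simplified, hypothetical format-style reward might look like the sketch below; the actual Open R1 implementations used in this recipe are introduced next:

```python
import re

def format_reward(completions, **kwargs):
    """Reward 1.0 when the completion wraps its reasoning in <think> </think> tags, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    return [
        1.0 if re.search(r"<think>.*?</think>", content, re.DOTALL) else 0.0
        for content in contents
    ]
```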
500 | 500 | "In this case, we will utilize the following reward functions, directly extracted from the Open R1 [implementation](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py):\n", |
501 | 501 | "\n", |
|
596 | 596 | "\n", |
597 | 597 | "Next, let's configure the training parameters for GRPO. We recommend experimenting with the `max_completion_length`, `num_generations`, and `max_prompt_length` parameters.\n", |
598 | 598 | "\n", |
599 | | - "It'd be interesting to play with the `max_completion_length`, `num_generations`, and `max_prompt_length` params in order to find the best traininig combination." |
| 599 | + "It'd be interesting to play with the `max_completion_length`, `num_generations`, and `max_prompt_length` params in order to find the best training combination.\n", |
| 600 | + "\n", |
| 601 | + "The parameter selection has been adjusted to fit within the hardware limitations of a Google Colab session. To observe the full potential of reward improvements, especially in the second objective function, and to further improve the model's reasoning capabilities in a real-world scenario, a more ambitious setup would be required. This would involve larger models, an increased number of generations, and a high-quality, diverse dataset." |
600 | 602 | ], |
601 | 603 | "metadata": { |
602 | 604 | "id": "qW_3r8T1EtNg" |
|
876 | 878 | { |
877 | 879 | "cell_type": "markdown", |
878 | 880 | "source": [ |
879 | | - "Let's save the results in our account 💾" |
| 881 | + "We can review the training metrics directly in TensorBoard on the [model page]((https://huggingface.co/sergiopaniego/Qwen2.5-VL-3B-Instruct-Thinking/tensorboard). While the loss curve might look a bit off, the reward results tell a clearer story: the model steadily improves, increasing the amount of reward it receives over time.\n", |
| 882 | + "\n", |
| 883 | + "Now, let's save the results in our account 💾" |
880 | 884 | ], |
881 | 885 | "metadata": { |
882 | 886 | "id": "z7_y1x7E1JY9" |
|