[WIP] [single_controller] feat: PyTorch Monarch integration #3713
base: main
Conversation
Code Review
This pull request introduces a significant integration with PyTorch Monarch, including a new PPO trainer and foundational classes for Monarch workers. The changes are extensive and, as noted in the description, are a work in progress. My review focuses on critical and high-severity issues that could cause runtime failures or limit the code's portability. I've identified several instances of hardcoded values that tie the implementation to specific hardware setups, a couple of bugs that would lead to crashes, a missing await in an async function, and a Python version compatibility issue. I have not commented on the cleanup items already listed in the pull request description.
```python
local_world_size = 8
local_rank = rank % local_world_size
```
The local_world_size is hardcoded to 8. This assumes that every node has 8 GPUs, which makes the code brittle and not portable to different hardware configurations. This value should be derived from the ProcMesh or environment configuration rather than being hardcoded.
Similar hardcoded values are found elsewhere in this file:
- Line 149: `super().__init__([4], 1)` in `MonarchResourcePool`
- Lines 382-383: `rank // 8` and `rank % 8` in `_execute_one_rank`

These should all be parameterized or determined dynamically to ensure portability across different hardware setups.
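One way to avoid the hardcoded value is to derive it at runtime. A minimal sketch, assuming the standard `LOCAL_WORLD_SIZE` environment variable set by `torchrun`-style launchers; the helper name and fallback order here are illustrative, not verl's actual API:

```python
import os


def get_local_world_size(default: int = 8) -> int:
    """Derive GPUs-per-node instead of hardcoding 8.

    Prefers the LOCAL_WORLD_SIZE env var (set by torchrun and most
    launchers), then falls back to counting visible CUDA devices,
    then to a caller-supplied default.
    """
    if "LOCAL_WORLD_SIZE" in os.environ:
        return int(os.environ["LOCAL_WORLD_SIZE"])
    try:
        import torch

        if torch.cuda.is_available():
            return torch.cuda.device_count()
    except ImportError:
        pass
    return default


local_world_size = get_local_world_size()
rank = 11  # example global rank
local_rank = rank % local_world_size   # position within the node
node_rank = rank // local_world_size   # which node this rank lives on
```

The same helper would replace the `rank // 8` / `rank % 8` arithmetic in `_execute_one_rank`.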
```python
num_hosts: int = 16,  # TODO: get the task_group size from MAST
num_gpus: int = 8,
```
```python
    gpus=num_gpus,
)

alloc = allocator.allocate(spec)
```
The create_mast_proc_mesh function is async, which suggests that allocator.allocate may be a coroutine. If so, it must be awaited: alloc = await allocator.allocate(spec). Without the await, alloc is a coroutine object that never executes, and any downstream code expecting an allocation will fail at runtime.
```diff
- alloc = allocator.allocate(spec)
+ alloc = await allocator.allocate(spec)
```
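A self-contained illustration of the failure mode (the `allocate` stand-in below is hypothetical, not Monarch's allocator): calling an `async def` function without `await` returns a coroutine object rather than the result.

```python
import asyncio


async def allocate(spec):
    # stand-in for an async allocator.allocate; `spec` is hypothetical
    await asyncio.sleep(0)
    return {"spec": spec, "status": "allocated"}


# without await: a coroutine object, not an allocation
pending = allocate("gpu_spec")
print(type(pending).__name__)  # "coroutine"
pending.close()  # avoid a "never awaited" warning

# with await (here via asyncio.run): the actual result
result = asyncio.run(allocate("gpu_spec"))
print(result["status"])  # "allocated"
```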
| "val_before_train", True | ||
| ): | ||
| val_metrics = self._validate() | ||
| assert val_metrics, f"{val_metrics=}" |
The _validate method can return an empty dictionary (e.g., on line 813), which will cause this assertion to fail and crash the training process. The assertion should be removed or the logic should be changed to handle an empty val_metrics dictionary gracefully, for example by skipping logging if it's empty.
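A minimal sketch of the graceful alternative: guard the logging call instead of asserting. The `logger_fn` and `step` parameters are hypothetical stand-ins for the trainer's tracking logger and global step counter.

```python
def log_validation_metrics(val_metrics, logger_fn, step):
    """Skip logging when validation produced no metrics, instead of asserting."""
    if not val_metrics:  # _validate may legitimately return {}
        return False
    logger_fn(data=val_metrics, step=step)
    return True


logged = []
log_validation_metrics({}, lambda data, step: logged.append(data), step=0)
log_validation_metrics({"val/reward": 0.5}, lambda data, step: logged.append(data), step=0)
# logged now contains only the non-empty metrics dict
```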
```python
if self.config.reward_model.launch_reward_fn_async:
    future_reward = compute_reward_async.remote(
        batch, self.config, self.tokenizer
    )
else:
    reward_tensor, reward_extra_infos_dict = compute_reward(
        batch, self.reward_fn
    )
```
When self.config.reward_model.launch_reward_fn_async is true, future_reward is created but never awaited. The code then proceeds to use reward_tensor and reward_extra_infos_dict, which are only defined in the else block. This will lead to a NameError. The async path seems incomplete and will crash. Although a NotImplementedError is raised later on line 1502 for this path, the crash will happen before that.
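The crash pattern can be reproduced in isolation (all names below are simplified stand-ins, not verl's actual code): a variable bound only in the `else` branch is unbound when the `if` branch runs, and Python raises `UnboundLocalError` (a subclass of `NameError`).

```python
def compute_reward(batch):
    # stand-in for the synchronous reward path
    return [1.0] * len(batch), {}


def process_batch(batch, launch_async):
    if launch_async:
        future_reward = object()  # stand-in for compute_reward_async.remote(...)
    else:
        reward_tensor, reward_extra_infos_dict = compute_reward(batch)
    # reward_tensor is unbound when launch_async is True
    return reward_tensor


process_batch([1, 2], launch_async=False)  # fine
try:
    process_batch([1, 2], launch_async=True)
except NameError:
    pass  # crashes exactly as described in the review
```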
```python
# only valid when Python >= 3.9
original_method_name = method_name.removeprefix(prefix)
```
The removeprefix string method was introduced in Python 3.9. While the comment acknowledges this, it's better to use a backward-compatible alternative to avoid runtime errors on older Python versions. You can replace it with string slicing.
```diff
- # only valid when Python >= 3.9
- original_method_name = method_name.removeprefix(prefix)
+ original_method_name = method_name[len(prefix):]
```
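One caveat with the bare slice: `str.removeprefix` strips the prefix only when it is actually present, while `s[len(prefix):]` truncates unconditionally. If callers do not guarantee the prefix, a small helper preserves the 3.9 semantics on older Pythons (the helper name here is illustrative):

```python
def remove_prefix(s: str, prefix: str) -> str:
    """Backward-compatible str.removeprefix for Python < 3.9.

    Unlike bare slicing, this strips the prefix only when it is
    actually present, matching str.removeprefix semantics.
    """
    return s[len(prefix):] if s.startswith(prefix) else s


# matches removeprefix when the prefix is present...
assert remove_prefix("monarch_train_step", "monarch_") == "train_step"
# ...and, unlike s[len(prefix):], leaves the string intact otherwise
assert remove_prefix("train_step", "monarch_") == "train_step"
```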
What does this PR do?
Rough initial transfer of internal monarch integration.
Cleanup items:
- `create_mast_proc_mesh` needs to be removed
- `apply_kl_penalty` should be moved to a shared util module

Test
Experimental results are available from post-training Qwen-2.5-7B on H200 GPUs using Megatron-LM; the experimental data is pending wider release.
API and Usage Example
Design & Code Changes
TODO
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI via the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)