@keyan keyan commented Oct 9, 2025

What does this PR do?

Rough initial transfer of internal monarch integration.

Cleanup items:

  • Clean up the many TODOs; some are more complex.
  • Internal infra usage such as create_mast_proc_mesh needs to be removed.
  • The PPO trainer code was copy/pasted from the Ray trainer; it should be refactored under a shared parent class to remove duplicated logic.
  • PPO util methods such as apply_kl_penalty should be moved to a shared util module.

Test

Experimental results are available from post-training Qwen-2.5-7B on H200 GPUs using Megatron-LM.

Experimental data pending wider release.

API and Usage Example

The Monarch integration adds a new entry point:

python3 -m verl.trainer.main_ppo_monarch \
    --config-path=config \
    --config-name='ppo_megatron_trainer.yaml' \
    ...

Design & Code Changes

TODO

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant integration with PyTorch Monarch, including a new PPO trainer and foundational classes for Monarch workers. The changes are extensive and, as noted in the description, are a work in progress. My review focuses on critical and high-severity issues that could cause runtime failures or limit the code's portability. I've identified several instances of hardcoded values that tie the implementation to specific hardware setups, a couple of bugs that would lead to crashes, a missing await in an async function, and a Python version compatibility issue. I have not commented on the cleanup items already listed in the pull request description.

Comment on lines +165 to +166
local_world_size = 8
local_rank = rank % local_world_size

critical

The local_world_size is hardcoded to 8. This assumes that every node has 8 GPUs, which makes the code brittle and not portable to different hardware configurations. This value should be derived from the ProcMesh or environment configuration rather than being hardcoded.

Similar hardcoded values are found elsewhere in this file:

  • Line 149: super().__init__([4], 1) in MonarchResourcePool
  • Line 382-383: rank // 8 and rank % 8 in _execute_one_rank

These should all be parameterized or determined dynamically to ensure portability across different hardware setups.
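
A minimal sketch of deriving the per-node size at runtime instead of hardcoding 8. The lookup order and env var name here are illustrative assumptions; in this codebase the authoritative source should be the ProcMesh / allocation spec:

```python
import os

def get_local_world_size(default: int = 8) -> int:
    """Derive GPUs-per-node at runtime instead of hardcoding 8.

    Illustrative fallback chain: explicit env var (set by torchrun-style
    launchers), then a torch probe, then a last-resort default.
    """
    env_val = os.environ.get("LOCAL_WORLD_SIZE")
    if env_val is not None:
        return int(env_val)
    try:
        import torch  # optional dependency; guarded so the sketch stays runnable
        if torch.cuda.is_available():
            return torch.cuda.device_count()
    except ImportError:
        pass
    return default

def local_rank_of(rank: int, local_world_size: int) -> int:
    # Replaces the hardcoded `rank % 8` pattern.
    return rank % local_world_size
```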

Comment on lines +104 to +105
num_hosts: int = 16, # TODO: get the task_group size from MAST
num_gpus: int = 8,

critical

The num_hosts and num_gpus are hardcoded. This makes the function not reusable for different cluster configurations. These should be passed as arguments or read from a configuration to make the code more flexible and portable.
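
One way to address this, sketched with a hypothetical config shape (field and key names are assumptions; per the TODO in the code, the real values should come from the MAST task group):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AllocShape:
    """Hypothetical container for the allocation shape; names are illustrative."""
    num_hosts: int
    num_gpus_per_host: int

def alloc_shape_from_config(cfg: dict) -> AllocShape:
    # Read the cluster shape from configuration rather than hardcoding 16x8.
    # Defaults of 1 keep the sketch safe on a single machine.
    return AllocShape(
        num_hosts=int(cfg.get("num_hosts", 1)),
        num_gpus_per_host=int(cfg.get("num_gpus_per_host", 1)),
    )
```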

gpus=num_gpus,
)

alloc = allocator.allocate(spec)

critical

The create_mast_proc_mesh function is async, which suggests that allocator.allocate might be a coroutine. If so, it should be awaited: alloc = await allocator.allocate(spec). Otherwise, it will not execute correctly and will likely result in a runtime error.

Suggested change
alloc = allocator.allocate(spec)
alloc = await allocator.allocate(spec)

"val_before_train", True
):
val_metrics = self._validate()
assert val_metrics, f"{val_metrics=}"

critical

The _validate method can return an empty dictionary (e.g., on line 813), which will cause this assertion to fail and crash the training process. The assertion should be removed or the logic should be changed to handle an empty val_metrics dictionary gracefully, for example by skipping logging if it's empty.
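
A sketch of the graceful alternative (the function and logger names are assumptions, not the trainer's actual API):

```python
def log_val_metrics_if_any(val_metrics, log_fn):
    """Skip logging when validation produced no metrics instead of asserting.

    Returns True if metrics were logged, False when the dict is empty or None.
    """
    if not val_metrics:
        return False
    log_fn(val_metrics)
    return True
```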

Comment on lines +1424 to +1431
if self.config.reward_model.launch_reward_fn_async:
future_reward = compute_reward_async.remote(
batch, self.config, self.tokenizer
)
else:
reward_tensor, reward_extra_infos_dict = compute_reward(
batch, self.reward_fn
)

critical

When self.config.reward_model.launch_reward_fn_async is true, future_reward is created but never awaited. The code then proceeds to use reward_tensor and reward_extra_infos_dict, which are only defined in the else block. This will lead to a NameError. The async path seems incomplete and will crash. Although a NotImplementedError is raised later on line 1502 for this path, the crash will happen before that.
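
The shape of the fix, sketched with plain asyncio standing in for the Ray-style `.remote()` future (the compute functions here are stubs, not the real `compute_reward` APIs): both branches must bind the same names, and the async result must be resolved before anything downstream reads it.

```python
import asyncio

async def compute_reward_async_stub(batch):
    # Stand-in for the real asynchronous reward computation.
    return sum(batch), {"path": "async"}

def compute_reward_stub(batch):
    # Stand-in for the synchronous compute_reward.
    return sum(batch), {"path": "sync"}

async def get_reward(batch, launch_async: bool):
    # Both branches define reward_tensor and reward_extra_infos_dict, so no
    # NameError; the async path is resolved (await here, ray.get in Ray)
    # before its result is consumed.
    if launch_async:
        reward_tensor, reward_extra_infos_dict = await compute_reward_async_stub(batch)
    else:
        reward_tensor, reward_extra_infos_dict = compute_reward_stub(batch)
    return reward_tensor, reward_extra_infos_dict
```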

Comment on lines +523 to +524
# only valid when Python >= 3.9
original_method_name = method_name.removeprefix(prefix)

high

The removeprefix string method was introduced in Python 3.9. While the comment acknowledges this, it's better to use a backward-compatible alternative to avoid runtime errors on older Python versions. You can replace it with string slicing.

Suggested change
# only valid when Python >= 3.9
original_method_name = method_name.removeprefix(prefix)
original_method_name = method_name[len(prefix):]
