verl/trainer/ppo/ray_trainer.py (4 changes: 1 addition & 3 deletions)

@@ -1105,9 +1105,7 @@ def fit(self):
    batch = batch.union(reward_tensor)

    if self.config.reward_model.launch_reward_fn_async:
-        future_reward = compute_reward_async.remote(
-            data=batch, config=self.config, tokenizer=self.tokenizer
-        )
+        future_reward = compute_reward_async.remote(data=batch, reward_fn=self.reward_fn)
Contributor review comment (critical):
This revert re-introduces a critical serialization issue. The self.reward_fn object may not be serializable by Ray, especially when sandbox_fusion is enabled, since it can contain a multiprocessing.Semaphore (see verl/trainer/ppo/reward.py:135). Passing a non-serializable object to a remote function will crash training with a SerializationError.

The previous implementation, although using a deprecated pattern, correctly avoided this issue by reconstructing the reward_fn inside the remote worker. Reverting this fix without addressing the underlying serializability of self.reward_fn is a regression.

A more robust solution would be to ensure reward_fn is always serializable, for example by replacing the multiprocessing.Semaphore with a concurrency primitive that Ray can handle, such as a semaphore implemented as a Ray actor, or by creating the semaphore inside the worker. In the absence of that fix, however, the previous code is safer.

Suggested change:
-        future_reward = compute_reward_async.remote(data=batch, reward_fn=self.reward_fn)
+        future_reward = compute_reward_async.remote(
+            data=batch, config=self.config, tokenizer=self.tokenizer
+        )
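
For reference, the failure mode the comment describes can be reproduced in isolation. The sketch below uses hypothetical names (SandboxRewardFn, compute_reward_task) rather than the repository's code; the point is only that a callable holding a multiprocessing.Semaphore cannot be passed as a Ray task argument.

```python
# Minimal sketch (hypothetical names, not code from this PR) of the failure mode
# described in the comment: an object holding a multiprocessing.Semaphore cannot
# be pickled by Ray, so passing it as a task argument fails before the task runs.
import multiprocessing

import ray


class SandboxRewardFn:
    """Stand-in for a reward manager that throttles sandbox calls with a semaphore."""

    def __init__(self, max_concurrent: int = 4):
        self._sem = multiprocessing.Semaphore(max_concurrent)

    def __call__(self, data):
        with self._sem:
            return 0.0  # placeholder score


@ray.remote
def compute_reward_task(data, reward_fn):
    return reward_fn(data)


if __name__ == "__main__":
    ray.init()
    reward_fn = SandboxRewardFn()
    try:
        # Ray serializes task arguments here; the semaphore makes this fail,
        # since multiprocessing semaphores can only be shared via process inheritance.
        ray.get(compute_reward_task.remote({"prompt": "example"}, reward_fn))
    except Exception as exc:
        print(f"serialization failed as expected: {exc!r}")
```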

    else:
        reward_tensor, reward_extra_infos_dict = compute_reward(batch, self.reward_fn)
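
The alternative the comment prefers, reconstructing the reward function inside the remote worker, can be sketched as follows. This assumes a loader along the lines of load_reward_manager in verl/trainer/ppo/reward.py and is not the repository's exact code; the key property is that only picklable inputs (batch, config, tokenizer) cross the Ray boundary, and any semaphore is created in the worker process.

```python
# Sketch of the "reconstruct reward_fn inside the remote worker" pattern
# (assumed helper names; not the repository's exact implementation).
import ray


@ray.remote
def compute_reward_async(data, config, tokenizer):
    # Imports and construction happen inside the Ray worker, so any
    # non-picklable state (e.g. a multiprocessing.Semaphore used for
    # sandbox_fusion throttling) is created locally instead of being shipped.
    from verl.trainer.ppo.reward import compute_reward, load_reward_manager  # assumed API

    reward_fn = load_reward_manager(config, tokenizer, num_examine=0)
    return compute_reward(data, reward_fn)
```

The trade-off is rebuilding the reward manager on each call; the (data, config, tokenizer) signature matches the previous implementation that the suggested change restores.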
