Conversation

@www-Ye www-Ye commented Sep 5, 2025

Fixes #164

Description

This PR addresses a RuntimeError that occurs during multi-GPU fine-tuning using DistributedDataParallel (DDP).

Problem:

When running the training script on multiple GPUs, the following error is raised:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss.
...
Parameter indices which did not receive grad for rank 7: 134

This happens because DDP expects every parameter with requires_grad=True to participate in the backward pass. During our fine-tuning, however, language_model.lm_head is not used in the loss calculation, so its parameters never receive gradients. DDP's reducer therefore cannot complete its gradient synchronization and raises the error above.
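For context, the failure can be reproduced outside this repository with a small toy module. The sketch below (a hypothetical ToyModel, two CPU processes with the gloo backend; none of these names come from the Eagle code) keeps a trainable lm_head out of the loss, and DDP raises the same RuntimeError on the second iteration:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(8, 8)
        self.lm_head = torch.nn.Linear(8, 8)  # trainable, but never used in forward

    def forward(self, x):
        return self.backbone(x)  # lm_head does not contribute to the loss

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(ToyModel())
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # lm_head never receives a gradient, so DDP's reducer cannot finish its
    # bucket reduction; the RuntimeError is raised on the second iteration.
    for _ in range(2):
        opt.zero_grad()
        loss = model(torch.randn(4, 8)).sum()
        loss.backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)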

Solution:

The fix is to explicitly disable gradients for the lm_head by calling self.eagle_model.language_model.lm_head.requires_grad_(False). DDP then no longer expects gradients for these parameters, which resolves the synchronization issue.

This approach is more efficient than setting find_unused_parameters=True in the DDP wrapper, since that option adds the overhead of traversing the autograd graph to detect unused parameters on every iteration.
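A minimal, self-contained sketch of the effect, using toy stand-ins for the real classes in eagle_backbone.py (only the attribute path language_model.lm_head is taken from this PR; everything else is illustrative):

import torch
from torch import nn

# Toy stand-ins mirroring the attribute path used in the fix; the real
# modules live in eagle_backbone.py.
class LanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lm_head = nn.Linear(16, 32)

class EagleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.language_model = LanguageModel()

eagle_model = EagleModel()

# The one-line fix: freeze the unused head before wrapping the model in DDP.
# DDP only registers parameters with requires_grad=True, so the reducer no
# longer waits for gradients that will never arrive.
eagle_model.language_model.lm_head.requires_grad_(False)

assert all(not p.requires_grad for p in eagle_model.language_model.lm_head.parameters())

Because the freeze happens before the DDP wrapper is constructed, the reducer never tracks these parameters at all.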

Changes proposed in this pull request:

  • In eagle_backbone.py, set requires_grad=False for the language_model.lm_head to prevent DDP errors during multi-GPU training.

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.
