Resolve DDP error by disabling gradients for lm_head #351
Fixes #164
Description
This PR addresses a `RuntimeError` that occurs during multi-GPU fine-tuning with DistributedDataParallel (DDP).

Problem:
When the training script runs on multiple GPUs, DDP raises a `RuntimeError` during gradient synchronization. DDP expects every parameter with `requires_grad=True` to participate in the backward pass. However, during our fine-tuning process the `language_model.lm_head` is not used in the loss calculation, so its parameters never receive gradients. DDP waits for these gradients indefinitely, causing the synchronization to fail and throw the error. A minimal illustration of this failure mode is sketched below.
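For context, here is a minimal, self-contained sketch of the failure mode. The toy model below is purely illustrative (it is not the actual EAGLE code): it has a trainable head that never contributes to the loss, which is exactly the situation DDP cannot handle with its default settings.

```python
import torch
import torch.nn as nn

# Hypothetical model for illustration: a trainable head that never
# contributes to the loss.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)
        self.lm_head = nn.Linear(16, 100)  # requires_grad=True, but unused below

    def forward(self, x):
        # Only the backbone output reaches the loss; lm_head is never called,
        # so its parameters never receive gradients in backward().
        return self.backbone(x)

model = ToyModel()
loss = model(torch.randn(4, 16)).sum()
loss.backward()
print(model.lm_head.weight.grad)  # None: no gradient was ever produced

# When the same model is wrapped in DistributedDataParallel (with the default
# find_unused_parameters=False), the reducer expects a gradient for every
# parameter that requires grad. Because lm_head never produces one, gradient
# synchronization cannot complete and DDP raises a RuntimeError.
# ddp_model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[rank])
```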
Solution:
The fix is to explicitly disable gradient calculation for the `lm_head` by calling `self.eagle_model.language_model.lm_head.requires_grad_(False)`. This tells DDP not to expect gradients for these parameters, which resolves the synchronization issue (see the sketch below).

This approach is more efficient than setting `find_unused_parameters=True` in the DDP wrapper, because it avoids the overhead of searching the autograd graph for unused parameters on every iteration.
Changes proposed in this pull request:
- In `eagle_backbone.py`, set `requires_grad=False` for the `language_model.lm_head` to prevent DDP errors during multi-GPU training.
Before submitting
- […] section of the CONTRIBUTING docs.