This repository was archived by the owner on Dec 9, 2024. It is now read-only.

VariableMgrDistributedReplicated decreases the speed of convergence #115

Open
@Sampson1107

Description

@reedwm

Hi, I am having trouble with the following code:
""'
for i, (g, v) in enumerate(grads):
apply_gradient_op = opt.apply_gradients([(g, v)])
barrier = self.benchmark_cnn.add_sync_queues_and_barrier(
'replicate_variable_%s' % i, [apply_gradient_op])
"""
Here, the servers run `apply_gradient_op` one by one rather than averaging the servers' gradients. Because the optimizer is Momentum, this produces a different update than applying the averaged gradient once: Momentum updates its accumulator on every `apply_gradients` call, so N sequential calls with gradients g_1, ..., g_N are not equivalent to a single call with their sum. In my application, this slows convergence and increases training time.
Is there a way to add an op that sums all the servers' gradients and returns the total, so the combined gradient can be applied in a single update? Thanks a lot!
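
For illustration, what I have in mind is something like the sketch below, which sums the per-worker gradients with `tf.add_n` before a single `apply_gradients` call (similar in spirit to the `average_gradients` helper in the TF-1.x multi-GPU tutorials). The names `tower_grads` and `sum_gradients` here are hypothetical, not from tf_cnn_benchmarks:

```python
import tensorflow as tf

def sum_gradients(tower_grads):
  """Sums each variable's gradient across all workers.

  Args:
    tower_grads: list with one entry per worker, where each entry is a
      list of (gradient, variable) tuples from opt.compute_gradients().
      Assumes every worker lists its variables in the same order.

  Returns:
    A single list of (summed_gradient, variable) tuples.
  """
  summed = []
  for grads_and_vars in zip(*tower_grads):
    # grads_and_vars pairs up the same variable across all workers:
    # ((g_0, v), (g_1, v), ..., (g_N, v)).
    grads = [g for g, _ in grads_and_vars]
    var = grads_and_vars[0][1]
    summed.append((tf.add_n(grads), var))
  return summed

# Apply the combined gradient in one op, so Momentum's accumulator is
# updated once per step instead of once per worker:
# train_op = opt.apply_gradients(sum_gradients(tower_grads))
```

Dividing `tf.add_n(grads)` by the number of workers would give the averaged gradient instead of the sum.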
