This repository was archived by the owner on Dec 9, 2024. It is now read-only.
VariableMgrDistributedReplicated decreases the speed of convergence #115
Open
Description
Hi, I am running into trouble with the following code.
""'
for i, (g, v) in enumerate(grads):
apply_gradient_op = opt.apply_gradients([(g, v)])
barrier = self.benchmark_cnn.add_sync_queues_and_barrier(
'replicate_variable_%s' % i, [apply_gradient_op])
"""
Here, each worker runs apply_gradient_op with its own gradients, one after another, instead of first averaging the gradients across workers. When the optimizer is Momentum, the resulting update differs from the update the averaged gradient would produce. In my application, this slows convergence and increases training time.
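To make the difference concrete, here is my own two-worker sketch of the Momentum update (accumulator a, momentum μ, learning rate η); it is an illustration, not code from the benchmark:

```latex
% Momentum: a_t = \mu a_{t-1} + g_t, \quad \Delta\theta = -\eta\, a_t
% Two per-worker gradients g_1, g_2 applied sequentially (current behaviour):
a_1 = \mu a_0 + g_1, \qquad a_2 = \mu a_1 + g_2
\Delta\theta_{\mathrm{seq}} = -\eta (a_1 + a_2)
                            = -\eta \big( (\mu + \mu^2)\, a_0 + (1 + \mu)\, g_1 + g_2 \big)
% The averaged gradient applied once (what I would like):
a_1' = \mu a_0 + \tfrac{1}{2}(g_1 + g_2), \qquad
\Delta\theta_{\mathrm{avg}} = -\eta \big( \mu a_0 + \tfrac{1}{2}(g_1 + g_2) \big)
```

The two parameter updates are not equal, and the momentum accumulator also ends up in a different state for the following step.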
Is there a way to build an op that adds up the gradients from all workers and returns their sum, so that the averaged gradient can be applied once? Thanks a lot!
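For reference, here is a minimal sketch of the kind of aggregation I mean, assuming all per-worker (or per-tower) gradient lists are visible in one graph; the names average_gradients and all_worker_grads are illustrative, not part of the benchmark code. For between-graph replication, something like tf.train.SyncReplicasOptimizer would be needed to aggregate across workers instead.

```python
import tensorflow as tf

def average_gradients(tower_grads):
  """Average a list of per-worker (grad, var) lists into one (grad, var) list.

  tower_grads looks like:
    [[(g0_v0, v0), (g0_v1, v1), ...],   # worker 0
     [(g1_v0, v0), (g1_v1, v1), ...],   # worker 1
     ...]
  """
  averaged = []
  for grads_and_var in zip(*tower_grads):
    grads = [g for g, _ in grads_and_var]
    # Sum this variable's gradients across all workers, then divide by the count.
    summed = tf.add_n(grads)
    avg = summed / float(len(grads))
    _, var = grads_and_var[0]
    averaged.append((avg, var))
  return averaged

# Illustrative usage: apply the averaged gradients once instead of once per worker.
# avg_grads = average_gradients(all_worker_grads)
# apply_gradient_op = opt.apply_gradients(avg_grads)
```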