System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): RHEL 7.6
- TensorFlow version and how it was installed (source or binary): TF 2.1, (IBM WML CE 1.7.3)
- TensorFlow-Addons version and how it was installed (source or binary): 0.9.1 source install
- Python version: 3.6.10
- Is GPU used? (yes/no): yes
Describe the bug
Using LAMB under a MultiWorkerMirroredStrategy fails with this ambiguous error message:
2020-05-27 17:07:57.478805: F tensorflow/core/framework/tensor_shape.cc:345] Check failed: size >= 0 (-8648 vs. 0)
2020-05-27 17:08:01.205512: F tensorflow/core/framework/tensor_shape.cc:345] Check failed: size >= 0 (-8651 vs. 0)
I don't have a great concise code sample -- if there is a minimal MultiWorkerMirroredStrategy example somewhere, I'd be happy to try it out.
But the same code works:
(1) in a single-node, 6-GPU MirroredStrategy distribution (using LAMB)
(2) in a two-node, 12-GPU MultiWorkerMirroredStrategy, using tfa.optimizers.AdamW.
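For reference, a minimal sketch of the kind of setup described above might look like the following. This is not the actual failing code: the toy model, random data, host names, and the helper names `make_tf_config` and `run_worker` are all placeholders, and whether a model this small triggers the same check failure is untested. It assumes the TF 2.1-era `tf.distribute.experimental.MultiWorkerMirroredStrategy` API and the `LAMB` / `AdamW` optimizers from TensorFlow Addons.

```python
import json
import os


def make_tf_config(worker_hosts, task_index):
    """Build the TF_CONFIG JSON string for one worker in the cluster."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index out of range for worker_hosts")
    return json.dumps({
        "cluster": {"worker": list(worker_hosts)},
        "task": {"type": "worker", "index": task_index},
    })


def run_worker(worker_hosts, task_index, use_lamb=True):
    """Train a toy model on one worker (hypothetical stand-in for the real job)."""
    # TF_CONFIG has to be set before the strategy is created.
    os.environ["TF_CONFIG"] = make_tf_config(worker_hosts, task_index)

    # Imported here so make_tf_config can be used without TensorFlow installed.
    import numpy as np
    import tensorflow as tf
    import tensorflow_addons as tfa

    # TF 2.1 API; newer releases expose this as tf.distribute.MultiWorkerMirroredStrategy.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu", input_shape=(16,)),
            tf.keras.layers.Dense(1),
        ])
        if use_lamb:
            # The configuration reported as failing across multiple workers.
            optimizer = tfa.optimizers.LAMB(learning_rate=1e-3)
        else:
            # The configuration reported as working across multiple workers.
            optimizer = tfa.optimizers.AdamW(weight_decay=1e-4, learning_rate=1e-3)
        model.compile(optimizer=optimizer, loss="mse")

    # Placeholder data: random inputs and targets.
    x = np.random.rand(1024, 16).astype("float32")
    y = np.random.rand(1024, 1).astype("float32")
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(64).repeat()
    model.fit(dataset, epochs=1, steps_per_epoch=10)
```

Each node would call `run_worker` with the same host list and its own index, e.g. `run_worker(["nodeA:12345", "nodeB:12345"], 0)` on the first node and index `1` on the second (host names and ports are made up). Switching `use_lamb` between `True` and `False` is meant to mirror the LAMB vs. AdamW comparison in (1) and (2).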
Other info / logs
This is on a ppc64le system, run via an LSF queue.