
LAMB optimizer fails in MultiWorkerMirroredStrategy #1896

Open
@pstjohn

Description


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): RHEL 7.6
  • TensorFlow version and how it was installed (source or binary): TF 2.1 (IBM WML CE 1.7.3)
  • TensorFlow-Addons version and how it was installed (source or binary): 0.9.1 source install
  • Python version: 3.6.10
  • Is GPU used? (yes/no): yes

Describe the bug
Using LAMB in a MultiWorkerMirroredStrategy fails with this ambiguous error message:

2020-05-27 17:07:57.478805: F tensorflow/core/framework/tensor_shape.cc:345] Check failed: size >= 0 (-8648 vs. 0)
2020-05-27 17:08:01.205512: F tensorflow/core/framework/tensor_shape.cc:345] Check failed: size >= 0 (-8651 vs. 0)

I don't have a great concise code sample -- if there was a minimal MultiWorkerMirroredStrategy example somewhere, I'd be happy to try it out. A rough sketch of the kind of setup I'm running is shown below.

But the same code works:
(1) in a single-node, 6-GPU MirroredStrategy distribution (using LAMB), and
(2) in a two-node, 12-GPU MultiWorkerMirroredStrategy using tfa.optimizers.AdamW.
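
To make this more concrete, here is a minimal sketch of the kind of training code that hits the failure. It is not the actual script: the model, dataset, hostnames, and ports are placeholders, and TF_CONFIG has to be set appropriately on each worker before launch.

```python
# Minimal sketch, NOT the real training script: model, dataset,
# hostnames, and ports below are placeholders.
import json  # only needed if TF_CONFIG is set inside the script
import os

import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa

# Each worker needs its own TF_CONFIG before the strategy is created,
# e.g. (hypothetical hostnames/ports; "index" differs per worker):
# os.environ["TF_CONFIG"] = json.dumps({
#     "cluster": {"worker": ["node1:12345", "node2:12345"]},
#     "task": {"type": "worker", "index": 0},
# })

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    # Swapping LAMB for tfa.optimizers.AdamW(weight_decay=1e-4) runs fine
    # in the same two-node setup.
    model.compile(optimizer=tfa.optimizers.LAMB(learning_rate=1e-3), loss="mse")

# Toy data just to drive the fit loop.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(64)

model.fit(dataset, epochs=2)
```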

Other info / logs

This is on a ppc64le system, run via an LSF queue.
