Hi all,

I am fairly new to JAX and am trying to use a version of `pmap` to perform simple parallelised training, splitting a batch across GPUs, with xarray datasets. However, when monitoring performance with the JAX profiler on multiple GPUs, I see that one GPU is significantly slower than the others, which causes the other GPUs to block and wait before the gradients are combined for the weight update (see the profiler screenshot below, where the purple `ncclDevKernel_AllReduce_Sum` blocks waiting for GPU0 to finish its computation).
I am passing dummy 'zero' data to the model, so each GPU should be receiving the same data and computing the same gradients (which I've confirmed by inspecting the average gradient computed on each GPU).
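For context, my training step has roughly the shape below. This is a minimal sketch, not my actual code: the model, `loss_fn`, and the learning rate are placeholder stand-ins. The part that matters is the `pmap` over the device axis and the gradient all-reduce, which is where the `ncclDevKernel_AllReduce_Sum` kernel shows up and where the fast GPUs end up waiting on the slow one.

```python
import functools

import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    # Placeholder linear model + MSE loss; the real model is more involved.
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)


@functools.partial(jax.pmap, axis_name="devices")
def train_step(params, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # Combine gradients across GPUs; this all-reduce is the purple
    # ncclDevKernel_AllReduce_Sum in the trace, and it is where the
    # fast GPUs sit waiting for the slow one.
    grads = jax.lax.pmean(grads, axis_name="devices")
    loss = jax.lax.pmean(loss, axis_name="devices")
    # Plain SGD update with a placeholder learning rate.
    new_params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return new_params, loss


n_dev = jax.local_device_count()
params = jax.device_put_replicated(
    {"w": jnp.zeros((8, 1)), "b": jnp.zeros((1,))}, jax.local_devices()
)
# Dummy 'zero' batch, with the leading axis mapping one shard per GPU.
x = jnp.zeros((n_dev, 32, 8))
y = jnp.zeros((n_dev, 32, 1))
params, loss = train_step(params, x, y)
```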
Some things I've observed:
- I've profiled the GPUs individually (running on a single GPU without `pmap`), and single-GPU speed is always at least as fast as the slowest GPU I see in multi-GPU training (some GPUs appear faster than others when used individually, but the 'slow' GPU in different multi-GPU runs always takes roughly the same amount of time).
- I've tried varying the number of GPUs (2, 3, 4) and there always seems to be a 'slow' GPU. Before selecting GPUs I set `os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"`, so the ordering of the GPUs should be fixed (setup sketched after this list). Even so, a GPU that is 'fast' in one run can become the 'slow' GPU in another (e.g. with GPUs=[1, 2, 3], GPU 1 is slow while 2 and 3 are fast; with GPUs=[2, 3], GPU 2 is slow and 3 is fast).
- The slow GPU always shows what appears to be a JAX compilation sequence before model training starts, visible in the XLA ops row screenshotted below. This does not happen on the other GPUs (this may be expected behaviour, but I'm still learning how to read the JAX profiler's trace viewer).
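For completeness, the device setup and trace capture look roughly like this. It's a sketch rather than my real script: the `CUDA_VISIBLE_DEVICES` selection, the trace directory, and the pmapped computation under the trace are illustrative stand-ins for however the GPUs are actually chosen and for the real training loop.

```python
import os

# Must be set before JAX initialises the GPU backend for the ordering to take effect.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Restrict the run to a subset of GPUs, e.g. GPUs 2 and 3
# (illustrative; the exact selection mechanism isn't the point here).
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

import jax
import jax.numpy as jnp

print(jax.local_devices())  # should list only the selected GPUs, in PCI bus order

# Capture a trace for TensorBoard's profiler plugin; the directory and the
# pmapped computation here are just stand-ins for the real training loop.
with jax.profiler.trace("/tmp/jax-trace"):
    out = jax.pmap(lambda a: (a * 2.0).sum())(
        jnp.zeros((jax.local_device_count(), 1024))
    )
    out.block_until_ready()  # make sure the device work lands inside the trace
```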
I am quite confused as to why this speed difference exists and can't find anything in the JAX documentation suggesting it is normal behaviour. Any help would be much appreciated.