Potential hang on AMD GPUs with SDMA #2070
MartinGirard
started this conversation in
General
Replies: 1 comment
-
We have not seen this problem on Frontier.
-
I've been running simulations on MI300A GPUs. Occasionally, a simulation simply hangs with no error, sometimes after tens of millions of MD steps. Attaching a debugger shows that it is stuck in a synchronization call, i.e. waiting for the device to finish.
My compute center has advised me to set HSA_ENABLE_SDMA=0. The small tests I ran seem to indicate that this solves the problem, but since the bug is stochastic, it is very hard to diagnose or to confirm that the fix really works.
Has this issue been seen elsewhere?
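For anyone who wants to try the workaround described above, here is a minimal sketch in Python. It launches a job with HSA_ENABLE_SDMA=0 in its environment and uses a wall-clock timeout as a crude hang detector. The helper names (`sdma_disabled_env`, `run_with_watchdog`) and the timeout value are hypothetical and chosen only for illustration. A timeout cannot tell a real hang from a slow run, so treat this as a starting point, not a diagnostic.

```python
import os
import subprocess
import sys


def sdma_disabled_env(base_env=None):
    """Return a copy of the environment with HSA_ENABLE_SDMA set to "0"."""
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_ENABLE_SDMA"] = "0"
    return env


def run_with_watchdog(cmd, timeout_s, env=None):
    """Run cmd and return its exit code, or None if it ran longer than timeout_s.

    subprocess.run kills the child process when the timeout expires.
    """
    try:
        completed = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed): python run_job.py <simulation command and arguments>
    # The 24-hour limit is an arbitrary placeholder; pick one above the
    # expected runtime of the job.
    code = run_with_watchdog(sys.argv[1:], timeout_s=24 * 3600,
                             env=sdma_disabled_env())
    if code is None:
        print("Job exceeded the time limit; possible hang.", file=sys.stderr)
        sys.exit(1)
    sys.exit(code)
```

Setting the variable in the launching process, rather than inside the simulation script, ensures it is in place before the GPU runtime initializes. In a batch script, the equivalent is exporting HSA_ENABLE_SDMA=0 before the launch line.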