Fix multi-GPU lit issue by using first detected agent #2031
base: develop
Have you run this on a machine that has two different GPU archs?
config.features, config.arch_support_mfma, config.arch_support_wmma = get_arch_features(x)
config.substitutions.append(('%features', config.features))
if not config.arch:
    if agents:
I know this is probably not the place to do it, but is there somewhere we can emit a warning to the user that we are only using one of the available archs?
I'm not sure what the best place for that is. This seems like it might make most sense to me:
https://github.com/ROCm/rocMLIR/blob/develop/mlir/test/common_utils/common.py#L58
I think we do use multiple GPUs when using MITuna, so I don't think we can do this. We need to figure out which GPU MITuna is using.
This is related to systems that have multiple different GPUs. MITuna uses multiple identical GPUs, so I'm not sure how Tuna would behave with different GPUs. Since these changes only affect the lit config, they shouldn't impact Tuna in any way.
config.substitutions.append(('%features', config.features))
if not config.arch:
    if agents:
        config.arch = sorted(agents)[0]
Why are we sorting? Let's imagine the user sets HIP_VISIBLE_DEVICES="1,0". I guess HIP will return GPU 1 first? We want to run on that device, right?
Is there any guarantee we won't find different GPUs on the CI machines? I think at least we want to have a warning if that's the case.
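To make the sorting question concrete, here is a minimal sketch of how `sorted(agents)[0]` can differ from the first agent reported. It assumes `agents` is an ordered list of arch names; if the real lit config collects agents into a set, there is no detection order left to preserve. The function names and the example order are hypothetical and not taken from the PR.

```python
def pick_sorted(agents):
    """Selection as written in the diff: the lexicographically smallest arch name."""
    return sorted(agents)[0]


def pick_first_detected(agents):
    """Alternative: keep whichever agent was reported first (assumes an ordered list)."""
    return agents[0]


# Hypothetical detection order on a machine with two different GPUs.
agents = ["gfx1201", "gfx1100"]

print(pick_sorted(agents))          # gfx1100
print(pick_first_detected(agents))  # gfx1201
```

With this input the two strategies pick different targets, which is the case the review comment is worried about.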
Motivation
Resolve: https://ontrack-internal.amd.com/browse/SWDEV-559813
Technical Details
This change modifies the lit configuration to select only the first detected GPU agent when initializing config.arch.
Previously, all agents were concatenated into a single string (e.g. "gfx1201,gfx1100"), which caused issues on systems with multiple different GPUs since our tools expect a single target.
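The difference between the old and new behaviour, plus the warning requested in review, can be sketched as below. This is an illustration only, not the code in the PR: `old_arch_string` and `select_arch` are hypothetical helpers, `agents` is assumed to be an ordered list of arch names, and the sketch keeps detection order whereas the diff under review uses `sorted(agents)[0]`.

```python
import warnings


def old_arch_string(agents):
    """Previous behaviour as described above: all unique agents joined into one string."""
    return ",".join(dict.fromkeys(agents))


def select_arch(agents):
    """Pick a single arch and warn when the machine exposes more than one."""
    unique = list(dict.fromkeys(agents))  # de-duplicate while preserving order
    if not unique:
        return ""  # nothing detected; the caller decides how to handle this
    if len(unique) > 1:
        warnings.warn(
            f"Multiple GPU archs detected ({', '.join(unique)}); "
            f"using only {unique[0]}"
        )
    return unique[0]


agents = ["gfx1201", "gfx1100"]
print(old_arch_string(agents))  # gfx1201,gfx1100
print(select_arch(agents))      # gfx1201 (and emits a warning)
```

Machines with several identical GPUs de-duplicate to a single arch, so they would not trigger the warning in this sketch.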