Implements latency test #114
Conversation
Thanks for the PR. I will check it out later today (I understand it's a draft, so I may leave high-level comments). You will have to exit gracefully when the number of ranks is not 2, because CI will try all numbers of ranks. You can ignore the failing label action.

You're most welcome. I just did it as two ranks for testing (as well as writing to the file). Would it be better to have it as a matrix, similar to the way we do the bandwidth test?

A matrix would be nicer.
tests/examples/test_load_latency.py (outdated)
```python
else:
    while tl.load(flag, cache_modifier=".cv", volatile=True) != token_first_done:
        pass
    iris.put(data + offsets, data + offsets, curr_rank, peer_rank, heap_bases, mask=data_mask)
```
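The snippet above is the responder side of a ping-pong handshake: spin on the peer's flag, then put the payload back. As a host-side analogy only (pure Python with threads, not the Iris/Triton API), the round-trip measurement looks like:

```python
import threading
import time

# Host-side analogy of the ping-pong latency test (not the Iris/Triton API):
# the initiator sets a flag, the responder spins on it and answers,
# and the initiator times the round trip.
ping = threading.Event()
pong = threading.Event()

def responder():
    ping.wait()  # stand-in for spinning on the flag
    pong.set()   # stand-in for putting the payload back

t = threading.Thread(target=responder)
t.start()

start = time.perf_counter_ns()
ping.set()
pong.wait()
rtt_ns = time.perf_counter_ns() - start
t.join()
print(f"round-trip: {rtt_ns} ns")
```

On a GPU the same structure is timed with a device clock read before and after the handshake, which is what `read_realtime` below provides.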
I am curious about the results you are getting. Also, do you think the load and store API would be better here to avoid the local load and store?
I am unable to allocate 8 MI300Xs at the moment on AMD cloud. I agree with you on the load/store part 👍
This may be relevant: using nvbandwidth on H100 I get the following:
```
Running device_to_device_latency_sm.
Device to Device Latency SM GPU(row) <-> GPU(column) (ns)
         0        1        2        3
0      N/A   549.34   545.77   545.35
1   550.28      N/A   660.95   659.08
2   548.35   547.22      N/A   544.26
3   545.49   831.96   543.86      N/A
```
The output from the Triton code is:

|    | R0         | R1         | R2         | R3         |
|----|------------|------------|------------|------------|
| R0 | 0.000000   | 722.239990 | 701.119995 | 683.200012 |
| R1 | 722.559998 | 0.000000   | 736.000000 | 727.679993 |
| R2 | 701.440002 | 735.679993 | 0.000000   | 712.640015 |
| R3 | 683.200012 | 727.679993 | 712.960022 | 0.000000   |
%error ((triton - nvband)*100/nvband):

|    | R0     | R1      | R2     | R3     |
|----|--------|---------|--------|--------|
| R0 | N/A    | 31.47%  | 28.46% | 25.28% |
| R1 | 31.31% | N/A     | 11.35% | 10.41% |
| R2 | 27.92% | 34.44%  | N/A    | 30.94% |
| R3 | 25.25% | -12.53% | 31.09% | N/A    |
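For reference, each %error entry follows directly from the two tables using the formula above; e.g. the R0→R1 cell:

```python
def percent_error(triton_val: float, nvband_val: float) -> float:
    # %error = (triton - nvband) * 100 / nvband, as defined above
    return (triton_val - nvband_val) * 100.0 / nvband_val

# R0 -> R1 cell: Triton reading 722.239990 vs nvbandwidth 549.34
err = percent_error(722.239990, 549.34)
print(f"{err:.2f}%")  # matches the 31.47% entry in the table
```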
Very interesting. How does it compare after using the load/store?
Actually, how did that even work? `read_realtime` is using AMD GCN assembly. Did you change that to the equivalent PTX?
Also, I just tested this on MI300X and it seems to deadlock.
Yeah, this was what I initially had in mind for the microbenchmark. I am surprised you didn’t need to accumulate the result here. I remember everything was getting optimized away when we wrote similar code for the all load benchmark.
We wanted to add the cache modifiers and volatile arguments for a while but we haven’t yet. Let me think about this a bit more.
> I am surprised you didn't need to accumulate the result here. I remember everything was getting optimized away when we wrote similar code for the all load benchmark.

Yeah, without the cache modifier and volatile it gets optimized away.

> We wanted to add the cache modifiers and volatile arguments for a while but we haven't yet. Let me think about this a bit more.

No worries.
These numbers are not in nanoseconds; they are in clock cycles, yes?
See the ISA.
We have been using this function to find the clock.
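If the device timer ticks at a fixed reference clock rather than reporting nanoseconds, the conversion is a simple scale. The 100 MHz default below is an assumption for illustration only; the real rate should be queried from the runtime (e.g. HIP's `hipDeviceAttributeWallClockRate`) before trusting the results:

```python
def ticks_to_ns(ticks: int, ref_clock_hz: int = 100_000_000) -> float:
    """Convert raw timer ticks to nanoseconds.

    ref_clock_hz is assumed here (100 MHz); query the actual wall-clock
    rate of the device timer rather than relying on this default.
    """
    return ticks * 1e9 / ref_clock_hz

# e.g. 55 ticks at an assumed 100 MHz reference clock -> 550.0 ns
print(ticks_to_ns(55))
```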
Could you also push your PTX to the CUDA port branch? You can just comment out the CDNA assembly over there for now.
Ok, will do. This is the PTX I have been using:

```python
@triton.jit
def read_realtime():
    # %globaltimer is the PTX global timer register, read via inline asm
    tmp = tl.inline_asm_elementwise(
        asm="mov.u64 $0, %globaltimer;",
        constraints=("=l"),
        args=[],
        dtype=tl.int64,
        is_pure=False,
        pack=1,
    )
    return tmp
```
Looks mostly good. Just a couple of comments.
## Motivation

Closes #12

## Technical Details

## Test Plan

## Test Result

## Submission Checklist