Conversation

@astroC86 (Contributor) commented Aug 21, 2025

Motivation

Closes #12

Technical Details

Test Plan

Test Result

Submission Checklist

@astroC86 astroC86 marked this pull request as draft August 21, 2025 11:32
@astroC86 astroC86 changed the title from "inital latency test" to "Implements latency test" Aug 21, 2025
@mawad-amd (Collaborator) commented Aug 21, 2025

Thanks for the PR. I will check it out later today (I understand it’s a draft, so I may leave high-level comments). You will have to exit gracefully when the number of ranks is not 2, because CI will try all numbers of ranks. You can ignore the failing label action.

@astroC86 (Contributor, author) commented Aug 21, 2025

You're most welcome. I just did it with two ranks for testing (as well as writing to the file). Would it be better to have it as a matrix, similar to the way we do the bandwidth test?

@mawad-amd (Collaborator):
A matrix would be nicer.

@mawad-amd mawad-amd added the "benchmarks" and "iris" (Iris project issue) labels Aug 21, 2025
@astroC86 astroC86 force-pushed the astroC86/load-store-latency branch from 459f636 to dd7038d August 31, 2025 19:45
@astroC86 astroC86 marked this pull request as ready for review August 31, 2025 19:51
@astroC86 astroC86 force-pushed the astroC86/load-store-latency branch from 1fbcf55 to e72704a August 31, 2025 19:52
```python
else:
    while tl.load(flag, cache_modifier=".cv", volatile=True) != token_first_done:
        pass
    iris.put(data + offsets, data + offsets, curr_rank, peer_rank, heap_bases, mask=data_mask)
```
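For context, this snippet implements one side of a ping-pong handshake: the rank spins on a flag until the peer writes a token, then issues its own put. A minimal CPU-side analogy of that protocol, using plain Python threads in place of the two ranks (all names here are illustrative, not Iris APIs):

```python
import threading
import time

# Shared one-element "flag" cell; the TOKEN values mimic the kernel's handshake tokens.
flag = [0]
TOKEN_FIRST_DONE = 1
TOKEN_SECOND_DONE = 2

def first_rank():
    flag[0] = TOKEN_FIRST_DONE           # analogous to the first rank's put
    while flag[0] != TOKEN_SECOND_DONE:  # spin until the peer responds
        pass

def second_rank():
    while flag[0] != TOKEN_FIRST_DONE:   # analogous to the spin-load above
        pass
    flag[0] = TOKEN_SECOND_DONE          # respond, completing one round trip

start = time.perf_counter_ns()
peer = threading.Thread(target=second_rank)
peer.start()
first_rank()
peer.join()
round_trip_ns = time.perf_counter_ns() - start
print(f"round trip: {round_trip_ns} ns")
```

Unlike this CPU sketch, the real kernel's spin needs the volatile, cache-bypassing load (`cache_modifier=".cv"`), since the GPU compiler would otherwise hoist the load out of the loop.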
Collaborator:
I am curious about the results you are getting. Also, do you think the load and store API would be better here to avoid the local load and store?

@astroC86 (Contributor, author) commented Aug 31, 2025
I am unable to allocate 8 MI300Xs on AMD cloud at the moment. I agree with you on the load/store part 👍

This may be relevant: using nvbandwidth on H100 I get the following:

```
Running device_to_device_latency_sm.
Device to Device Latency SM GPU(row) <-> GPU(column) (ns)
           0         1         2         3
 0       N/A    549.34    545.77    545.35
 1    550.28       N/A    660.95    659.08
 2    548.35    547.22       N/A    544.26
 3    545.49    831.96    543.86       N/A
```

The output from the Triton code is (ns):

```
       R0          R1          R2          R3
R0     0.000000    722.239990  701.119995  683.200012
R1     722.559998  0.000000    736.000000  727.679993
R2     701.440002  735.679993  0.000000    712.640015
R3     683.200012  727.679993  712.960022  0.000000
```

%error ((triton - nvband)*100/nvband):

```
       R0        R1        R2        R3
R0     N/A       31.47%    28.46%    25.28%
R1     31.31%    N/A       11.35%    10.41%
R2     27.92%    34.44%    N/A       30.94%
R3     25.25%    -12.53%   31.09%    N/A
```
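As a sanity check, each entry of the error matrix follows from the two latency tables via the formula above; verifying one entry (values copied from the GPU0→GPU1 cells):

```python
# Values copied from the tables above: GPU0 -> GPU1 latency in ns.
nvband = 549.34        # nvbandwidth (device_to_device_latency_sm)
triton = 722.239990    # Triton benchmark

pct_error = (triton - nvband) * 100 / nvband
print(f"{pct_error:.2f}%")  # 31.47%, matching the R0 -> R1 entry
```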

Collaborator:
Very interesting. How does it compare after using the load/store?

Collaborator:
Actually, how did that even work? read_realtime is using AMD GCN assembly. Did you change that to the equivalent PTX?

Collaborator:
Also, I just tested this on MI300X and it seems to deadlock.

Collaborator:
Yeah, this was what I initially had in mind for the microbenchmark. I am surprised you didn’t need to accumulate the result here. I remember everything was getting optimized away when we wrote similar code for the all load benchmark.

We wanted to add the cache modifiers and volatile arguments for a while but we haven’t yet. Let me think about this a bit more.

@astroC86 (Contributor, author) commented Sep 4, 2025
> I am surprised you didn’t need to accumulate the result here. I remember everything was getting optimized away when we wrote similar code for the all load benchmark.

Yeah, without the cache modifier and volatile it gets optimized away.

> We wanted to add the cache modifiers and volatile arguments for a while but we haven’t yet. Let me think about this a bit more.

No worries

Collaborator:
These numbers are not in nanoseconds; they are in clock cycles, yes?
See the ISA:
[screenshot of ISA documentation]
We have been using this function to find the clock.
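If the counter is indeed in cycles, converting to nanoseconds is just a division by the counter frequency. A sketch of the conversion (the 100 MHz value below is only an assumed example; the actual constant-rate counter frequency is ASIC-specific and must be taken from the ISA documentation):

```python
def cycles_to_ns(cycles: int, counter_hz: float) -> float:
    """Convert a raw cycle-counter delta to nanoseconds."""
    return cycles * 1e9 / counter_hz

# Assumed example: a 100 MHz constant-rate counter means 1 tick = 10 ns,
# so a delta of 72 ticks corresponds to 720 ns.
elapsed_ns = cycles_to_ns(72, 100e6)
print(elapsed_ns)
```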

@mawad-amd (Collaborator) commented Sep 5, 2025

Could you push your PTX to the CUDA port branch as well? You can just comment out the CDNA assembly over there for now.

@astroC86 (Contributor, author):
OK, will do. This is the PTX I have been using:

```python
@triton.jit
def read_realtime():
    # %globaltimer is NVIDIA's 64-bit global nanosecond timer, so this
    # returns a timestamp in ns (hence the ns-scale results above).
    tmp = tl.inline_asm_elementwise(
        asm="mov.u64 $0, %globaltimer;",
        constraints=("=l"),
        args=[],
        dtype=tl.int64,
        is_pure=False,
        pack=1,
    )
    return tmp
```

@mawad-amd (Collaborator) left a review:

Looks mostly good. Just a couple of comments.

@astroC86 astroC86 force-pushed the astroC86/load-store-latency branch from 09668ba to bf9a37c September 1, 2025 02:16
@astroC86 astroC86 force-pushed the astroC86/load-store-latency branch from 5a51b54 to 606853e September 1, 2025 03:04
@astroC86 astroC86 force-pushed the astroC86/load-store-latency branch from 356c147 to a3e9023 September 1, 2025 04:27
@astroC86 astroC86 marked this pull request as draft September 2, 2025 01:24
@astroC86 astroC86 force-pushed the astroC86/load-store-latency branch 3 times, most recently from 9910619 to 1509fd5 September 6, 2025 19:02
@astroC86 astroC86 force-pushed the astroC86/load-store-latency branch from f53375a to b301385 September 6, 2025 20:02
Successfully merging this pull request may close the linked issue: Prepare for open source.