
@eky-amd commented on Oct 20, 2025

Changes

Wrote a Gluon implementation of the online softmax kernel, found in aiter/ops/triton/gluon/softmax.py.

  • The Gluon kernel mostly uses the same parameters as the Triton kernel (e.g. num_warps, waves_per_eu).
  • The Triton kernel in aiter/ops/triton/softmax.py was modified so that the num_stages parameter is correctly passed to the tl.range loops.
  • The Gluon implementation omits the num_stages parameter, since the compiler does not appear to generate pipelined code when hardware-specific load/store operations (e.g. gl.amd.cdna4.buffer_load) are used.
  • Instead, the pipelining (equivalent to num_stages = 2) is written by hand in the Gluon kernel; a plain-Triton sketch of the underlying loop is included after this list for reference.
  • A blocked layout is used to load and store contiguous column elements from the input. threads_per_warp and warps_per_cta are both determined by the hardware context. size_per_thread is determined by the given block size, but capped at the number of elements a single thread can load for the input tensor's datatype.
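
For reference, below is a minimal sketch, in plain Triton rather than Gluon, of the online-softmax loop both kernels implement and of the num_stages knob on tl.range. It is not the code in this PR; the kernel name, BLOCK size, and launch parameters are hypothetical.

```python
# Minimal sketch only, not the kernel in this PR: plain Triton, illustrating the
# online-softmax recurrence and the num_stages pipelining knob on tl.range loops.
import torch
import triton
import triton.language as tl


@triton.jit
def _online_softmax_row(x_ptr, y_ptr, n_cols, BLOCK: tl.constexpr, NUM_STAGES: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    # Running per-lane max and rescaled sum (the "online" part). A large negative
    # sentinel is used instead of -inf so that padded lanes never produce NaN.
    m = tl.full([BLOCK], -1e30, dtype=tl.float32)
    s = tl.zeros([BLOCK], dtype=tl.float32)
    # num_stages controls compiler-driven software pipelining of this loop in
    # Triton; the Gluon kernel writes the equivalent 2-stage pipeline by hand.
    for off in tl.range(0, n_cols, BLOCK, num_stages=NUM_STAGES):
        x = tl.load(x_ptr + row * n_cols + off + cols,
                    mask=off + cols < n_cols, other=-1e30).to(tl.float32)
        m_new = tl.maximum(m, x)
        # Rescale the old sum to the new max before adding this block's contribution.
        s = s * tl.exp(m - m_new) + tl.exp(x - m_new)
        m = m_new
    # Reduce the per-lane statistics to one max and one sum for the whole row.
    m_row = tl.max(m, axis=0)
    s_row = tl.sum(s * tl.exp(m - m_row), axis=0)
    # Second pass: normalize and store.
    for off in tl.range(0, n_cols, BLOCK, num_stages=NUM_STAGES):
        x = tl.load(x_ptr + row * n_cols + off + cols,
                    mask=off + cols < n_cols, other=-1e30).to(tl.float32)
        tl.store(y_ptr + row * n_cols + off + cols,
                 tl.exp(x - m_row) / s_row, mask=off + cols < n_cols)


# Hypothetical launch: one program per row of a contiguous (M, N) tensor.
x = torch.randn(8192, 8192, device="cuda")
y = torch.empty_like(x)
_online_softmax_row[(x.shape[0],)](x, y, x.shape[1], BLOCK=1024, NUM_STAGES=2)
```

The Gluon kernel replaces the compiler-driven num_stages pipelining of these loops with an explicit two-stage schedule of the loads and the max/sum updates.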

Testing

A benchmark script for softmax can be found in op_tests/op_benchmarks/triton/bench_softmax.py; it reports time and bandwidth metrics. Run with the -gluon flag to get timing results for the Gluon kernel. By default, the benchmark runs softmax on tensors with N = 8192 and M ranging from 1 to 8192. If the -N flag is passed, N varies instead, with M fixed at 8192.
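
For example (assuming the script is invoked directly with Python; -gluon and -N are the flags described above):

```sh
# Gluon kernel, sweeping M with N = 8192 (the default)
python op_tests/op_benchmarks/triton/bench_softmax.py -gluon
# Gluon kernel, sweeping N with M = 8192
python op_tests/op_benchmarks/triton/bench_softmax.py -gluon -N
```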

Correctness for the Gluon kernel was tested with the op_tests/triton_tests/test_softmax.py script.
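
A check of this kind typically compares the kernel output against torch.softmax; for example, reusing x and y from the sketch above (the tolerances here are illustrative, not necessarily those in the test):

```python
# Illustrative reference comparison; not the actual code in test_softmax.py.
torch.testing.assert_close(y, torch.softmax(x, dim=-1), rtol=1e-3, atol=1e-3)
```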

Below is a bandwidth plot comparing the performance of the Triton and Gluon kernels:
[bandwidth plot: Triton vs. Gluon softmax kernels]

@eky-amd requested a review from vgokhale on October 21, 2025 at 19:25
