
@eky-amd commented on Oct 20, 2025

Changes

Wrote a Gluon implementation of the online softmax kernel, found in aiter/ops/triton/gluon/softmax.py.

  • The Gluon kernel mostly uses the same parameters as the Triton kernel (e.g. num_warps, waves_per_eu).
  • The Triton kernel in aiter/ops/triton/softmax.py was modified so that the num_stages parameter is correctly passed to the tl.range loops.
  • The Gluon implementation omits the num_stages parameter, since the compiler does not appear to generate pipelined code when hardware-specific load/store operations (e.g. gl.amd.cdna4.buffer_load) are used.
  • Instead, the pipelining (equivalent to num_stages = 2) is written by hand in the Gluon kernel; a plain-Triton sketch of the underlying loop is included after this list for reference.
  • A blocked layout is used to load and store contiguous column elements from the input. threads_per_warp and warps_per_cta are both determined by the hardware context. size_per_thread is determined by the given block size, but capped at the number of elements a single thread can load for the input tensor's datatype.
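
For reference, below is a minimal sketch, in plain Triton rather than Gluon, of the online-softmax loop both kernels implement and of the num_stages knob on tl.range. It is not the code in this PR; the kernel name, BLOCK size, and launch parameters are hypothetical.

```python
# Minimal sketch only, not the kernel in this PR: plain Triton, illustrating the
# online-softmax recurrence and the num_stages pipelining knob on tl.range loops.
import torch
import triton
import triton.language as tl


@triton.jit
def _online_softmax_row(x_ptr, y_ptr, n_cols, BLOCK: tl.constexpr, NUM_STAGES: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    # Running per-lane max and rescaled sum (the "online" part). A large negative
    # sentinel is used instead of -inf so that padded lanes never produce NaN.
    m = tl.full([BLOCK], -1e30, dtype=tl.float32)
    s = tl.zeros([BLOCK], dtype=tl.float32)
    # num_stages controls compiler-driven software pipelining of this loop in
    # Triton; the Gluon kernel writes the equivalent 2-stage pipeline by hand.
    for off in tl.range(0, n_cols, BLOCK, num_stages=NUM_STAGES):
        x = tl.load(x_ptr + row * n_cols + off + cols,
                    mask=off + cols < n_cols, other=-1e30).to(tl.float32)
        m_new = tl.maximum(m, x)
        # Rescale the old sum to the new max before adding this block's contribution.
        s = s * tl.exp(m - m_new) + tl.exp(x - m_new)
        m = m_new
    # Reduce the per-lane statistics to one max and one sum for the whole row.
    m_row = tl.max(m, axis=0)
    s_row = tl.sum(s * tl.exp(m - m_row), axis=0)
    # Second pass: normalize and store.
    for off in tl.range(0, n_cols, BLOCK, num_stages=NUM_STAGES):
        x = tl.load(x_ptr + row * n_cols + off + cols,
                    mask=off + cols < n_cols, other=-1e30).to(tl.float32)
        tl.store(y_ptr + row * n_cols + off + cols,
                 tl.exp(x - m_row) / s_row, mask=off + cols < n_cols)


# Hypothetical launch: one program per row of a contiguous (M, N) tensor.
x = torch.randn(8192, 8192, device="cuda")
y = torch.empty_like(x)
_online_softmax_row[(x.shape[0],)](x, y, x.shape[1], BLOCK=1024, NUM_STAGES=2)
```

The Gluon kernel replaces the compiler-driven num_stages pipelining of these loops with an explicit two-stage schedule of the loads and the max/sum updates.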

Testing

A benchmark script for softmax can be found in op_tests/op_benchmarks/triton/bench_softmax.py; it reports time and bandwidth metrics. Run with the -gluon flag to get timing results for the Gluon kernel. By default, the benchmark runs softmax on tensors with N = 8192 and M ranging from 1 to 8192. If the -N flag is passed, N varies instead, with M fixed at 8192.
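
For example (assuming the script is invoked directly with Python; -gluon and -N are the flags described above):

```sh
# Gluon kernel, sweeping M with N = 8192 (the default)
python op_tests/op_benchmarks/triton/bench_softmax.py -gluon
# Gluon kernel, sweeping N with M = 8192
python op_tests/op_benchmarks/triton/bench_softmax.py -gluon -N
```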

Correctness for the Gluon kernel was tested with the op_tests/triton_tests/test_softmax.py script.
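
A check of this kind typically compares the kernel output against torch.softmax; for example, reusing x and y from the sketch above (the tolerances here are illustrative, not necessarily those in the test):

```python
# Illustrative reference comparison; not the actual code in test_softmax.py.
torch.testing.assert_close(y, torch.softmax(x, dim=-1), rtol=1e-3, atol=1e-3)
```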

Below is a bandwidth plot comparing the performance of the Triton and Gluon kernels:
[bandwidth plot: Triton vs. Gluon softmax kernels]

@eky-amd requested a review from vgokhale on October 21, 2025 at 19:25
