[DRAFT] Add Lds transpose load (ds.read_tr16) support on gfx950 for f16/bf16 #2029

stefankoncarevic · 2025-10-10T14:00:07Z

Motivation

DO NOT MERGE UNTIL #1906 IS MERGED
close: https://github.com/ROCm/rocMLIR-internal/issues/1858

Technical Details

This change adds LDS transpose load support to both the single-buffering and double-buffering pipelines.
Transpose-aware LDS loading is enabled for operands A and B, provided that both matrices use compatible memory layouts.
Currently, the logic iterates over the K dimension only; iteration over the M and N dimensions is still under development and will be refined in the next update.
Future work will focus on performance evaluation and on optimizing bank conflict patterns during LDS access.

Git diff of direct to LDS vs. transpose load:
dsreadtr16_vs_direct_to_lds2.txt

Test Plan

Basic functionality was verified using existing MFMA pipeline tests for both single and double buffering.
Next, I will extend the tests to cover various matrix layout configurations and measure execution performance.
A detailed performance table and LDS bank conflict statistics will be added in a later comment to quantify the improvements.

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

dhernandez0 · 2025-10-13T08:46:41Z

If this is based on the direct to LDS branch, can you please change the PR so that it merges into that branch? That makes it easier to review. We won't merge it into that branch, of course.


- Implemented rock.lds_transpose_load TD definition supporting f16 and bf16.
- Added verifier to ensure source memref is in workgroup (LDS) memory and indices match rank.
- Implemented lowering pattern to amdgpu.transpose_load.
- Created MLIR tests covering FP16 and BF16 loads with FileCheck patterns.
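The two verifier conditions listed above (source memref in workgroup memory, index count equal to the memref rank) can be illustrated with a small standalone model. This is only a sketch: `AddressSpace`, `MemRefInfo`, `VerifyResult`, and `verifyLdsTransposeLoad` are hypothetical stand-ins invented for illustration, not the rocMLIR or MLIR API, and the actual verifier in the PR may check more than this.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the MLIR memref type information.
// None of these names come from rocMLIR; they only model the two checks.
enum class AddressSpace { Global, Workgroup, Private };

struct MemRefInfo {
  AddressSpace space; // where the source buffer lives
  std::size_t rank;   // number of dimensions of the source memref
};

enum class VerifyResult { Ok, SourceNotInLds, IndexCountMismatch };

// Models the verifier: the source must be in workgroup (LDS) memory,
// and one index must be supplied per memref dimension.
constexpr VerifyResult verifyLdsTransposeLoad(MemRefInfo source,
                                              std::size_t numIndices) {
  if (source.space != AddressSpace::Workgroup)
    return VerifyResult::SourceNotInLds;
  if (numIndices != source.rank)
    return VerifyResult::IndexCountMismatch;
  return VerifyResult::Ok;
}
```

Checking the address space first means a load from global memory is reported as such even when its index count is also wrong; that ordering is a choice of this sketch, not something the PR states.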


This commit introduces the full implementation of LDS transpose load handling used in threadwise reads and in the single-buffering and double-buffering pipelines. It adds logic for computing per-lane base offsets, generating LdsTransposeLoadOp instructions, and managing vectorized fragment loading for MFMA operations. The implementation supports multiple layout kinds (e.g., L16x16, L32x16, L32x8) and dynamically expands offsets for multi-K fused cases. This enables more flexible data movement between LDS and registers for MFMA input tiles.

LDS transpose load decisions for both A and B operands. It adds architecture-aware selection of transpose configurations using hwtranspose::makeDecision, based on MFMA instruction shape and per-block tile sizes. Threadwise reads from LDS now attach transpose metadata when applicable, allowing the backend to emit LDSTransposeLoadOp for efficient wave-level data rearrangement.


f16 and bf16 data types, with multiple K-dimension configurations and schedule versions. Add CFG file to restrict execution to gfx950 architecture only, ensuring tests run exclusively on supported hardware. All test cases have passed validation under gfx950.
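Taken together, the constraints stated in this PR (gfx950 only, f16/bf16 only, and mPerBlock and nPerBlock no larger than 32) can be sketched as a single eligibility check. The names `Arch`, `ElemType`, `kMaxTransposeTile`, and `canUseLdsTransposeLoad` are hypothetical and chosen for illustration; the PR's own selection logic goes through hwtranspose::makeDecision, whose signature and additional layout checks are not shown here.

```cpp
// Hypothetical enums; the real code identifies architectures and element
// types through its own infrastructure.
enum class Arch { Gfx942, Gfx950 };
enum class ElemType { F16, BF16, F32 };

// Upper bound on mPerBlock / nPerBlock taken from the commit description
// "Restrict LDS transpose load to M/N tiles <= 32".
constexpr int kMaxTransposeTile = 32;

// Returns true only when every stated precondition holds:
// gfx950 target, f16 or bf16 elements, and both block tile sizes <= 32.
constexpr bool canUseLdsTransposeLoad(Arch arch, ElemType type,
                                      int mPerBlock, int nPerBlock) {
  return arch == Arch::Gfx950 &&
         (type == ElemType::F16 || type == ElemType::BF16) &&
         mPerBlock <= kMaxTransposeTile &&
         nPerBlock <= kMaxTransposeTile;
}
```

A configuration that fails this check would simply fall back to the regular LDS read path; the check says nothing about layout compatibility between A and B, which the PR handles separately.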

stefankoncarevic · 2025-10-17T15:40:00Z

Closing this; opened a new PR, #2043.

stefankoncarevic requested a review from causten as a code owner October 10, 2025 14:00

stefankoncarevic marked this pull request as draft October 10, 2025 14:07

stefankoncarevic self-assigned this Oct 10, 2025

stefankoncarevic changed the base branch from develop to direct_to_lds2 October 13, 2025 09:07

dhernandez0 force-pushed the direct_to_lds2 branch 6 times, most recently from fe5315c to 4a0b2ce Compare October 16, 2025 12:47

Base automatically changed from direct_to_lds2 to develop October 16, 2025 17:17

dhernandez0 and others added 12 commits October 17, 2025 08:56

Direct to LDS

19531f5

This PR adds support for direct to LDS, allowing ThreadwiseReadIntoOp to write to LDS directly. We change the LDS layout to use this functionality and add new direct to LDS scheduling options.

Add helper functions to query MFMA intrinsic K and non-K dimensions

fc88159

Add conditional HW transpose setup for A and B in single buffering pipeline

74fae98

When DirectToLDS is enabled, the pipeline now computes per-operand transpose decisions (decisionA, decisionB) based on MFMA shape and layout info before invoking LDS transpose loads.

Restrict LDS transpose load to M/N tiles ≤ 32

50d0935

Add a safeguard to skip configurations where mPerBlock or nPerBlock exceeds 32, since larger tile sizes are not yet supported by the current LDS transpose load implementation.

Add test file to cmake.

4a08d22

Add new lines at the end of files

281bd62

Add logic for LDS Transpose handling

5835742

Resolve build errors and merge conflicts

30a69cf

stefankoncarevic force-pushed the dsreadtr16_lds_transpose_load branch from e150da4 to 30a69cf Compare October 17, 2025 14:01

stefankoncarevic closed this Oct 17, 2025
