@stefankoncarevic
Contributor

Motivation

close: https://github.com/ROCm/rocMLIR-internal/issues/1858

Technical Details

This change adds support for LDS transpose loads in both the single-buffering and double-buffering pipelines.
The implementation enables transpose-aware LDS loading for operands A and B, provided that both matrices use compatible memory layouts.
Currently the logic iterates only over the K dimension; iteration over the M and N dimensions is still under development and will follow in the next update.
Future work will focus on performance evaluation and on reducing bank conflicts during LDS access.

Test Plan

Basic functionality was verified using the existing MFMA pipeline tests for both single and double buffering.
Next, I will extend the tests to cover various matrix layout configurations and measure execution performance.
A detailed performance table and LDS bank conflict statistics will be added in a later comment to quantify the improvements.

Test Result

Submission Checklist

- Implemented rock.lds_transpose_load TD definition
  supporting f16 and bf16.
- Added verifier to ensure source memref is in workgroup (LDS)
  memory and indices match rank.
- Implemented lowering pattern to amdgpu.transpose_load.
- Created MLIR tests covering FP16 and BF16 loads with
  FileCheck patterns.
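
The verifier rules listed above can be modeled in a few lines. This is a minimal Python sketch of the checks only, not the actual C++ verifier: `MemRefType`, `verify_lds_transpose_load`, and the string-based memory space and element type names are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for MLIR concepts; the real op works on MLIR types.
WORKGROUP = "workgroup"                      # GPU workgroup (LDS) address space
SUPPORTED_ELEMENT_TYPES = {"f16", "bf16"}    # types named in the commit message


@dataclass(frozen=True)
class MemRefType:
    shape: tuple
    element_type: str
    memory_space: str


def verify_lds_transpose_load(source, indices):
    """Return a list of diagnostics; an empty list means the op is well-formed."""
    errors = []
    # Rule 1: the source memref must live in workgroup (LDS) memory.
    if source.memory_space != WORKGROUP:
        errors.append("source memref must be in workgroup (LDS) memory")
    # Rule 2: one index per memref dimension.
    if len(indices) != len(source.shape):
        errors.append(
            f"expected {len(source.shape)} indices, got {len(indices)}"
        )
    # Rule 3: only the element types the op definition supports.
    if source.element_type not in SUPPORTED_ELEMENT_TYPES:
        errors.append(f"unsupported element type {source.element_type}")
    return errors
```

Collecting diagnostics in a list rather than failing on the first one is a choice made for this sketch so that each rule can be exercised on its own.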

This commit introduces the full implementation of LDS transpose
load handling used in threadwise read and in the single-buffering
and double-buffering pipelines. It adds logic for computing
per-lane base offsets, generating LdsTransposeLoadOp instructions,
and managing vectorized fragment loading for MFMA operations.
The implementation supports multiple layout kinds
(e.g., L16x16, L32x16, L32x8) and dynamically expands offsets
for multi-K fused cases. This enables more flexible data movement
between LDS and registers for MFMA input tiles.
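
To make the idea of per-lane base offsets and multi-K expansion concrete, here is a simplified functional model in Python. It assumes a tile stored K-major in LDS (element `(k, m)` at `k * row_stride + m`) and one lane per M column; the function names, the `vec_len` of 4, and this lane mapping are assumptions for illustration and do not reproduce the actual L16x16, L32x16, or L32x8 mappings or the hardware instruction's lane shuffling.

```python
def lane_base_offset(lane, k_base, row_stride):
    """Element offset of the first value a lane reads: row k_base, column lane."""
    return k_base * row_stride + lane


def transpose_load_fragment(lds, lane, k_base, row_stride, vec_len=4):
    """Gather vec_len consecutive K values for one column of a K-major tile.

    The result is what a lane would hold after a transposed read: values that
    are strided in LDS become a contiguous per-lane vector.
    """
    base = lane_base_offset(lane, k_base, row_stride)
    return [lds[base + i * row_stride] for i in range(vec_len)]


def load_multi_k(lds, lane, row_stride, k_per_load, num_k_blocks, vec_len=4):
    """Expand the base offset across several K blocks (the multi-K case)."""
    return [
        transpose_load_fragment(lds, lane, kb * k_per_load, row_stride, vec_len)
        for kb in range(num_k_blocks)
    ]


# An 8 x 16 (K x M) tile whose element value equals its linear LDS offset.
K, M = 8, 16
lds = list(range(K * M))
fragments = load_multi_k(lds, lane=3, row_stride=M, k_per_load=4, num_k_blocks=2)
```

Because each element's value equals its offset in this example, `fragments` directly shows which LDS locations lane 3 touches in each K block.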

f16 and bf16 data types, with multiple K-dimension configurations
and schedule versions.
Added a CFG file to restrict execution to the gfx950 architecture,
so that the tests run only on supported hardware.
All test cases pass on gfx950.
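
An architecture gate of this kind is commonly written as a `lit.local.cfg` file, which is plain Python. The sketch below shows the general shape under stated assumptions: `config.unsupported` is the standard lit switch for skipping a directory, but the `arch` attribute and the `restrict_to_arch` helper are hypothetical, and the `SimpleNamespace` object only stands in for the `config` object that lit injects.

```python
from types import SimpleNamespace


def restrict_to_arch(config, required="gfx950"):
    """Mark the test directory unsupported unless the target arch matches.

    `config.arch` is a hypothetical attribute holding the target string;
    a real test suite may expose the architecture differently.
    """
    if required not in getattr(config, "arch", ""):
        config.unsupported = True


# Stand-in for the object lit passes to lit.local.cfg (assumption).
config = SimpleNamespace(arch="gfx942", unsupported=False)
restrict_to_arch(config)
```

In a real `lit.local.cfg` the stand-in object would be dropped and the check applied directly to the injected `config`.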

for double buffering. Added a global decision context to propagate
hwtranspose::Decision from BlockwiseGemm to ThreadwiseReadIntoOp.
Updated ThreadwiseReadIntoOp to attach hwtranspose attributes
when a valid decision is available. Fixed double-buffering
handling to ensure correct LDS access.
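
The propagation described above follows a record-then-lookup pattern, which can be sketched as follows. Everything here is a hypothetical Python model: the fields of `Decision`, the `DecisionContext` class, the operand keys, and the attribute name `hwtranspose.layout` are invented for illustration and are not taken from the actual hwtranspose::Decision type.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Decision:
    # Hypothetical fields; the real Decision may carry different data.
    use_transpose: bool
    layout: str  # e.g. "L16x16"


class DecisionContext:
    """Shared store written by the producer stage and read by the consumer."""

    def __init__(self) -> None:
        self._by_operand: Dict[str, Decision] = {}

    def record(self, operand: str, decision: Decision) -> None:
        self._by_operand[operand] = decision

    def lookup(self, operand: str) -> Optional[Decision]:
        return self._by_operand.get(operand)


def attach_attributes(attrs: dict, decision: Optional[Decision]) -> dict:
    """Return a copy of attrs, extended only when a valid decision exists."""
    out = dict(attrs)
    if decision is not None and decision.use_transpose:
        out["hwtranspose.layout"] = decision.layout
    return out


ctx = DecisionContext()
ctx.record("A", Decision(use_transpose=True, layout="L16x16"))
read_a = attach_attributes({"operand": "A"}, ctx.lookup("A"))
read_b = attach_attributes({"operand": "B"}, ctx.lookup("B"))
```

Operand B has no recorded decision in this example, so its read keeps its original attributes and would fall back to the regular load path.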