
Conversation

@liligwu (Contributor) commented Sep 24, 2025

Backward (bwd) performance optimization for ROCm.
Fixes numerical issues.

meta-cla bot added the "cla signed" label on Sep 24, 2025
netlify bot commented Sep 24, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 570f148
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/690124a2371622000880421b
😎 Deploy Preview: https://deploy-preview-4925--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot

@haoyuz has imported this pull request. If you are a Meta employee, you can view this in D83116315.

@q10 (Contributor) commented Oct 2, 2025

@liligwu we're seeing

OSError: libtbb.so.12: cannot open shared object file: No such file or directory

We already install tbb here, so it might just be an issue of updating the build scripts to put libtbb in the LD_LIBRARY_PATH

@meta-codesync bot commented Oct 13, 2025

@q10 has imported this pull request. If you are a Meta employee, you can view this in D83116315.

@liligwu (Contributor, Author) commented Oct 13, 2025

> @liligwu we're seeing
>
> OSError: libtbb.so.12: cannot open shared object file: No such file or directory
>
> We already install tbb here, so it might just be an issue of updating the build scripts to put libtbb in the LD_LIBRARY_PATH

Hi @q10, sorry I missed your message.
I actually have this commit 4d2bfdd that links tbb explicitly, and it works in your container. Do you have any suggestions for fixing this issue in CI, please?

BTW, we discovered a numerical issue in 986cceb and reverted it in 85417b4, which unblocks merging the bwd optimization first.

Thank you.

@q10 (Contributor) commented Oct 14, 2025


I think this commit only addresses the build step, where we need to link to tbb. However, for runtime, you might need to run a find in $CONDA_PREFIX from inside the container and manually update LD_LIBRARY_PATH, or create a symlink, something like:

(print_exec ln -s "${conda_prefix}/lib/librhash.so" "${conda_prefix}/lib/librhash.so.0") || return 1

@liligwu (Contributor, Author) commented Oct 15, 2025


Hi @q10, I can see, a few lines below your example, a line that links tbb:

(print_exec ln -s "${conda_prefix}/lib/libtbb.so.12" "${conda_prefix}/lib/libtbb.so") || return 1

One more thing: installing tbb alone may not be sufficient. For example, on CentOS we dnf install -y tbb-devel tbb.

@liligwu changed the title from "forward performance tuning for MI350" to "backward performance optimization for MI350" on Oct 20, 2025
@ionuthristodorescu left a comment

I will also send some diffs on Slack.


// Compute shared memory size for cta_per_row
constexpr auto kCacheAccBytes = sizeof(at::acc_type<cache_t, true>);
int32_t num_cta_per_row_groups = kMaxThreads / kWarpSize;


Is this line and the one below common between CUDA and ROCm? If yes, we should add {% if rocm %} guards around them.


This is common.


{% if rocm %} guards are already applied. Do we need additional changes here?


Let me take a look and review in the final version so that we're on the same page.

// Compute shared memory size for cta_per_row
constexpr auto kCacheAccBytes = sizeof(at::acc_type<cache_t, true>);
int32_t num_cta_per_row_groups = kMaxThreads / kWarpSize;
int32_t total_L = indices.numel();


See the comment above: total_L seems to be used only in the ROCm path, so move it under #ifdef USE_ROCM?


Moved int32_t total_L = indices.numel(); down under {% if is_rocm %}.


I couldn't find the fixes; I'm not sure whether it's another PR I missed or they haven't been pushed yet. Would you mind sharing the latest version with all the fixes?


The fix for this is already at line 1065 in this PR; this display is outdated.


const bool use_deterministic_algorithms = at::globalContext().deterministicAlgorithms();
- const int max_segment_length_per_cta = use_deterministic_algorithms ? INT_MAX : 1024;
+ const int max_segment_length_per_cta = use_deterministic_algorithms ? INT_MAX : 4096;


This seems to affect the regular CUDA path as well; please use a {% if rocm %} guard to select between 1024 and 4096.


Applied a {% if is_rocm %} guard here.
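For reference, a minimal sketch of what such a guard could look like in the template, assuming the is_rocm flag mentioned in this thread (the placement is illustrative, not the final diff):

    const bool use_deterministic_algorithms = at::globalContext().deterministicAlgorithms();
    {%- if is_rocm %}
    const int max_segment_length_per_cta = use_deterministic_algorithms ? INT_MAX : 4096;
    {%- else %}
    const int max_segment_length_per_cta = use_deterministic_algorithms ? INT_MAX : 1024;
    {%- endif %}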

constexpr auto kCacheAccBytes = sizeof(at::acc_type<cache_t, true>);
int32_t num_cta_per_row_groups = kMaxThreads / kWarpSize;
int32_t total_L = indices.numel();
#ifdef USE_ROCM


USE_ROCM is ROCm-specific; could we guard it with a {% if rocm %} so it does not bleed into the CUDA codegen?


Applied a {% if is_rocm %} guard here.

FBGEMM_LAUNCH_KERNEL(
backward_cta_per_row_kernel,
cta_per_row_grid_size,
// (64, 2)


Do we need this comment?


Removed.

TORCH_CHECK_EQ(grad_outputs.size(), 1);

- constexpr int32_t max_segment_length_per_warp = 32;
+ constexpr int32_t max_segment_length_per_warp = 16384;


This path seems to be common with the regular CUDA path; could we use a {% if rocm %} guard to select between 32 and 16384?


The max_segment_length_per_warp passed by the host will be modified in embedding_split_host_pt2_autograd_template.cpp later. Reverting max_segment_length_per_warp here back to 32.
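For illustration, one way the host-side selection could be expressed, assuming the host template also exposes an is_rocm flag (a sketch, not the actual change in embedding_split_host_pt2_autograd_template.cpp):

    {%- if is_rocm %}
    constexpr int32_t max_segment_length_per_warp = 16384;
    {%- else %}
    constexpr int32_t max_segment_length_per_warp = 32;
    {%- endif %}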

const auto permute_output_dim_0_1 =
ctx->saved_data["permute_output_dim_0_1"].toBool();

constexpr int32_t max_segment_length_per_warp = 32;


This path seems to be common with the regular CUDA path; could we use a {% if rocm %} guard to select between 32 and 16384?


The max_segment_length_per_warp passed by the host will be modified in embedding_split_host_pt2_autograd_template.cpp later. Reverting max_segment_length_per_warp here back to 32.

TORCH_CHECK(aux_tensor[IDX_LXU_CACHE_LOCATIONS].has_value(), "lxu_cache_locations should have value.");
const auto lxu_cache_locations = aux_tensor[IDX_LXU_CACHE_LOCATIONS].value();
const auto is_experimental = aux_bool[IDX_IS_EXPERIMENTAL_TBE];
const auto mixed_D = aux_bool[IDX_MIXED_D];


This path seems to be common with the regular CUDA path; could we use a {% if rocm %} guard to select between 32 and 16384?


Applied a {% if is_rocm %} guard here.


I don't think rocm needs to be guarded here. mixed_D has been passed to all. It was not used in the forward function, so it's just being saved for backward.

Please use static_cast<bool>(aux_bool[IDX_MIXED_D]); then you can replace this line with ctx->saved_data["mixed_D"] = mixed_D.
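A minimal sketch of the suggested change, using the names from the snippet above (surrounding autograd code omitted):

    const auto lxu_cache_locations = aux_tensor[IDX_LXU_CACHE_LOCATIONS].value();
    const auto is_experimental = aux_bool[IDX_IS_EXPERIMENTAL_TBE];
    const auto mixed_D = static_cast<bool>(aux_bool[IDX_MIXED_D]);
    // ... later, when saving values for the backward pass:
    ctx->saved_data["mixed_D"] = mixed_D;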

// Workaround. Should not be upstreamed in any way.
// Redistribute all cta_per_row work to warp_per_row.
int32_t total_L = indices.numel();
{%- if (not nobag) and


Could we add a {% if rocm %} guard around this code? total_L is used only in the USE_ROCM path, and USE_ROCM is ROCm-specific (so if we do not guard it with {% if rocm %}, it will be codegen'ed for the regular CUDA paths as well).

@kudomcho commented Oct 22, 2025

By adding a Jinja guard, do you mean only applying {% if rocm %} around total_L, @ionuthristodorescu, i.e.

{% if rocm %}
    int32_t total_L = indices.numel();
{%- endif %}

or changing from USE_ROCM to {% if rocm %} entirely?


The source code will be generated, but it shouldn't be compiled. Are we trying to double-check here, or does it cause any issues? The CUDA path should not compile this. Besides, I think rocm is not passed into this file as a global variable, so the condition will always be false and total_L will never show up in the generated source code.


I agree with you. Either USE_ROCM or the Jinja {% if is_rocm %} guard is fine for us. Please let us know which one we should stick with.


In this file, we should stick with USE_ROCM. I think your original change is correct: rocm is not defined in this file, so Jinja will see it as false, and int32_t total_L = indices.numel(); may not show up in the generated source code. Let me review it in the final version.
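Under that decision, the guarded code would look roughly like the following (a sketch assuming total_L is only consumed inside the ROCm-specific block):

    // Workaround. Should not be upstreamed in any way.
    // Redistribute all cta_per_row work to warp_per_row.
    #ifdef USE_ROCM
    int32_t total_L = indices.numel();
    // ... ROCm-only redistribution logic that uses total_L ...
    #endif // USE_ROCM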

)
{%- endif %}

for (auto j = 0; j < kWarpSize && l_start + j < L; ++j) {


Here we should use a {% if rocm %} guard to select between the rolled and unrolled versions of the loop, so the regular, non-ROCm paths are not affected.


Applied a {% if is_rocm %} guard here.
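Structurally, that guard would look something like the sketch below; the loop bodies are placeholders, since the unrolled ROCm variant is not shown in this thread:

    {%- if is_rocm %}
    // manually unrolled ROCm variant of the loop (placeholder)
    {%- else %}
    for (auto j = 0; j < kWarpSize && l_start + j < L; ++j) {
      // original rolled loop body (placeholder)
    }
    {%- endif %}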


self.pooling_mode != PoolingMode.NONE
), "Mixed dimension tables only supported for pooling tables."

self.mixed_D = mixed_D

On this change, I assume mixed_D needs to be accessible as a module attribute? Would it cause any issues if it's not stored as self.mixed_D? I'm asking to check whether we need to split the PR into backend (C++ source code and codegen) and frontend (split_table_batched_embeddings_ops_training.py) changes.


The optimization of the warp_per_row and cta_per_row kernels will not be activated if self.mixed_D is not present.


