Skip to content

Conversation

tbqh
Copy link
Collaborator

@tbqh tbqh commented Aug 25, 2025

Resolves #4773 which tracks a wrong calculation in the pointwise scheduler noticed by @zasdfgbnm:

The unit of cur_transfer_size and right_transfer_size were in the unit of byte ^ N, where N is the number of dimensions on the left/right of the break point... Later in code, we are comparing the transfer sizes with L2 cache. This makes no sense. We can not compare a quantity with unit byte ^ N with a quantity with unit byte.

Fixing it shows a perf improvement in test_nvfuser:

./bin/test_nvfuser --gtest_filter=PointwiseTest.*

- Before PR: 30 tests from PointwiseTest (7252 ms total)
- After PR:  30 tests from PointwiseTest (7500 ms total)

TODO: test benchmarks

@tbqh tbqh requested review from naoyam and zasdfgbnm August 25, 2025 15:53
Copy link

Description

  • Fix incorrect transfer_size calculation unit mismatch

  • Remove duplicated bit_multiple multiplication in loop

  • Correctly compute transfer size in bytes

  • Align transfer size with L2 cache comparison


Changes walkthrough 📝

Relevant files
Bug fix
pointwise.cpp
Correct transfer size computation in pointwise scheduler 

csrc/scheduler/pointwise.cpp

  • Initialize cur_transfer_size_bit and right_transfer_size_bit with
    correct bit multiples
  • Remove redundant multiplication by lhs_bit_multiple and
    rhs_bit_multiple inside loops
  • Correctly accumulate element counts across loop dimensions
  • Ensure final transfer size is in bytes for proper L2 cache comparison
  • +4/-5     

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review

    Possible Issue

    The initialization of cur_transfer_size_bit and right_transfer_size_bit has been changed to use only the bit multiples, but the subsequent multiplication with elem_counts may not correctly represent the total transfer size in bits, potentially leading to incorrect transfer cost estimation.

    int64_t cur_transfer_size_bit = lhs_bit_multiple;
    int64_t right_transfer_size_bit = rhs_bit_multiple;
    
    for (const auto left_i : arange(break_point_i)) {
      cur_transfer_size_bit = cur_transfer_size_bit * elem_counts[left_i];
    }
    
    for (const auto right_i : arange(break_point_i, ref_loop.size())) {
      right_transfer_size_bit =
          right_transfer_size_bit * elem_counts[right_i];
    }
    cur_transfer_size_bit *= right_transfer_size_bit;

    int64_t cur_transfer_size_bit = 1;
    int64_t right_transfer_size_bit = 1;
    int64_t cur_transfer_size_bit = lhs_bit_multiple;
    int64_t right_transfer_size_bit = rhs_bit_multiple;
    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    The issue prior to this change is that we were multiplying by the element bit-size inside of the for loop that multiplies all dimensions. The bit-size should only be included in once in the whole product. Now the for-loop computes the total number of elements on each side of the break point.

    Copy link
    Collaborator

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    The change makes sense. Let's see what the performance would look like.

    @zasdfgbnm
    Copy link
    Collaborator

    In the above code (line 372), we have

        // How much would this transfer cost if it was done as a 1-D schedule
        int64_t transfer_size_1d_bit = 1;
    
        for (const auto i : arange(ref_loop.size())) {
          transfer_size_1d_bit =
              transfer_size_1d_bit * elem_counts[i] * dtype_sum_bit;
        }

    This number still has a unit of bit^n instead of bit.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    None yet
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    Pointwise heuristics incorrect memory access estimation
    3 participants