protonu (Collaborator) commented Oct 9, 2025

Stacked on top of: #5266

This extends the pointwise scheduler to accept a fusion with a block quantization op.
The block scale output of the block quantization op must be a fusion segment output.
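
As a rough illustration of the restriction above, the following sketch rejects a fusion whenever some block quantization op has a block scales tensor that is not a fusion output. `MockTensor`, `MockBlockQuantizeOp`, `hasNonTerminalBlockScales`, and `canSchedulePointwiseSketch` are hypothetical stand-ins invented for this example; they are not nvFuser's classes and not the PR's actual implementation.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a tensor in a fusion (not nvFuser's TensorView).
struct MockTensor {
  std::string name;
  bool is_fusion_output;
};

// Hypothetical stand-in for a block quantization op and its two outputs.
struct MockBlockQuantizeOp {
  const MockTensor* quantized;
  const MockTensor* block_scales;
};

// Returns true if any op's block scales tensor is missing or is not a
// fusion output, i.e. the op is "non-terminal" in the sense of this PR.
bool hasNonTerminalBlockScales(const std::vector<MockBlockQuantizeOp>& ops) {
  for (const auto& op : ops) {
    if (op.block_scales == nullptr || !op.block_scales->is_fusion_output) {
      return true;
    }
  }
  return false;
}

// Assumed accept/reject decision: schedule only if every block scales
// tensor is a fusion output.
bool canSchedulePointwiseSketch(const std::vector<MockBlockQuantizeOp>& ops) {
  return !hasNonTerminalBlockScales(ops);
}
```

In this sketch a fusion with no block quantization ops is trivially accepted, since there is nothing to violate the restriction.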

github-actions bot commented Oct 9, 2025

Description

  • Enable pointwise scheduler for block quantization ops

  • Ensure block scales output is fusion output

  • Add transitive check for block scales output

  • Add test for auto-scheduling with swizzle


Changes walkthrough 📝

Relevant files
Bug fix

logical_domain_map.cpp: Refine domain mapping for block quantization
csrc/logical_domain_map.cpp (+1/-1)

  • Restrict non-mapping domain logic to only when producer is input and
    consumer is block scales
  • Ensures correct handling of last logical dimension in block
    quantization context

utils.cpp: Avoid caching block scales outputs
csrc/scheduler/utils.cpp (+4/-1)

  • Exclude block scales outputs from caching/forking in
    cacheAndForkOutputs
  • Prevents incorrect memory optimization on block quantization outputs

Enhancement

pointwise.cpp: Support block quantization in pointwise scheduling
csrc/scheduler/pointwise.cpp (+18/-0)

  • Add check to reject scheduling if block scales is not a fusion output
  • Include block quantization outputs in vectorization list

registry_utils.cpp: Add check for terminal block quantization output
csrc/scheduler/registry_utils.cpp (+13/-0)

  • Implement hasNonTerminalBlockQuantizeOp to check if block scales is
    not a fusion output
  • Used to reject invalid scheduling configurations

domain_map.cpp: Skip domain checks for block scales outputs
csrc/scheduler/tools/domain_map.cpp (+53/-1)

  • Add isTransitiveBlockScaleOuput helper to detect block scales through
    producer chain
  • Skip validity checks for such outputs in isValidReference

registry_utils.h: Declare block quantization output check function
csrc/scheduler/registry_utils.h (+4/-0)

  • Declare hasNonTerminalBlockQuantizeOp function
  • Used in pointwise scheduler to validate fusion structure

Tests

test_low_precision_recipe.cpp: Add auto-scheduling test for block quantization
tests/cpp/test_low_precision_recipe.cpp (+69/-0)

  • Add test AutoScheduleSingleOpWithSwizzle for pointwise scheduling of
    block quantization
  • Validates correctness against baseline implementation

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Scheduling Restriction

    The PR rejects scheduling if any Block Quantization Op's block scales are not fusion outputs, but does not validate whether this restriction is necessary or if alternative handling (e.g., intermediate usage) could be supported. This limitation should be justified with performance or correctness reasoning.

    if (registry_utils::hasNonTerminalBlockQuantizeOp(fusion)) {
      scheduler_debug_utils::canScheduleRejectReason(
          schedulerType(),
          "no support for block quantization where block scales is not a fusion "
          "output");
      return false;
    }

    Transitive Check Logic

    The function isTransitiveBlockScaleOuput traverses only the first producer in case of multiple producers, which may miss valid block scale paths. This simplification could lead to incorrect domain mapping decisions if other producer branches contain block scale outputs.

      // Move to the first producer for continued traversal
      // If there are multiple producers, we check the first one for simplicity
      // This could be extended to check all paths if needed
      if (!producers.empty()) {
        current_tv = producers[0];
      } else {
        current_tv = nullptr; // No more producers to check
      }
    }
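
To make the concern above concrete, here is a minimal sketch that contrasts a first-producer-only walk with a traversal that visits every producer branch. `MockTv`, `firstProducerOnly`, and `anyProducerPath` are hypothetical names for this example; the graph model is an assumption and does not reproduce the PR's `isTransitiveBlockScaleOuput`.

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a tensor with links to its producers.
struct MockTv {
  bool is_block_scales;
  std::vector<const MockTv*> producers;
};

// Follows only producers[0] at each step, as in the quoted excerpt.
bool firstProducerOnly(const MockTv* tv) {
  while (tv != nullptr) {
    if (tv->is_block_scales) {
      return true;
    }
    tv = tv->producers.empty() ? nullptr : tv->producers[0];
  }
  return false;
}

// Depth-first search over all producer branches; the visited set avoids
// re-expanding tensors that are reachable through several paths.
bool anyProducerPath(const MockTv* start) {
  std::vector<const MockTv*> stack{start};
  std::unordered_set<const MockTv*> visited;
  while (!stack.empty()) {
    const MockTv* tv = stack.back();
    stack.pop_back();
    if (tv == nullptr || !visited.insert(tv).second) {
      continue;
    }
    if (tv->is_block_scales) {
      return true;
    }
    for (const MockTv* producer : tv->producers) {
      stack.push_back(producer);
    }
  }
  return false;
}
```

When a tensor has two producers and only the second branch leads back to a block scales tensor, `firstProducerOnly` reports false while `anyProducerPath` reports true. Whether such a fusion can actually reach this code path in the PR is not established here.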

    CacheFork Bypass

    The cacheFork optimization skips outputs of BlockQuantizationOp that are block scales, but does not consider whether such outputs might still benefit from caching in certain fusion patterns. This could limit optimization opportunities for complex fusions involving block quantization.

    output->definition()->isA<ScatterOp>() ||
    (output->definition()->isA<BlockQuantizationOp>() &&
     output->definition()->as<BlockQuantizationOp>()->blockScales() ==
         output)) {
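
The two conditions visible in the excerpt above can be sketched in isolation as follows. `OpKind`, `MockDefinition`, `MockOutput`, and `skipCacheAndFork` are hypothetical names for this example. The earlier terms of the real condition are not shown in the excerpt and are not modeled, and returning false for an output without a definition is an assumption of this sketch.

```cpp
struct MockOutput;

// Hypothetical tag standing in for the isA<...>() checks in the excerpt.
enum class OpKind { Scatter, BlockQuantization, Other };

// Hypothetical stand-in for an output's defining expression.
struct MockDefinition {
  OpKind kind;
  const MockOutput* block_scales;  // only meaningful for BlockQuantization
};

struct MockOutput {
  const MockDefinition* definition;
};

// True when the output should be left out of caching/forking: it is
// produced by a scatter op, or it is the block scales output of a block
// quantization op. The quantized output of the same op is not skipped.
bool skipCacheAndFork(const MockOutput& output) {
  const MockDefinition* def = output.definition;
  if (def == nullptr) {
    return false;  // assumption: outputs without a definition are not skipped
  }
  return def->kind == OpKind::Scatter ||
      (def->kind == OpKind::BlockQuantization &&
       def->block_scales == &output);
}
```

The pointer comparison is what distinguishes the two outputs of one block quantization op: only the one registered as block scales matches.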
