
Conversation

@protonu (Collaborator) commented Oct 3, 2025

This PR:

  1. Modifies the pointwise scheduler so that it can handle the BlockQuantization op.
  2. Tests a fusion with a single block quantization op, including the case where the output of the block quantization op has a swizzled output domain (a sketch follows at the end of this description).
  3. Tests the case where the fusion has multiple pointwise ops prior to the block quantization op. Note that for this test, the output of the fusion has not been compared against the baseline where quantization is handled by the normalization scheduler.

This is stacked on top of:
#5266
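
For reference, a minimal sketch of the single-op case in item 2, assuming the blockQuantize helper used in the tests; the split factors and swizzle steps are illustrative only, not the PR's exact scheduling:

    // Hypothetical sketch: one block quantization op whose block scales
    // output is given a swizzled allocation domain. The 128/4 factors mirror
    // a typical nvfp4 scale-tile layout but are illustrative.
    auto fusion = std::make_unique<Fusion>();
    FusionGuard fg(fusion.get());

    auto* tv_in = makeContigTensor(2, DataType::Float);
    fusion->addInput(tv_in);

    auto quantization_results = blockQuantize(tv_in);
    fusion->addOutput(quantization_results.block_scales);
    fusion->addOutput(quantization_results.quantized_tensor);

    // Swizzle: split the scales' axes, then record the resulting domain as
    // the allocation domain so the output is laid out in tiles.
    auto* scales = quantization_results.block_scales;
    scales->split(0, 128); // [m/128, 128, k]
    scales->split(2, 4);   // [m/128, 128, k/4, 4]
    scales->setAllocationDomain(scales->getLoopDomain(), true);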


github-actions bot commented Oct 3, 2025

Review updated until commit 8d8c115

Description

  • Enable the pointwise scheduler to handle the BlockQuantization op

  • Ensure the block scales output is a fusion output

  • Support a swizzled memory layout for block scales

  • Add vectorization for the quantized and scale tensors


Changes walkthrough 📝

Relevant files

Enhancement

pointwise.cpp (csrc/scheduler/pointwise.cpp): Add BlockQuantization support in the pointwise scheduler

  • Added a check for the BlockQuantization op in canScheduleCompileTime
  • Ensures the block scales output is a fusion output
  • Added vectorization support for block quantization outputs
  • +18/-0
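
A hedged sketch of the kind of compile-time check this entry describes, reusing the hasNonTerminalBlockQuantizeOp utility introduced below (names follow the walkthrough; this is not the PR's verbatim code):

    // Sketch: reject the fusion for the pointwise scheduler when a
    // BlockQuantizationOp's block scales tensor is not a fusion output.
    if (registry_utils::hasNonTerminalBlockQuantizeOp(fusion)) {
      scheduler_debug_utils::canScheduleRejectReason(
          schedulerType(),
          "block scales output of BlockQuantizationOp must be a fusion output");
      return false;
    }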
registry_utils.cpp (csrc/scheduler/registry_utils.cpp): Add utility to check block quantization output

  • Implemented hasNonTerminalBlockQuantizeOp to check whether block scales is a fusion output
  • Returns true if the block scales tensor is not marked as an output
  • +13/-0
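
A sketch of what this entry describes, assuming the op exposes blockScales() as quoted later in this review (not the file's verbatim code):

    // Returns true if any BlockQuantizationOp produces block scales that are
    // not marked as a fusion output, i.e., the op is "non-terminal".
    bool hasNonTerminalBlockQuantizeOp(Fusion* fusion) {
      for (Expr* expr : fusion->exprs()) {
        if (auto* bq_op = dynamic_cast<BlockQuantizationOp*>(expr)) {
          if (!bq_op->blockScales()->isFusionOutput()) {
            return true;
          }
        }
      }
      return false;
    }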
utils.cpp (csrc/scheduler/utils.cpp): Skip caching for the block scales output tensor

  • Added isBlockScaleOutput helper to identify the block scales tensor
  • Used in cacheAndForkOutputs to skip caching block scales
  • +14/-2
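
A hedged sketch of how the helper is likely consumed in cacheAndForkOutputs (the surrounding loop is paraphrased, not the file's exact code):

    // Skip block scales when caching fusion outputs so that no cache copy
    // disturbs their (possibly swizzled) allocation domain.
    for (auto* output : ir_utils::filterByType<TensorView>(fusion->outputs())) {
      if (isBlockScaleOutput(output)) {
        continue; // block scales keep their original allocation domain
      }
      output->cacheBefore(); // existing caching path for ordinary outputs
    }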
registry_utils.h (csrc/scheduler/registry_utils.h): Declare the block quantization output check function

  • Declared hasNonTerminalBlockQuantizeOp
  • Documents the check for block scales being a fusion output
  • +4/-0
Bug fix

vectorize_helper.cpp (csrc/scheduler/vectorize_helper.cpp): Allow splits in the block scales allocation domain

  • Added isTvBlockScalesOutputOfBlockQuantization to detect block scales
  • Skips device-split validation for the block scales allocation domain
  • +21/-1
Tests

test_low_precision_recipe.cpp (tests/cpp/test_low_precision_recipe.cpp): Add tests for block quantization with swizzle

  • Added a swizzle_output parameter to createNVFP4QunatizationFusion
  • Added a test for a single op with swizzled output
  • Added a test for multiple ops before block quantization
  • +157/-3

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The validation of allocation-domain splits for the block scales output of block quantization only checks that each expression is a Split and skips non-device splits; it does not validate the structure or correctness of the splits beyond that, so malformed split patterns could pass silently. A possible tightening is sketched after the snippet.

    auto is_block_scales_output =
        isTvBlockScalesOutputOfBlockQuantization(of_tv);
    for (Expr* expr : exprs | std::views::reverse) {
      if (is_block_scales_output) {
        NVF_ERROR(
            expr->isA<Split>(),
            "alloc domain of block quantization should only have splits");
        if (!expr->as<Split>()->outer()->isDeviceDim()) {
          continue;
        }
      } else {
        validateDeviceSplit(expr);
      }
      // ... rest of the loop body elided in the review snippet ...
    }
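
One possible tightening, sketched here for illustration (Split::factor() and isConstScalar() are existing IR accessors; this is not a proposed patch from the PR):

    // Hypothetical stricter variant: besides requiring Split exprs, demand a
    // static (constant) split factor for non-device splits so that malformed
    // split patterns fail loudly instead of being silently skipped.
    if (is_block_scales_output) {
      NVF_ERROR(
          expr->isA<Split>(),
          "alloc domain of block quantization should only have splits");
      auto* split = expr->as<Split>();
      if (!split->outer()->isDeviceDim()) {
        NVF_ERROR(
            split->factor()->isConstScalar(),
            "expected a static split factor in the block scales layout");
        continue;
      }
    }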
Possible Issue

The function isBlockScaleOutput checks whether a TensorView is the blockScales output of a BlockQuantizationOp. The explicit null check on tv->definition() overlaps with the isA check that follows it; the short-circuiting || keeps the call safe, but the two conditions are partly redundant.

    bool isBlockScaleOutput(TensorView* tv) {
      // True iff tv is produced by a BlockQuantizationOp and is specifically
      // that op's blockScales() result.
      if (tv->definition() == nullptr ||
          !tv->definition()->isA<BlockQuantizationOp>()) {
        return false;
      }
      auto bq_op = tv->definition()->as<BlockQuantizationOp>();
      return bq_op->blockScales() == tv;
    }
Test Coverage

The test AutoScheduleMultipleOps adds pointwise operations before block quantization but does not verify the output against a baseline, which limits confidence in the scheduling of complex fusions; a minimal sanity-check sketch follows the listing.

    TEST_F(BQTest, AutoScheduleMultipleOps) {
      const int m = 1024;
      const int n = 1024;
      const int k = 16;
      std::vector<at::Tensor> inputs;
      inputs.push_back(at::randn({n, k}, at::device(at::kCUDA).dtype(at::kFloat)));
      inputs.push_back(
          at::randn({m, n, k}, at::device(at::kCUDA).dtype(at::kFloat)));
    
      std::unique_ptr<Fusion> fusion_new_op = std::make_unique<Fusion>();
      FusionGuard fg2(fusion_new_op.get());
    
      auto tv_in_1 = makeContigTensor(2, DataType::Float);
      auto tv_in_2 = makeContigTensor(3, DataType::Float);
      fusion_new_op->addInput(tv_in_1);
      fusion_new_op->addInput(tv_in_2);
    
      // tv_in_1 = broadcast(tv_in_1, {false, false, true});
    
      // tv_in_1 is 2D; it is broadcast against the 3D tv_in_2 in the add below
      auto t_add = add(tv_in_1, tv_in_2);
      auto t_relu = relu(t_add);
      auto quantization_results = blockQuantize(t_relu);
      quantization_results.quantized_tensor->setMemoryType(MemoryType::Local);
      // auto t_out = set(quantization_results.quantized_tensor);
    
      // outputs are 3D
      fusion_new_op->addOutput(quantization_results.block_scales);
      fusion_new_op->addOutput(quantization_results.quantized_tensor);
    
      FusionExecutorCache executor_cache(std::move(fusion_new_op));
      auto outputs_new_op = executor_cache.runFusionWithInputs(inputs);
    
      // Retrieve the outputs; correctness is not checked against a baseline
      auto block_scales_output = outputs_new_op[0].as<at::Tensor>();
      auto quantized_tensor_output = outputs_new_op[1].as<at::Tensor>();
    }
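
Pending a real baseline comparison, even minimal sanity checks would strengthen the test; the following is a hypothetical addition, not code from the PR:

      // Hypothetical minimal checks: the outputs should at least be defined
      // CUDA tensors. A comparison against a reference quantization is still
      // the preferable fix.
      EXPECT_TRUE(block_scales_output.defined());
      EXPECT_TRUE(quantized_tensor_output.defined());
      EXPECT_TRUE(block_scales_output.is_cuda());
      EXPECT_TRUE(quantized_tensor_output.is_cuda());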

@protonu protonu changed the title from "[DRAFT] Have the pointwise scheduler process the Block Quantization Op." to "Have the pointwise scheduler process the Block Quantization Op." Oct 7, 2025
@protonu (Collaborator, Author) commented Oct 7, 2025

    !test

Copilot AI left a comment

Pull Request Overview

This PR enables the pointwise scheduler to handle Block Quantization operations by adding necessary checks and vectorization support. The changes allow block quantization ops to be processed through the pointwise scheduling path when appropriate.

  • Added compile-time validation to ensure the block scales output is a fusion output
  • Enhanced vectorization support for block quantization outputs
  • Added tests for single-op and multi-op fusion scenarios with block quantization

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

  • tests/cpp/test_low_precision_recipe.cpp: Added test cases for auto-scheduling block quantization ops, including swizzled output scenarios
  • csrc/scheduler/vectorize_helper.cpp: Enhanced contiguous dimension mapping to handle block scales output allocation domains
  • csrc/scheduler/utils.cpp: Modified output caching to exclude block scales outputs from caching
  • csrc/scheduler/registry_utils.h: Added declaration for checking non-terminal block quantization ops
  • csrc/scheduler/registry_utils.cpp: Implemented function to detect non-terminal block quantization operations
  • csrc/scheduler/pointwise.cpp: Added compile-time checks and vectorization support for block quantization ops


    public ::testing::WithParamInterface<DataType> {};
    namespace {
    -void createNVFP4QunatizationFusion(Fusion* fusion, DataType data_hp_dtype) {
    +void createNVFP4QunatizationFusion(
Copilot AI commented Oct 7, 2025
Corrected spelling of 'Qunatization' to 'Quantization'.

Suggested change:

    -void createNVFP4QunatizationFusion(
    +void createNVFP4QuantizationFusion(

Comment on lines +494 to +495:

    quantization_results.block_scales->axis(4),
    quantization_results.block_scales->axis(5)};
Copilot AI commented Oct 7, 2025

axis(5) is accessed, but only five axes (0-4) exist after the splits: the block_scales tensor has 3 dimensions initially, and after the splits the valid indices are 0-4.

Suggested change:

    -quantization_results.block_scales->axis(4),
    -quantization_results.block_scales->axis(5)};
    +quantization_results.block_scales->axis(4)};

    auto t_relu = relu(t_add);
    auto quantization_results = blockQuantize(t_relu);
    quantization_results.quantized_tensor->setMemoryType(MemoryType::Local);
    // auto t_out = set(quantization_results.quantized_tensor);
Copilot AI commented Oct 7, 2025

Remove commented-out code that is not being used.

Suggested change:

    -// auto t_out = set(quantization_results.quantized_tensor);

@naoyam (Collaborator) commented Oct 7, 2025

I think this is too early to review. Let's focus on the first PR.

@protonu (Collaborator, Author) commented Oct 9, 2025

> I think this is too early to review. Let's focus on the first PR.

I'll close this PR since we changed our implementation of the block quantization op: we no longer use a reshape before the quantization.

I have an updated PR for this: #5362


@protonu protonu closed this Oct 9, 2025