
Conversation

@protonu (Collaborator) commented Oct 3, 2025

This PR:

  1. Modifies the pointwise scheduler so that it can handle the BlockQuantization op.
  2. Tests a fusion with a single block quantization op, including the case where the output of the block quantization op has a swizzled output domain (a sketch follows at the end of this description).
  3. Tests the case where the fusion has multiple pointwise ops prior to the block quantization op. Note that for this test, the output of the fusion has not been compared against the baseline where quantization is handled by the normalization scheduler.

This is stacked on top of:
#5266
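
For reference, a minimal sketch of the single-op case in item 2, assuming the blockQuantize helper used in the tests; the split factors and swizzle steps are illustrative only, not the PR's exact scheduling:

    // Hypothetical sketch: one block quantization op whose block scales
    // output is given a swizzled allocation domain. The 128/4 factors mirror
    // a typical nvfp4 scale-tile layout but are illustrative.
    auto fusion = std::make_unique<Fusion>();
    FusionGuard fg(fusion.get());

    auto* tv_in = makeContigTensor(2, DataType::Float);
    fusion->addInput(tv_in);

    auto quantization_results = blockQuantize(tv_in);
    fusion->addOutput(quantization_results.block_scales);
    fusion->addOutput(quantization_results.quantized_tensor);

    // Swizzle: split the scales' axes, then record the resulting domain as
    // the allocation domain so the output is laid out in tiles.
    auto* scales = quantization_results.block_scales;
    scales->split(0, 128); // [m/128, 128, k]
    scales->split(2, 4);   // [m/128, 128, k/4, 4]
    scales->setAllocationDomain(scales->getLoopDomain(), true);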


github-actions bot commented Oct 3, 2025

Review updated until commit 8d8c115

Description

  • Enable the pointwise scheduler to handle the BlockQuantization op

  • Ensure the block scales output is a fusion output

  • Support a swizzled memory layout for block scales

  • Add vectorization for the quantized and scale tensors


Changes walkthrough 📝

Relevant files

Enhancement

pointwise.cpp (csrc/scheduler/pointwise.cpp): Add BlockQuantization support in the pointwise scheduler

  • Added a check for the BlockQuantization op in canScheduleCompileTime
  • Ensures the block scales output is a fusion output
  • Added vectorization support for block quantization outputs
  • +18/-0
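
A hedged sketch of the kind of compile-time check this entry describes, reusing the hasNonTerminalBlockQuantizeOp utility introduced below (names follow the walkthrough; this is not the PR's verbatim code):

    // Sketch: reject the fusion for the pointwise scheduler when a
    // BlockQuantizationOp's block scales tensor is not a fusion output.
    if (registry_utils::hasNonTerminalBlockQuantizeOp(fusion)) {
      scheduler_debug_utils::canScheduleRejectReason(
          schedulerType(),
          "block scales output of BlockQuantizationOp must be a fusion output");
      return false;
    }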
registry_utils.cpp (csrc/scheduler/registry_utils.cpp): Add utility to check block quantization output

  • Implemented hasNonTerminalBlockQuantizeOp to check whether block scales is a fusion output
  • Returns true if the block scales tensor is not marked as an output
  • +13/-0
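
A sketch of what this entry describes, assuming the op exposes blockScales() as quoted later in this review (not the file's verbatim code):

    // Returns true if any BlockQuantizationOp produces block scales that are
    // not marked as a fusion output, i.e., the op is "non-terminal".
    bool hasNonTerminalBlockQuantizeOp(Fusion* fusion) {
      for (Expr* expr : fusion->exprs()) {
        if (auto* bq_op = dynamic_cast<BlockQuantizationOp*>(expr)) {
          if (!bq_op->blockScales()->isFusionOutput()) {
            return true;
          }
        }
      }
      return false;
    }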
utils.cpp (csrc/scheduler/utils.cpp): Skip caching for the block scales output tensor

  • Added isBlockScaleOutput helper to identify the block scales tensor
  • Used in cacheAndForkOutputs to skip caching block scales
  • +14/-2
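
A hedged sketch of how the helper is likely consumed in cacheAndForkOutputs (the surrounding loop is paraphrased, not the file's exact code):

    // Skip block scales when caching fusion outputs so that no cache copy
    // disturbs their (possibly swizzled) allocation domain.
    for (auto* output : ir_utils::filterByType<TensorView>(fusion->outputs())) {
      if (isBlockScaleOutput(output)) {
        continue; // block scales keep their original allocation domain
      }
      output->cacheBefore(); // existing caching path for ordinary outputs
    }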
registry_utils.h (csrc/scheduler/registry_utils.h): Declare the block quantization output check function

  • Declared hasNonTerminalBlockQuantizeOp
  • Documents the check for block scales being a fusion output
  • +4/-0
Bug fix

vectorize_helper.cpp (csrc/scheduler/vectorize_helper.cpp): Allow splits in the block scales allocation domain

  • Added isTvBlockScalesOutputOfBlockQuantization to detect block scales
  • Skips device-split validation for the block scales allocation domain
  • +21/-1
Tests

test_low_precision_recipe.cpp (tests/cpp/test_low_precision_recipe.cpp): Add tests for block quantization with swizzle

  • Added a swizzle_output parameter to createNVFP4QunatizationFusion
  • Added a test for a single op with swizzled output
  • Added a test for multiple ops before block quantization
  • +157/-3

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The validation of allocation-domain splits for the block scales output of block quantization only checks that each expression is a Split and skips non-device splits; it does not validate the structure or correctness of the splits beyond that, so malformed split patterns could pass silently. A possible tightening is sketched after the snippet.

    auto is_block_scales_output =
        isTvBlockScalesOutputOfBlockQuantization(of_tv);
    for (Expr* expr : exprs | std::views::reverse) {
      if (is_block_scales_output) {
        NVF_ERROR(
            expr->isA<Split>(),
            "alloc domain of block quantization should only have splits");
        if (!expr->as<Split>()->outer()->isDeviceDim()) {
          continue;
        }
      } else {
        validateDeviceSplit(expr);
      }
      // ... rest of the loop body elided in the review snippet ...
    }
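
One possible tightening, sketched here for illustration (Split::factor() and isConstScalar() are existing IR accessors; this is not a proposed patch from the PR):

    // Hypothetical stricter variant: besides requiring Split exprs, demand a
    // static (constant) split factor for non-device splits so that malformed
    // split patterns fail loudly instead of being silently skipped.
    if (is_block_scales_output) {
      NVF_ERROR(
          expr->isA<Split>(),
          "alloc domain of block quantization should only have splits");
      auto* split = expr->as<Split>();
      if (!split->outer()->isDeviceDim()) {
        NVF_ERROR(
            split->factor()->isConstScalar(),
            "expected a static split factor in the block scales layout");
        continue;
      }
    }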
Possible Issue

The function isBlockScaleOutput checks whether a TensorView is the blockScales output of a BlockQuantizationOp. The explicit null check on tv->definition() overlaps with the isA check that follows it; the short-circuiting || keeps the call safe, but the two conditions are partly redundant.

    bool isBlockScaleOutput(TensorView* tv) {
      // True iff tv is produced by a BlockQuantizationOp and is specifically
      // that op's blockScales() result.
      if (tv->definition() == nullptr ||
          !tv->definition()->isA<BlockQuantizationOp>()) {
        return false;
      }
      auto bq_op = tv->definition()->as<BlockQuantizationOp>();
      return bq_op->blockScales() == tv;
    }
Test Coverage

The test AutoScheduleMultipleOps adds pointwise operations before block quantization but does not verify the output against a baseline, which limits confidence in the scheduling of complex fusions; a minimal sanity-check sketch follows the listing.

    TEST_F(BQTest, AutoScheduleMultipleOps) {
      const int m = 1024;
      const int n = 1024;
      const int k = 16;
      std::vector<at::Tensor> inputs;
      inputs.push_back(at::randn({n, k}, at::device(at::kCUDA).dtype(at::kFloat)));
      inputs.push_back(
          at::randn({m, n, k}, at::device(at::kCUDA).dtype(at::kFloat)));
    
      std::unique_ptr<Fusion> fusion_new_op = std::make_unique<Fusion>();
      FusionGuard fg2(fusion_new_op.get());
    
      auto tv_in_1 = makeContigTensor(2, DataType::Float);
      auto tv_in_2 = makeContigTensor(3, DataType::Float);
      fusion_new_op->addInput(tv_in_1);
      fusion_new_op->addInput(tv_in_2);
    
      // tv_in_1 = broadcast(tv_in_1, {false, false, true});
    
      // tv_in_1 is 2D; it is broadcast against the 3D tv_in_2 in the add below
      auto t_add = add(tv_in_1, tv_in_2);
      auto t_relu = relu(t_add);
      auto quantization_results = blockQuantize(t_relu);
      quantization_results.quantized_tensor->setMemoryType(MemoryType::Local);
      // auto t_out = set(quantization_results.quantized_tensor);
    
      // outputs are 3D
      fusion_new_op->addOutput(quantization_results.block_scales);
      fusion_new_op->addOutput(quantization_results.quantized_tensor);
    
      FusionExecutorCache executor_cache(std::move(fusion_new_op));
      auto outputs_new_op = executor_cache.runFusionWithInputs(inputs);
    
      // Retrieve the outputs; correctness is not checked against a baseline
      auto block_scales_output = outputs_new_op[0].as<at::Tensor>();
      auto quantized_tensor_output = outputs_new_op[1].as<at::Tensor>();
    }
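
Pending a real baseline comparison, even minimal sanity checks would strengthen the test; the following is a hypothetical addition, not code from the PR:

      // Hypothetical minimal checks: the outputs should at least be defined
      // CUDA tensors. A comparison against a reference quantization is still
      // the preferable fix.
      EXPECT_TRUE(block_scales_output.defined());
      EXPECT_TRUE(quantized_tensor_output.defined());
      EXPECT_TRUE(block_scales_output.is_cuda());
      EXPECT_TRUE(quantized_tensor_output.is_cuda());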

@protonu protonu changed the title from "[DRAFT] Have the pointwise scheduler process the Block Quantization Op." to "Have the pointwise scheduler process the Block Quantization Op." Oct 7, 2025
@protonu (Collaborator, Author) commented Oct 7, 2025

    !test

Copilot AI left a comment

Pull Request Overview

This PR enables the pointwise scheduler to handle Block Quantization operations by adding necessary checks and vectorization support. The changes allow block quantization ops to be processed through the pointwise scheduling path when appropriate.

  • Added compile-time validation to ensure the block scales output is a fusion output
  • Enhanced vectorization support for block quantization outputs
  • Added tests for single-op and multi-op fusion scenarios with block quantization

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

  • tests/cpp/test_low_precision_recipe.cpp: Added test cases for auto-scheduling block quantization ops, including swizzled output scenarios
  • csrc/scheduler/vectorize_helper.cpp: Enhanced contiguous dimension mapping to handle block scales output allocation domains
  • csrc/scheduler/utils.cpp: Modified output caching to exclude block scales outputs from caching
  • csrc/scheduler/registry_utils.h: Added declaration for checking non-terminal block quantization ops
  • csrc/scheduler/registry_utils.cpp: Implemented function to detect non-terminal block quantization operations
  • csrc/scheduler/pointwise.cpp: Added compile-time checks and vectorization support for block quantization ops


    public ::testing::WithParamInterface<DataType> {};
    namespace {
    -void createNVFP4QunatizationFusion(Fusion* fusion, DataType data_hp_dtype) {
    +void createNVFP4QunatizationFusion(
Copilot AI commented Oct 7, 2025
Corrected spelling of 'Qunatization' to 'Quantization'.

Suggested change:

    -void createNVFP4QunatizationFusion(
    +void createNVFP4QuantizationFusion(

Comment on lines +494 to +495:

    quantization_results.block_scales->axis(4),
    quantization_results.block_scales->axis(5)};
Copilot AI commented Oct 7, 2025

axis(5) is accessed, but only five axes (0-4) exist after the splits: the block_scales tensor has 3 dimensions initially, and after the splits the valid indices are 0-4.

Suggested change:

    -quantization_results.block_scales->axis(4),
    -quantization_results.block_scales->axis(5)};
    +quantization_results.block_scales->axis(4)};

    auto t_relu = relu(t_add);
    auto quantization_results = blockQuantize(t_relu);
    quantization_results.quantized_tensor->setMemoryType(MemoryType::Local);
    // auto t_out = set(quantization_results.quantized_tensor);
Copilot AI commented Oct 7, 2025

Remove commented-out code that is not being used.

Suggested change:

    -// auto t_out = set(quantization_results.quantized_tensor);

@naoyam (Collaborator) commented Oct 7, 2025

I think this is too early to review. Let's focus on the first PR.

@protonu (Collaborator, Author) commented Oct 9, 2025

> I think this is too early to review. Let's focus on the first PR.

I'll close this PR since we changed our implementation of the block quantization op: we no longer use a reshape before the quantization.

I have an updated PR for this: #5362


@protonu protonu closed this Oct 9, 2025