
Conversation

@jjsjann123 (Collaborator) commented Sep 9, 2025

Tries to fix #5391.

But it looks like the changes to the id_model loop graph aren't done right; i.e. loopPromotionMap is not properly mapping IDs for the layout op, for some reason.

@jjsjann123 (Collaborator Author)

!test

@github-actions bot commented Sep 9, 2025

Review updated until commit 81dd726

Description

  • Fix loop promotion mapping by excluding allocation domains

  • Refactor ID graph initialization to handle loop domains correctly

  • Add test case for loop promotion issue with layout ops

  • Adjust vectorization logic for sub-byte data types


Changes walkthrough 📝

Relevant files

Bug fix

csrc/id_model/id_model.cpp: Refactor ID collection and graph initialization (+116/-11)
  • Added findAllIdsExceptAllocation to collect non-allocation IterDomains
  • Implemented initializeIdGraphExcludeAllocation for proper loop graph initialization
  • Updated buildIntersection to use non-allocation IDs and support permissive mapping
  • Modified buildStatefulInliningInfo to use filtered ID collections

csrc/id_model/loop_promotion.cpp: Fix loop promotion mapping logic (+2/-2)
  • Updated buildIntersection call to use permissive mapping
  • Fixed loop promotion map builder to check group existence before comparison

Enhancement

csrc/scheduler/pointwise.cpp: Adjust vectorization for sub-byte types (+4/-1)
  • Added detection of sub-byte data types
  • Modified vectorization factor to handle sub-byte types

csrc/id_model/id_model.h: Update ID model interface (+5/-1)
  • Declared new initializeIdGraphExcludeAllocation method
  • Updated buildIntersection signature to include permissive parameter

csrc/logical_domain_map.h: Support grouped matmul layout op (+4/-0)
  • Added handler for PreprocessGroupedMatmulInputSf op

Tests

tests/cpp/test_layout_op.cpp: Add loop promotion test case (+65/-0)
  • Added test case for loop promotion issue with layout ops
  • Created fusion with explicit quantization pattern and grouped matmul

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The function findAllIdsExceptAllocation includes IDs from multiple domains and finds paths between them, but it may incorrectly include or exclude allocation-related IDs. The comment at line 1011 suggests that using allIDs is incorrect and that allocation paths should be excluded, but it is unclear whether the new logic fully resolves this issue.

    void findAllIdsExceptAllocation(
        const TensorView* tv,
        VectorOfUniqueEntries<IterDomain*>& discovered_ids) {
      std::vector<const std::vector<IterDomain*>*> all_domains = {
          &tv->getLoopDomain(),
          &tv->getLogicalDomain(),
          &tv->getInitialLoopDomain(),
          &tv->domain()->additionalIDs()};
      if (tv->hasRoot()) {
        all_domains.push_back(&tv->getRootDomain());
      }
      if (tv->getAlternateLoopDomain().has_value()) {
        all_domains.push_back(&tv->getAlternateLoopDomain().value());
      }
    
      for (auto domain : all_domains) {
        discovered_ids.pushBack(*domain);
      }
    
      // We only care about IDs on the shortest path between domains
      std::unordered_multimap<IterDomain*, IterDomain*> out2in;
      for (auto i : arange(all_domains.size() - 1)) {
        if (all_domains[i]->empty()) {
          continue;
        }
        for (auto j : arange(i + 1, all_domains.size())) {
          if (all_domains[j]->empty()) {
            continue;
          }
          auto path = getExprsBetween<IRBFS>(
                          {all_domains[i]->begin(), all_domains[i]->end()},
                          {all_domains[j]->begin(), all_domains[j]->end()},
                          false)
                          .first;
          for (auto [expr, _] : path) {
            discovered_ids.pushBack(
                ir_utils::filterByType<IterDomain>(expr->outputs()));
            discovered_ids.pushBack(
                ir_utils::filterByType<IterDomain>(expr->inputs()));
          }
        }
      }
    }
    
    std::vector<IterDomain*> findAllIdsExceptAllocation(const TensorView* tv) {
      VectorOfUniqueEntries<IterDomain*> discovered_ids;
      findAllIdsExceptAllocation(tv, discovered_ids);
      return discovered_ids.vector();
    }
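The pairwise shortest-path collection above can be illustrated with a generic sketch. Everything below (`Graph`, `idsOnPath`, the node names) is a hypothetical stand-in for nvFuser's IRBFS machinery, not its actual API: it walks BFS from one domain toward another and keeps only nodes on discovered paths, which is how the PR avoids pulling in allocation-only IDs the way a blanket allIDs() would.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical mini-graph: maps a node to its neighbors.
using Graph = std::map<std::string, std::vector<std::string>>;

// Collect every node on a BFS path from any node in `from` to any node
// in `to`; nodes reachable only via other branches (e.g. an allocation
// branch) are visited but never recorded.
std::set<std::string> idsOnPath(
    const Graph& g,
    const std::vector<std::string>& from,
    const std::vector<std::string>& to) {
  std::map<std::string, std::string> parent;  // node -> predecessor
  std::queue<std::string> frontier;
  for (const auto& n : from) {
    parent.emplace(n, "");  // sources have no predecessor
    frontier.push(n);
  }
  std::set<std::string> targets(to.begin(), to.end());
  std::set<std::string> result;
  while (!frontier.empty()) {
    auto n = frontier.front();
    frontier.pop();
    if (targets.count(n)) {
      // Walk the predecessor chain back to a source, recording nodes.
      for (auto cur = n; !cur.empty(); cur = parent[cur]) {
        result.insert(cur);
      }
      continue;
    }
    auto it = g.find(n);
    if (it == g.end()) {
      continue;
    }
    for (const auto& next : it->second) {
      if (parent.emplace(next, n).second) {  // not yet visited
        frontier.push(next);
      }
    }
  }
  return result;
}
```

With edges root -> {logical, alloc} and logical -> loop, asking for the path from root to loop returns {root, logical, loop} and leaves alloc out, even though BFS touched it.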
Logic Error

The condition at line 1223 adds a check for idGraph(IdMappingMode::LOOP).hasGroup(out), which may alter the behavior of terminal ID detection. This change could affect loop promotion correctness, especially if group existence was previously assumed.

    if (std::ranges::any_of(use->outputs(), [&](Val* out) -> bool {
          return idGraph(IdMappingMode::LOOP).hasGroup(out) &&
              group != idGraph(IdMappingMode::LOOP).toGroup(out);
        })) {
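The intent of the guard (only resolve a group after confirming it exists) can be shown with a toy stand-in. `Groups`, `hasGroup`, and `toGroup` below are hypothetical mock-ups of ValGraph's lookup behavior, not nvFuser code: an unguarded `toGroup` on an unregistered value fails in the same spirit as the "Val group could not be found" assert reported later in CI.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Toy stand-in for a val-to-group map where lookup of an unregistered
// value throws, mirroring the internal assert being guarded against.
class Groups {
 public:
  void add(const std::string& val, int group) { map_[val] = group; }
  bool hasGroup(const std::string& val) const { return map_.count(val) > 0; }
  int toGroup(const std::string& val) const {
    auto it = map_.find(val);
    if (it == map_.end()) {
      throw std::runtime_error("Val group could not be found: " + val);
    }
    return it->second;
  }

 private:
  std::map<std::string, int> map_;
};

// Safe comparison: values without a group simply don't count as being
// in a different group, instead of tripping the lookup assert.
bool inDifferentGroup(const Groups& g, const std::string& val, int group) {
  return g.hasGroup(val) && g.toGroup(val) != group;
}
```

The behavioral question the reviewer raises is exactly this short-circuit: an unregistered value now silently evaluates to "not a different group" rather than failing loudly.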
Performance Impact

The vectorization factor cap now depends on has_sub_byte: per the snippet it is max_vect_factor for sub-byte types and 2 otherwise. This conditional cap may change vectorization behavior for the affected cases and could impact performance.

      is_outer_broadcast_dominated = lhs_bit_multiple < rhs_bit_multiple;
    }
  }
}

params->vectorization_factor = std::min(
    has_sub_byte ? max_vect_factor : 2,
    vectorize_helper::getVectorizationFactor(
        runtime_info,
        largest_out,
        data_cache,
@jjsjann123 force-pushed the jj/wip branch 2 times, most recently from f994f4e to dcdeff5 on September 10, 2025 21:12
@jjsjann123 force-pushed the jj/layout_op_PR2_manual_kernel branch from a7f2636 to 6072fd3 on September 10, 2025 23:48
@jjsjann123 force-pushed the jj/wip branch 2 times, most recently from 29889e8 to 856f291 on September 15, 2025 16:16
Base automatically changed from jj/layout_op_PR2_manual_kernel to main on September 15, 2025 22:37
@jjsjann123 force-pushed the jj/wip branch 4 times, most recently from f6e9a8d to 303f27d on September 17, 2025 10:05
@jjsjann123 force-pushed the jj/wip branch 7 times, most recently from 80e2206 to 5e2f7d1 on October 14, 2025 08:35
@jjsjann123 force-pushed the jj/wip branch 2 times, most recently from fabd8f1 to 7ac12bb on October 16, 2025 23:27
@jjsjann123 (Collaborator Author)

!test

@jjsjann123 (Collaborator Author)

Errors seen in CI:

00:09:47 [ RUN      ] HopperMatmulTest/MLPGemmPersistentBroadcastInputs.NumWarpGroups/2
00:09:48 To reproduce: NVFUSER_TEST_RANDOM_SEED=1760717889 NVFUSER_TEST_ATEN_RANDOM_SEED=0 test_nvfuser --gtest_filter='HopperMatmulTest/MLPGemmPersistentBroadcastInputs.NumWarpGroups/2'
00:09:48 unknown file: Failure
00:09:48 C++ exception with description " INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/val_graph.cpp:82, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues.
00:09:48 Expected disjoint_set_it != disjoint_vals_.disjointSetMap().end() .
00:09:48 Val group could not be found in graph associated with: iS247{8}

00:04:40 [ RUN      ] SwizzleTest.Transpose1
00:04:40 To reproduce: NVFUSER_TEST_RANDOM_SEED=1760717581 NVFUSER_TEST_ATEN_RANDOM_SEED=0 test_nvfuser --gtest_filter='SwizzleTest.Transpose1'
00:04:40 unknown file: Failure
00:04:40 C++ exception with description " INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/val_graph.cpp:657, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues.
00:04:40 Expected found . ExprGroup not found in unique_uses_. Expr: Xor(2D): iS11{32} , iS7{32} -> iS16{32} , iS17{32}
00:04:40
00:04:40 Exception raised from validateConsistency at /opt/pytorch/nvfuser/csrc/val_graph.cpp:657 (most recent call first):
00:04:40 frame #0: nvfuser::nvfCheckFail(char const*, char const*, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfb (0x613c987d6c79 in /opt/pytorch/nvfuser/bin/test_nvfuser)
00:04:40 frame #1: nvfuser::nvfErrorFail(char const*, char const*, long, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x69 (0x613c98d42c69 in /opt/pytorch/nvfuser/bin/test_nvfuser)

@jjsjann123 (Collaborator Author)

!test

@jjsjann123 (Collaborator Author)

!test

@jjsjann123 (Collaborator Author)

!test

@jjsjann123 (Collaborator Author)

!test

@jjsjann123 (Collaborator Author)

😮 it actually worked~~ I'll polish up the PR and get it ready to be reviewed.

@jjsjann123 closed this Oct 24, 2025

@jjsjann123 (Collaborator Author) commented Oct 24, 2025

replaced with #5426


Development

Successfully merging this pull request may close these issues:

  • IdModel assert during buildLoopGraph `nvfuser::LoopPromotionMapBuilder::findPromotionOfLoopGroup`