Conversation

@jjsjann123 (Collaborator) commented Oct 23, 2025

Fixes: #5391

The issue is root-caused by unconnected IDs in the allocation domain, which trigger the loop-promotion assert on an ID not covered by the loop domain. However, since the loop domain should only check coverage of logical sizes, we shouldn't have included the allocation domain in the loop graph in the first place.

Changes in this PR:

  1. Exclude IDs that lie only on the path to the allocation domain from the LOOP and IEL graphs. This is done by modifying the TensorDomain::allIds method to exclude the allocation domain from the pairwise path traversal.
  2. Minor changes to add inlining support for the layout op.

@jjsjann123 jjsjann123 requested a review from naoyam October 23, 2025 18:34
@jjsjann123 (Collaborator, Author) commented:
!test

@github-actions commented:
Description

  • Exclude allocation domain IDs in loop graph construction

  • Enhance intersection graph with permissive mapping support

  • Add test for loop promotion issue in inference

  • Improve ID discovery for tensor domains


Changes walkthrough 📝

Relevant files

Bug fix

csrc/id_model/id_model.cpp — Add ID filtering excluding allocation domains and enhance graph initialization (+130/-9)

  • Added discoverIdsExceptAllocation to traverse tensor domains excluding allocation paths
  • Introduced findAllIdsExceptAllocation to collect non-allocation IDs from TensorView
  • Implemented initializeIdGraphExcludeAllocation for building loop graphs without allocation domain IDs
  • Modified buildIntersection to support permissive mapping and allocation domain exclusion
  • Updated buildStatefulInliningInfo to use non-allocation IDs for producer/consumer mapping

Enhancement

csrc/id_model/loop_promotion.cpp — Use permissive intersection with allocation domain exclusion in loop promotion (+13/-2)

  • Updated buildIntersection call to use permissive mapping and exclude allocation domains
  • Modified terminal ID detection to check LOOP graph group membership before comparison

csrc/id_model/id_model.h — Update IdModel interface for allocation-excluded graph initialization (+9/-1)

  • Declared initializeIdGraphExcludeAllocation method for loop graph initialization
  • Extended buildIntersection with parameters for permissive mapping and allocation domain exclusion

csrc/logical_domain_map.h — Support PreprocessGroupedMatmulInputSf in logical domain mapping (+4/-0)

  • Added handle(PreprocessGroupedMatmulInputSf*) override to support the new op in logical domain mapping

Tests

tests/cpp/test_layout_op.cpp — Add test for loop promotion in inference scenario (+65/-1)

  • Added InferenceBenchmarkLoopPromotionIssue test to validate loop promotion behavior
  • Created fusion with grouped matmul preprocessing and block scaling factors
  • Verified output correctness using validateGroupedLayout

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The function discoverIdsExceptAllocation collects IDs from multiple domains, but it may not correctly handle the shortest-path logic when building out2in via getExprsBetween: it accumulates all inputs and outputs of the path expressions without ensuring directional consistency or avoiding redundant or invalid mappings.

    void discoverIdsExceptAllocation(
        const TensorView* tv,
        VectorOfUniqueEntries<IterDomain*>& discovered_ids) {
      std::vector<const std::vector<IterDomain*>*> all_domains = {
          &tv->getLoopDomain(),
          &tv->getLogicalDomain(),
          &tv->getInitialLoopDomain(),
          &tv->domain()->additionalIDs()};
      if (tv->hasRoot()) {
        all_domains.push_back(&tv->getRootDomain());
      }
      if (tv->getAlternateLoopDomain().has_value()) {
        all_domains.push_back(&tv->getAlternateLoopDomain().value());
      }
    
      for (auto domain : all_domains) {
        discovered_ids.pushBack(*domain);
      }
    
      // We only care about IDs on the shortest path between domains
      std::unordered_multimap<IterDomain*, IterDomain*> out2in;
      for (auto i : arange(all_domains.size() - 1)) {
        if (all_domains[i]->empty()) {
          continue;
        }
        for (auto j : arange(i + 1, all_domains.size())) {
          if (all_domains[j]->empty()) {
            continue;
          }
          auto path = getExprsBetween<IRBFS>(
                          {all_domains[i]->begin(), all_domains[i]->end()},
                          {all_domains[j]->begin(), all_domains[j]->end()},
                          false)
                          .first;
          for (auto [expr, _] : path) {
            discovered_ids.pushBack(
                ir_utils::filterByType<IterDomain>(expr->outputs()));
            discovered_ids.pushBack(
                ir_utils::filterByType<IterDomain>(expr->inputs()));
          }
        }
      }
    };
Performance Concern

The repeated calls to findAllIdsExceptAllocation in buildStatefulInliningInfo may lead to redundant computation, since each call recomputes the same set of IDs for the same TensorView; caching or sharing the result could improve performance.

    auto all_producer_ids = findAllIdsExceptAllocation(producer_tv);
    auto all_consumer_ids = findAllIdsExceptAllocation(consumer_tv);
Correctness Risk

The use of idGraph(IdMappingMode::LOOP) in LoopPromotionMapBuilder::build to check group membership assumes the graph contains all relevant IDs; since the LOOP graph excludes allocation domain IDs, this may lead to incorrect filtering if some uses involve such IDs.

    return idGraph(IdMappingMode::LOOP).hasGroup(out) &&
        group != idGraph(IdMappingMode::LOOP).toGroup(out);

@jjsjann123 (Collaborator, Author) commented Oct 27, 2025
To run @crcrpar's example with nvfp4, I needed a few small fixes.

So we would want to use the Thunder branch in Lightning-AI/lightning-thunder#2691.

I used the nvFuser code in this PR, as well as the small fix for nvfp4 in #5428 (already approved; will update this comment after merge).

To run the benchmark, use the following command (run it in Thunder's directory). The program ran to completion, but I haven't gotten around to verifying the model:

    NVFUSER_DISABLE=parallel_compile NVFUSER_ENABLE="id_model(all)" python thunder/benchmarks/benchmark_inference.py --mode thunder --enable-nv-linear --enable-nvfp4 --output-length 2
    

Note: we need to add --output-length 2, otherwise the run takes forever.

Tagging @naoyam @protonu.



Development

Successfully merging this pull request may close these issues:

  • IdModel assert during buildLoopGraph: `nvfuser::LoopPromotionMapBuilder::findPromotionOfLoopGroup`
