
Conversation

@tbqh (Collaborator) commented Oct 7, 2025

Rewrite and rebase of #5121. Adds a new presegmentation pass, "fmin_fmax_promotion", which replaces min/max reductions with fmin/fmax reductions where possible. Original motivation in #319.

The new pass does dataflow analysis by attaching an enum to IterDomains. It flows these statuses downward and checks whether any corrupted "BAD" states reach a fusion output. The pass currently handles only four operator types:

  1. UnaryOp
  2. ReduceOp
  3. BroadcastOp
  4. BinaryOp

For any other operator type, or if at any point we fail to map an IterDomain through an operator, we treat the operator as a fusion output.
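To make the shape of the analysis concrete, here is a toy model of the downward pass described above. This is an illustrative sketch only: the types and names (NanStatus, OpKind, ToyExpr, promotionIsSafe) are stand-ins, not the PR's actual code, and the propagation rule is deliberately simplified.

    #include <set>
    #include <unordered_map>
    #include <vector>

    // Toy model of the downward dataflow pass (illustrative, not the PR's code).
    enum class NanStatus { None, Good, Bad };
    enum class OpKind { Unary, Reduce, Broadcast, Binary, Other };

    struct ToyExpr {
      OpKind kind;
      std::vector<int> inputs;   // ids of consumed IterDomains
      std::vector<int> outputs;  // ids of produced IterDomains
    };

    // `exprs` must be topologically sorted. `status` seeds the analysis,
    // e.g. with the target reduction's input domains marked Bad.
    bool promotionIsSafe(const std::vector<ToyExpr>& exprs,
                         const std::set<int>& fusion_output_ids,
                         std::unordered_map<int, NanStatus> status) {
      for (const ToyExpr& e : exprs) {
        bool bad_input = false;
        for (int in : e.inputs) {
          // operator[] default-constructs missing entries as None.
          bad_input = bad_input || status[in] == NanStatus::Bad;
        }
        if (e.kind == OpKind::Other) {
          // Unhandled operator type: treat it like a fusion output.
          if (bad_input) {
            return false;
          }
          continue;
        }
        // Simplified propagation rule: an output is Bad iff any input is Bad.
        for (int out : e.outputs) {
          status[out] = bad_input ? NanStatus::Bad : NanStatus::Good;
        }
      }
      // Finally, reject the promotion if any fusion output sees a Bad status.
      for (int out : fusion_output_ids) {
        if (status[out] == NanStatus::Bad) {
          return false;
        }
      }
      return true;
    }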

@github-actions bot commented Oct 7, 2025

Review updated until commit df9a676

Description

  • Promote min/max reductions to fmin/fmax when safe

  • Prevent NaN propagation issues in reduction outputs

  • Analyze dataflow using IdModel for accurate mapping

  • Add comprehensive tests for fmin/fmax promotion cases


Changes walkthrough 📝

Enhancement

csrc/preseg_passes/fmin_fmax_promotion.cpp: Implement fmin/fmax promotion with dataflow analysis (+251/-0)
  • Implement new presegmentation pass to promote min/max to fmin/fmax
  • Use NanStatus enum for tracking NaN propagation through reductions
  • Perform downward dataflow analysis to detect unsafe promotions
  • Utilize IdModel for accurate IterDomain mapping

csrc/preseg_passes/pre_segmenter.cpp: Register fmin/fmax promotion pass in pipeline (+2/-0)
  • Include new fmin_fmax_promotion.h header
  • Register FMinFMaxPromotionPass in pre-segmentation pipeline
  • Position pass after AddAxiomsPass and before MoveSplitCatPass

csrc/ir/internal_nodes.h: Add markUnsafe method for reduction ops (+9/-0)
  • Add markUnsafe() method to ReductionOp class
  • Convert BinaryOpType from Min/Max to FMin/FMax
  • Enable promotion of reduction operations in IR

csrc/preseg_passes/fmin_fmax_promotion.h: Declare fmin/fmax promotion pass interface (+41/-0)
  • Declare FMinFMaxPromotionPass class
  • Document NaN propagation behavior differences
  • Explain conditions under which promotion is safe
  • Define pass as OptimizationPass specialization

Tests

tests/cpp/test_math_opt.cpp: Add tests for fmin/fmax promotion pass (+121/-0)
  • Add FMinFMaxPromotionTest with 9 test cases
  • Test various reduction topologies and broadcast patterns
  • Verify fmax presence/absence in generated kernel code
  • Include NaN values in test tensors for validation

Configuration changes

CMakeLists.txt: Add fmin_fmax_promotion to build system (+1/-0)
  • Add fmin_fmax_promotion.cpp to NVFUSER_SRCS
  • Include new presegmentation pass in build

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Possible Issue

The function reductionDomainIsCovered assumes that expr->output(0) is a TensorView and uses it directly. However, in cases where the expression has multiple outputs or no TensorView outputs, this could lead to incorrect behavior or crashes. The check auto* out_tv = dynamic_cast<TensorView*>(expr->output(0)); should ensure that the output is valid, but there is no handling for expressions with more than one output, which may result in missed analysis or incorrect propagation of NanStatus.

    auto* out_tv = dynamic_cast<TensorView*>(expr->output(0));

    bool canBeAnalyzed = expr->isA<UnaryOp>() || expr->isA<ReductionOp>() ||
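One possible shape for the guard being suggested here, as a hypothetical sketch (handleAsOutput is a placeholder, not a function in the PR):

    // Only continue the analysis for single-output expressions whose
    // output is a TensorView; otherwise bail out conservatively.
    auto* out_tv = dynamic_cast<TensorView*>(expr->output(0));
    if (out_tv == nullptr || expr->outputs().size() != 1) {
      handleAsOutput(expr);  // conservatively treat it like a fusion output
      return;
    }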
Performance Concern

The current implementation runs a separate dataflow analysis for each reduction domain in minMaxOpIsCovered, which can be inefficient for fusions with many reduction axes. Although the comment acknowledges this, it would be beneficial to merge these analyses into a single traversal to improve performance, especially for large fusions.

    //
    // Note that this currently re-runs the traversal/dataflow analysis for every
    // single reduction domain. This could be merged into a single traversal,
    // however it would require per-domain tracking of the NanStatusMap, and it
    // would make the propagation code more complicated.
Possible Issue

The markUnsafe method modifies the reduction operation type in-place without validating that the operation is actually a Min/Max reduction, and there is no safeguard making the intended single-use contract explicit. A guard or assertion against redundant calls would make the method's contract clearer.

    void markUnsafe() {
      if (attribute<BinaryOpType>(1) == BinaryOpType::Max) {
        attribute<BinaryOpType>(1) = BinaryOpType::FMax;
      }
      if (attribute<BinaryOpType>(1) == BinaryOpType::Min) {
        attribute<BinaryOpType>(1) = BinaryOpType::FMin;
      }
    }

      return attribute<BinaryOpType>(1);
    }

    void markUnsafe() {
@tbqh (Collaborator Author):
    TODO: Jacob recommended we get rid of this function, and instead replace the entire Expr with a new one.


    // Full-size statuses
    DEFAULT,
    BAD_BROADCAST,
Collaborator:

    Still trying to understand the analysis, but wondering why we need a separate status for reduction and broadcast. Just having GOOD and BAD not enough?

Collaborator:
It is still unclear to me why there is both DEFAULT and GOOD. I also don't understand why we need a separate state for broadcasted BAD.
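For readers following this thread, the statuses under debate appear to be roughly the following. This is an illustrative reconstruction pieced together from names mentioned in the conversation; the actual declaration and semantics in the PR may differ.

    // Illustrative reconstruction of the status enum under discussion
    // (glosses are guesses from the thread, not the PR's documentation).
    enum class IterDomainStatus {
      NONE,           // default for IterDomains the analysis never mapped
      DEFAULT,        // mapped, but carrying no NaN information yet
      GOOD,           // a NaN could not have been masked on this path
      BAD,            // a min/max may have silently dropped a NaN here
      BAD_BROADCAST,  // a BAD value that has since been broadcast
    };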

    for (auto input : expr->inputs()) {
      if (auto* in_tv = dynamic_cast<TensorView*>(input)) {
        for (IterDomain* id : in_tv->getLogicalDomain()) {
          IterDomainStatus status = iterMap[id];
Collaborator:
Is iterMap guaranteed to have a mapping for id? If so, let's use at() so that we can mark iterMap as a const ref.

@tbqh (Collaborator Author):
    These have changed names but the question is still valid:

    Is the map (NanStatusMap) guaranteed to have a mapping?

    No, the mapping may not exist for every node. For example:

    TensorView* tv1 = max(in0, {0, 1});
    TensorView* tv2 = add(in0, in2);

The add node here has two inputs, but only the in0 TensorView will have a mapping during analysis. This is what the None state is for: it is the default state for unmapped TVs.

Comment on lines 177 to 178

    IterDomainStatus status = iterMap[in_id];
    auto out_id = p2c[in_id];
Collaborator:
Can you avoid using []? It's not very clear what is intended: are you assuming the index has a mapping, or are you relying on automatic addition of a new mapping?

@tbqh (Collaborator Author):
I am relying on automatic addition of a new mapping to handle unmapped expression inputs; their states will be "None".
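The language behavior being relied on here, shown in isolation with a stand-in enum (plain C++, independent of the PR): unordered_map's operator[] value-initializes the mapped value for a missing key, which for an enum class is the zero value.

    #include <cassert>
    #include <unordered_map>

    // Stand-in status enum; None must be the first enumerator so that it
    // is the zero value operator[] creates for missing keys.
    enum class NanStatus { None, Good, Bad };

    int main() {
      std::unordered_map<int, NanStatus> status_map;
      assert(status_map[42] == NanStatus::None);  // entry created on access
      assert(status_map.size() == 1);             // note: the map grew
      // By contrast, .at() on a missing key throws std::out_of_range, and
      // .contains() (C++20) checks membership without inserting.
      return 0;
    }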


    namespace nvfuser::preseg_passes {

    // IterDomainStatus are attached to IterDomains and propagated with a
Collaborator:
I'm actually not sure iter domains are the right granularity for the analysis. If one iter domain has a bad status, its tensor should be considered bad as well. Also, reductions remove iter domains, so "bad" iter domains would just disappear from the fusion. It seems to me tensors are the right level for this analysis. What do you think?

@jacobhinkle (Collaborator) left a comment:
    More specific comments below. I think you should focus on explaining the algorithm and really thinking about what state is needed. I agree with @naoyam that it seems like only "good" and "bad" states are needed. Also, why not have an initialization step where all IDs of fusion inputs are marked GOOD instead of NONE?

      expectFMax = true;
    }

    if (testIndex == 3) {
Collaborator:
Tip: in cases like this I typically create a new class for FMinFMaxPromotionTest instead of an alias. In it I would implement SetUp() and TearDown(), which would hold everything from the current test other than the if (testIndex == *) parts. That lets you give each test a descriptive name directly, even without parametrization (i.e., you can then use TEST_F instead of TEST_P, unless you have further parametrization to do).
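A minimal sketch of the suggested fixture structure (test names and the fusion member are illustrative; NVFuserTest is assumed to be the repo's usual gtest base class):

    // Shared setup lives in the fixture; each case becomes its own TEST_F.
    class FMinFMaxPromotionTest : public NVFuserTest {
     protected:
      void SetUp() override {
        NVFuserTest::SetUp();
        fusion_ = std::make_unique<Fusion>();
        // ... shared input construction from the current test body ...
      }
      std::unique_ptr<Fusion> fusion_;
    };

    // Each former `if (testIndex == *)` branch gets a descriptive name:
    TEST_F(FMinFMaxPromotionTest, ReductionThenBroadcastAdd) {
      // body previously guarded by one testIndex branch
    }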

Comment on lines 117 to 121

    // Once we identify a target reduction, we perform a downward pass starting from
    // the target's direct input. The pass propagates IterDomainStatus information.
    // At the end, we check all output TV's for bad statuses. If at any point we
    // encounter a node we don't know how to propagate information through, we treat
    // it like a graph output and fail if it has any incoming bad statuses.
Collaborator:
In this comment, it would be very instructive to add a couple of complete examples where you show a fusion and trace down through it, showing how the ID statuses propagate from a given max/min reduction.

    bool expectFMax = false;

    if (testIndex == 1) {
      TensorView* tv3 = add(max(tv0, {0, 1}), tv0);
Collaborator:
    It is often clearer to put one operation per line. This lets you mark the axes for each tensor in the fusion. For example, in this case you would have something like

        TensorView* tv3 = max(tv0, {0, 1}); // [ rS5{i0}, rS6{i1} ]
        // Note: the implicit broadcast tv4 here is not shown in your current code
        TensorView* tv4 = broadcast(tv3, {true, true});  // [ bS7{1}, bS8{1} ]
        TensorView* tv5 = add(tv4, tv0);  // [ iS9{i0}, iS10{i1} ]
        TensorView* tv6 = sum(tv5, {0, 1});  // [ rS11{i0}, rS12{i1} ]
        // NOTE: tv7 below is not shown currently either
        TensorView* tv7 = broadcast(tv6, {true, true});  // [ bS13{i0}, bS14{i1} ]
        TensorView* tv8 = add(tv5, tv7);  // [ iS15{i0}, iS16{i1} ]
        fusion->addOutput(tv8);

Collaborator:
Also, a short comment can help indicate what we're testing in each case.

@tbqh (Collaborator Author) commented Oct 19, 2025

Pushed a new algorithm which should handle a lot of the issues with reductions/broadcasts not being supported. The new algorithm focuses on a single source IterDomain at a time and propagates information along TensorViews. This solves the issues that arise when tracking IterDomains through reductions and broadcasts.

    The current code is messy and needs to be cleaned up.

There is one unsolved issue: handling sibling rewrites. If we promote one fmax somewhere in the fusion, right now it can break other fmaxes. I thought this could be solved by doing rewrites in reverse topological order; however, that does not cover the case of sibling expressions. This is exercised by test case #8, which is currently the only failing test case.

Comment on lines 129 to 130

    if (valMap[expr->input(0)->as<TensorView>()] == ValStatus::DEFAULT ||
        valMap[expr->input(0)->as<TensorView>()] == ValStatus::BAD_DEFAULT) {
Collaborator:
Nit, suggested change:

    auto* it = valMap.find(expr->input(0)->as<TensorView>());
    if (it == valMap.end() || it->second == ValStatus::DEFAULT ||
        it->second == ValStatus::BAD_DEFAULT) {

If we expect the input to always be found in valMap, then I'd do this instead:

    ValStatus in_status = valMap.at(expr->input(0)->as<TensorView>());
    if (in_status == ValStatus::DEFAULT || in_status == ValStatus::BAD_DEFAULT) {

@tbqh (Collaborator Author):
We don't expect there to always be a value in the map; we rely on the default value being "None".

This seems to be a very common comment on this PR; I guess we usually do not use the default value with unordered_map. Let me know if you want me to explicitly check whether a mapping exists (e.g. with .contains()). It's a lot more verbose to do so, though.

tbqh added 3 commits October 26, 2025 21:54

  • Function names start with lowercase letters
  • Use snake_case instead of camelCase
  • Add anonymous namespace to file-scoped things