Skip to content

Conversation

MondayCha
Copy link
Member

@MondayCha MondayCha commented Sep 12, 2025

What type of PR is this?

Network Topology Aware Plugin Enhancement.

What this PR does / why we need it:

The current Network Topology Aware plugin optimizes RDMA network topology (e.g., Fat Tree) for distributed jobs by scheduling pods to topologically close nodes. While Volcano already supports intra-job HyperNode binpacking (packing tasks of a single job into compact HyperNodes), it lacks HyperNode-Level Resource Usage Aware Binpacking across hierarchical units (e.g., Leaf, and Spine switches).

This causes suboptimal scheduling: Small jobs scattered across HyperNodes prevent placement of future large jobs.

Which issue(s) this PR fixes:

Fixes #4368

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Configure the following parameters to ensure proper network topology-aware scheduling:

# volcano-scheduler deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: volcano-scheduler
spec:
  template:
    spec:
      containers:
      - name: scheduler
        image: volcano-scheduler:latest
        command:
        - /vc-scheduler
        - --percentage-nodes-to-find=100  # Critical: must evaluate all nodes to guarantee optimal topology

Enable and optimize the network topology plugin in the scheduler configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-config
data:
  volcano-scheduler.conf: |
    actions: "allocate, backfill"
    tiers:
    - plugins:
      - name: network-topology-aware  # Enable network topology plugin
        arguments:
          weight: 20                  # Plugin weight (impacts scheduling decisions)
          binpack.cpu: 1              # CPU resource weight
          binpack.memory: 2           # Memory resource weight
          binpack.resources: nvidia.com/gpu  # Custom resources
          binpack.resources.nvidia.com/gpu: 10  # GPU weight (higher)

Through the above configurations and usage guidelines, users can fully leverage Volcano's network topology-aware scheduling capabilities to optimize the performance and resource utilization of distributed workloads in complex network environments.

@Copilot Copilot AI review requested due to automatic review settings September 12, 2025 15:10
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign monokaix
You can assign the PR to them by writing /assign @monokaix in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 12, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @MondayCha, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Network Topology Aware plugin in the Volcano scheduler. It introduces a sophisticated binpacking mechanism designed to improve resource utilization and job placement by considering hierarchical network topologies. The goal is to consolidate smaller jobs and prevent resource fragmentation, thereby optimizing the environment for the efficient scheduling of future large jobs. This enhancement addresses the challenge of suboptimal scheduling where scattered small jobs hinder the placement of larger workloads.

Highlights

  • Enhanced Binpacking Logic: Implements a new binpacking algorithm within the Network Topology Aware plugin to optimize resource utilization across hierarchical network topologies (e.g., Fat Tree).
  • Configurable Binpacking Parameters: Introduces new arguments for the network-topology-aware plugin, allowing users to enable/disable binpacking, set a proportion for combining binpack and topology scores, and define weights for CPU, Memory, and custom resources (e.g., nvidia.com/gpu).
  • Improved HyperNode Lookup Efficiency: Adds MinHyperNodeOfNodeMap to ClusterInfo and Session to provide a direct mapping from node names to their lowest-tier hypernode, streamlining lookups during scheduling.
  • Hierarchical Capacity Initialization: Introduces logic to calculate and store total resource capacities for each hypernode in the hierarchy, crucial for binpacking decisions.
  • Comprehensive Scoring Mechanism: The batchNodeScoringFn now combines network topology awareness with binpacking scores, considering resource consolidation, capacity, and tier-based weighting for optimal node selection.
  • New Test and Benchmark Coverage: Adds dedicated files for unit tests and benchmarks for the new binpacking logic and related functions, ensuring correctness and performance.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the network topology aware plugin to support binpack functionality, optimizing resource placement across hierarchical network topologies. The enhancement introduces HyperNode-level resource usage aware binpacking to prevent resource fragmentation and improve scheduling efficiency for distributed jobs.

Key changes:

  • Added comprehensive binpack configuration support with weighted resource scoring
  • Implemented hierarchical scoring algorithm that combines topology awareness with resource consolidation
  • Enhanced session management to include precomputed node-to-hypernode mappings for performance

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
network_topology_aware.go Main plugin implementation with binpack integration and enhanced scoring functions
network_topology_aware_binpack.go Core binpack logic including hierarchical scoring and resource utilization calculations
network_topology_aware_binpack_test.go Comprehensive test suite for binpack configuration and scoring functions
network_topology_aware_bench_test.go Performance benchmarks for scoring algorithms
session.go Added MinHyperNodeOfNodeMap field for efficient node-to-hypernode lookups
cache.go Integration of MinHyperNodeOfNodeMap into cache snapshot
hyper_node_info.go Implementation of MinHyperNodeOfNodeMap generation
cluster_info.go Added MinHyperNodeOfNodeMap field to cluster info structure
scheduler.yaml Added scheduler configuration option for percentage-nodes-to-find

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the network topology aware plugin by introducing a binpacking scheduling strategy to optimize resource utilization across hierarchical network units. The changes are extensive, adding new configuration options, scoring logic, and helper functions. The overall approach of pre-calculating hierarchy capacities and tier weights for performance is sound. However, I've identified a critical bug in argument parsing that could prevent the plugin from starting, along with some opportunities for optimization and minor issues in the new tests. Addressing these points will make this a solid and valuable addition.

@MondayCha MondayCha force-pushed the network-topology-binpack branch 4 times, most recently from d0d5b91 to e2e6f14 Compare September 12, 2025 15:38
@MondayCha
Copy link
Member Author

/assign @Monokaix

@MondayCha MondayCha marked this pull request as draft September 18, 2025 07:36
@MondayCha MondayCha force-pushed the network-topology-binpack branch from e2e6f14 to 7be50c5 Compare September 18, 2025 14:21
@volcano-sh-bot volcano-sh-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 18, 2025
@MondayCha MondayCha marked this pull request as ready for review September 18, 2025 14:22
@MondayCha MondayCha force-pushed the network-topology-binpack branch from 7be50c5 to 3195fba Compare September 18, 2025 14:22
@MondayCha MondayCha requested a review from Copilot September 18, 2025 14:25
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

pkg/scheduler/plugins/network-topology-aware/network_topology_aware.go:1

  • The length calculation logic is incorrect. When BinPackingResources is empty, you increment length, but when it's not empty, you add extendLength (which is len(w.BinPackingResources)). This inconsistent logic will cause incorrect slice allocation.
/*

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@MondayCha
Copy link
Member Author

/assign @JesseStutler

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Enhance Network Topology Aware Scheduling with HyperNode-Level Resource Usage Aware Binpacking

4 participants