-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feat: enhance network topology binpack #4612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @MondayCha, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the Network Topology Aware plugin in the Volcano scheduler. It introduces a sophisticated binpacking mechanism designed to improve resource utilization and job placement by considering hierarchical network topologies. The goal is to consolidate smaller jobs and prevent resource fragmentation, thereby optimizing the environment for the efficient scheduling of future large jobs. This enhancement addresses the challenge of suboptimal scheduling where scattered small jobs hinder the placement of larger workloads.
Highlights
- Enhanced Binpacking Logic: Implements a new binpacking algorithm within the Network Topology Aware plugin to optimize resource utilization across hierarchical network topologies (e.g., Fat Tree).
- Configurable Binpacking Parameters: Introduces new arguments for the network-topology-aware plugin, allowing users to enable/disable binpacking, set a proportion for combining binpack and topology scores, and define weights for CPU, Memory, and custom resources (e.g., nvidia.com/gpu).
- Improved HyperNode Lookup Efficiency: Adds MinHyperNodeOfNodeMap to ClusterInfo and Session to provide a direct mapping from node names to their lowest-tier hypernode, streamlining lookups during scheduling.
- Hierarchical Capacity Initialization: Introduces logic to calculate and store total resource capacities for each hypernode in the hierarchy, crucial for binpacking decisions.
- Comprehensive Scoring Mechanism: The batchNodeScoringFn now combines network topology awareness with binpacking scores, considering resource consolidation, capacity, and tier-based weighting for optimal node selection.
- New Test and Benchmark Coverage: Adds dedicated files for unit tests and benchmarks for the new binpacking logic and related functions, ensuring correctness and performance.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the network topology aware plugin to support binpack functionality, optimizing resource placement across hierarchical network topologies. The enhancement introduces HyperNode-level resource usage aware binpacking to prevent resource fragmentation and improve scheduling efficiency for distributed jobs.
Key changes:
- Added comprehensive binpack configuration support with weighted resource scoring
- Implemented hierarchical scoring algorithm that combines topology awareness with resource consolidation
- Enhanced session management to include precomputed node-to-hypernode mappings for performance
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
network_topology_aware.go | Main plugin implementation with binpack integration and enhanced scoring functions |
network_topology_aware_binpack.go | Core binpack logic including hierarchical scoring and resource utilization calculations |
network_topology_aware_binpack_test.go | Comprehensive test suite for binpack configuration and scoring functions |
network_topology_aware_bench_test.go | Performance benchmarks for scoring algorithms |
session.go | Added MinHyperNodeOfNodeMap field for efficient node-to-hypernode lookups |
cache.go | Integration of MinHyperNodeOfNodeMap into cache snapshot |
hyper_node_info.go | Implementation of MinHyperNodeOfNodeMap generation |
cluster_info.go | Added MinHyperNodeOfNodeMap field to cluster info structure |
scheduler.yaml | Added scheduler configuration option for percentage-nodes-to-find |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
pkg/scheduler/plugins/network-topology-aware/network_topology_aware_binpack.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request enhances the network topology aware plugin by introducing a binpacking scheduling strategy to optimize resource utilization across hierarchical network units. The changes are extensive, adding new configuration options, scoring logic, and helper functions. The overall approach of pre-calculating hierarchy capacities and tier weights for performance is sound. However, I've identified a critical bug in argument parsing that could prevent the plugin from starting, along with some opportunities for optimization and minor issues in the new tests. Addressing these points will make this a solid and valuable addition.
pkg/scheduler/plugins/network-topology-aware/network_topology_aware_binpack.go
Outdated
Show resolved
Hide resolved
pkg/scheduler/plugins/network-topology-aware/network_topology_aware_binpack.go
Outdated
Show resolved
Hide resolved
pkg/scheduler/plugins/network-topology-aware/network_topology_aware_binpack_test.go
Outdated
Show resolved
Hide resolved
pkg/scheduler/plugins/network-topology-aware/network_topology_aware_binpack_test.go
Outdated
Show resolved
Hide resolved
d0d5b91
to
e2e6f14
Compare
/assign @Monokaix |
e2e6f14
to
7be50c5
Compare
Signed-off-by: MondayCha <[email protected]>
7be50c5
to
3195fba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
pkg/scheduler/plugins/network-topology-aware/network_topology_aware.go:1
- The length calculation logic is incorrect. When BinPackingResources is empty, you increment length, but when it's not empty, you add extendLength (which is len(w.BinPackingResources)). This inconsistent logic will cause incorrect slice allocation.
/*
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
/assign @JesseStutler |
What type of PR is this?
Network Topology Aware Plugin Enhancement.
What this PR does / why we need it:
The current Network Topology Aware plugin optimizes RDMA network topology (e.g., Fat Tree) for distributed jobs by scheduling pods to topologically close nodes. While Volcano already supports intra-job HyperNode binpacking (packing tasks of a single job into compact HyperNodes), it lacks HyperNode-Level Resource Usage Aware Binpacking across hierarchical units (e.g., Leaf, and Spine switches).
This causes suboptimal scheduling: Small jobs scattered across HyperNodes prevent placement of future large jobs.
Which issue(s) this PR fixes:
Fixes #4368
Special notes for your reviewer:
Does this PR introduce a user-facing change?