- 
                Notifications
    You must be signed in to change notification settings 
- Fork 403
Fix: After removing the device plugin from the gpu node, it can still be scheduled to this node #1430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| [APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: luohua13 The full list of commands accepted by this bot can be found here. 
Needs approval from an approver in each of these files:
 Approvers can indicate their approval by writing  | 
| Summary of ChangesHello @luohua13, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a bug where GPU nodes could still be scheduled for device-dependent workloads even after their associated device plugin was removed. The solution involves refining the device plugin's health check mechanism by standardizing the  Highlights
 Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either  
 Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a  Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix a bug where nodes remain schedulable after a device plugin has been removed. The approach involves changing the handshake mechanism: the device plugin now periodically reports a 'Requesting' status with a timestamp. If this timestamp expires, the node is marked as unhealthy.
My review identifies a critical vulnerability in the CheckHealth function that could lead to a panic, and also points out that the original bug (node being schedulable with a missing handshake annotation) is not fully resolved. I've provided a code suggestion to make the function more robust by handling malformed or missing annotations correctly. I also recommend enhancing the test suite to cover these critical edge cases.
… be scheduled to this node Signed-off-by: luohua13 <[email protected]>
34c74b6    to
    73b4954      
    Compare
  
    | please pass the CI check | 
Signed-off-by: luohua13 <[email protected]>
* refactor: simplify configuration validation logic Signed-off-by: yxxhero <[email protected]> * refactor: streamline configuration validation logic Signed-off-by: yxxhero <[email protected]> --------- Signed-off-by: yxxhero <[email protected]>
…roject-HAMi#1441) * feat: add GeneratePodNamespaceName function and corresponding tests Signed-off-by: yxxhero <[email protected]> * Update pkg/util/nodelock/nodelock.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: yxxhero <[email protected]> * fix lint Signed-off-by: yxxhero <[email protected]> * fix tests Signed-off-by: yxxhero <[email protected]> --------- Signed-off-by: yxxhero <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
) Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 5. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@v4...v5) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-version: '5' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…#1443) Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 5 to 6. - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](actions/download-artifact@v5...v6) --- updated-dependencies: - dependency-name: actions/download-artifact dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ct-HAMi#1431) Signed-off-by: luohua13 <[email protected]>
Signed-off-by: Lei Guo <[email protected]> Co-authored-by: Lei Guo <[email protected]>
1. delete duplicate event type definition, use common package 2. support vendor device report custom event Signed-off-by: Lei Guo <[email protected]> Co-authored-by: Lei Guo <[email protected]>
What type of PR is this?
/kind bug
What this PR does / why we need it:
CheckHealthis always returntrue, truewhen removing the device plugin from the gpu node.hami.io/node-handshakekey on the node annotations. The correct way is for the device plugin to update regularly and the scheduler component to check the status.