
Conversation

@panpan0000 (Contributor) commented May 23, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it

Say we have two LWS instances on the same nodes:
LWS-1 requires CPU
LWS-2 requires GPU
They share the same topology, and the exclusive policy should apply inside LWS-1 and LWS-2 respectively.

But currently, if LWS-1 occupies the nodes, LWS-2 stays pending.

Which issue(s) this PR fixes

Fixes #539

Special notes for your reviewer

Does this PR introduce a user-facing change?


@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 23, 2025
@k8s-ci-robot k8s-ci-robot requested a review from ardaguclu May 23, 2025 07:52
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: panpan0000
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify bot commented May 23, 2025

Deploy Preview for kubernetes-sigs-lws canceled.

🔨 Latest commit: cbb16cf
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-lws/deploys/68302ba13b3c90000813f058

@k8s-ci-robot k8s-ci-robot requested a review from kerthcet May 23, 2025 07:52
@k8s-ci-robot (Contributor):

Hi @panpan0000. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 23, 2025
@yankay (Member) commented May 26, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 26, 2025
@Edwinhr716 (Contributor):

/hold
Does this work if LWS "A" and LWS "B" have different names? We inject the pod anti-affinity based on not having the same GroupKey label, where the GroupKey depends on the pod name, and the pod name changes based on the LWS name.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2025
@panpan0000 (Contributor, Author) commented May 30, 2025

/hold Does this work if LWS "A" and LWS "B" have different names? We inject the pod anti-affinity based on not having the same GroupKey label, where the GroupKey depends on the pod name, and the pod name changes based on the LWS name.

@Edwinhr716 Thank you for your review. I've written out the explanation carefully below:

Sample use case

Assume we have 2 nodes and 1 zone.

Node     Topology (topology.kubernetes.io/zone)
node1    x
node2    x

Assume LWS-A requests CPU only, and LWS-B requests GPU.

LWS-A as below

metadata:
  annotations:
    leaderworkerset.sigs.k8s.io/exclusive-topology: topology.kubernetes.io/zone
spec:
  replicas: 3
  leaderWorkerTemplate:
    size: 2
    workerTemplate:
      ...

LWS-B as below

metadata:
  annotations:
    leaderworkerset.sigs.k8s.io/exclusive-topology: topology.kubernetes.io/zone
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 2
    workerTemplate:
      ...

Current Situation

LWS Set   Group   Pod   Schedule Expectation
A         0       0     on node1
A         0       1     follows its leader to node1
A         1       0     should not be on node1, so goes to node2
A         1       1     follows its leader to node2
A         2       0     Pending
A         2       1     Not Created
B         0       0     Pending
B         0       1     Not Created
B         1       0     Not Created
B         1       1     Not Created

Expectation:

LWS Set   Group   Pod   Schedule Expectation
A         0       0     on node1
A         0       1     follows its leader to node1
A         1       0     should not be on node1, so goes to node2
A         1       1     follows its leader to node2
A         2       0     Pending
A         2       1     Not Created
B         0       0     on node1
B         0       1     follows its leader to node1
B         1       0     should not be on node1, so goes to node2
B         1       1     follows its leader to node2

Conclusion:

For each leader, to stay away from the other group leaders that belong to the same LWS instance,
the anti-affinity condition should be:
= (Same Namespace) AND (Same LWS Set name) AND (Different GroupID)

The previous condition is
= (Who has group-key label) AND (Whose group key varies)
= (Who has group-key label) AND [(namespace varies) OR (Leader pod name varies)]
= (Who has group-key label) AND [(namespace varies) OR (LWS set name varies) OR (group id varies)]

reference: groupUniqueKey = genGroupUniqueKey(pod.Namespace, pod.Name)

So with the previous condition, if any LWS pod with a different LWS set name already sits in zone-x, later LWS pods will not be scheduled to zone-x.
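To make this concrete, here is a small runnable sketch (not the controller code; the leaderworkerset.sigs.k8s.io/group-key label name and the hash values are illustrative assumptions) that evaluates the previous-style selector against a leader from another group and a leader from another LWS:

    package main

    import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/labels"
    )

    func main() {
        // Previous condition, roughly: "has a group-key label" AND "its group key differs from mine".
        // The label key and hash values below are illustrative assumptions, not taken from the repo.
        const groupKeyLabel = "leaderworkerset.sigs.k8s.io/group-key"
        myGroupKey := "hash-of-default-lws-a-0" // stands for genGroupUniqueKey(ns, leader name) of LWS-A group 0

        oldSelector := &metav1.LabelSelector{
            MatchExpressions: []metav1.LabelSelectorRequirement{
                {Key: groupKeyLabel, Operator: metav1.LabelSelectorOpExists},
                {Key: groupKeyLabel, Operator: metav1.LabelSelectorOpNotIn, Values: []string{myGroupKey}},
            },
        }
        sel, err := metav1.LabelSelectorAsSelector(oldSelector)
        if err != nil {
            panic(err)
        }

        // A leader pod of a *different* LWS (LWS-B, group 0) in the same zone.
        otherLWSLeader := labels.Set{groupKeyLabel: "hash-of-default-lws-b-0"}
        // A leader pod of the *same* LWS but a different group (LWS-A, group 1).
        sameLWSOtherGroup := labels.Set{groupKeyLabel: "hash-of-default-lws-a-1"}

        fmt.Println("repels LWS-B leader:", sel.Matches(otherLWSLeader))     // true -> blocks the other LWS too
        fmt.Println("repels LWS-A group 1:", sel.Matches(sameLWSOtherGroup)) // true -> the intended behavior
    }

Because the selector only looks at the group-key value, the LWS-B leader matches it as well, so the anti-affinity rule keeps LWS-B out of the zone even though the two LWS instances never compete for resources.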

My current PR's anti-affinity condition is:
= (Same Namespace) AND (Same LWS Set Name ) [ (Diff namespace) OR (Diff LWS Set name) OR (Diff GroupID)]
= (Same Namespace) AND (Same LWS Set Name ) [ OR (Diff GroupID)]

(NOTE: the logic calculation will cancel out (Diff namespace) OR (Diff LWS Set name) )


Currently my PR still retains and uses the code below for the condition in the "[ ]" brackets:

				{
					Key:      podAffinityKey,
					Operator: metav1.LabelSelectorOpNotIn,
					Values:   []string{groupUniqueKey},
				},

But I think changing it to the code below would be clearer. What do you think?

				{
					Key:      leaderworkerset.GroupIndexLabelKey,
					Operator: metav1.LabelSelectorOpNotIn,
					Values:   []string{pod.Labels[leaderworkerset.GroupIndexLabelKey]},
				},
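For reference, here is a self-contained sketch of the whole anti-affinity term this is meant to express, i.e. (Same LWS name) AND (Different group). The label key strings are written out literally; this is the intended shape, not necessarily the exact code in this PR:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // exclusiveAntiAffinityTerm sketches the term this PR aims for:
    // repel leaders of *other* groups of the *same* LWS within the topology domain.
    // lwsName and groupIndex would come from the leader pod's own labels.
    func exclusiveAntiAffinityTerm(lwsName, groupIndex, topologyKey string) corev1.PodAffinityTerm {
        return corev1.PodAffinityTerm{
            TopologyKey: topologyKey, // e.g. topology.kubernetes.io/zone
            LabelSelector: &metav1.LabelSelector{
                MatchExpressions: []metav1.LabelSelectorRequirement{
                    { // restrict to the same LWS instance
                        Key:      "leaderworkerset.sigs.k8s.io/name",
                        Operator: metav1.LabelSelectorOpIn,
                        Values:   []string{lwsName},
                    },
                    { // any group of that LWS ...
                        Key:      "leaderworkerset.sigs.k8s.io/group-index",
                        Operator: metav1.LabelSelectorOpExists,
                    },
                    { // ... except my own group
                        Key:      "leaderworkerset.sigs.k8s.io/group-index",
                        Operator: metav1.LabelSelectorOpNotIn,
                        Values:   []string{groupIndex},
                    },
                },
            },
        }
    }

    func main() {
        term := exclusiveAntiAffinityTerm("lws-a", "0", "topology.kubernetes.io/zone")
        fmt.Printf("%+v\n", term)
    }

Note that the namespace restriction comes for free: pod anti-affinity only considers the pod's own namespace unless namespaces is explicitly set on the term.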

@Edwinhr716 (Contributor):

= (Same Namespace) AND (Same LWS Set Name ) [ OR (Diff GroupID)]

How are you setting the OR here? If a pod has multiple pod anti-affinity expressions, won't they be treated as AND?

@panpan0000 (Contributor, Author) commented Jun 3, 2025

= (Same Namespace) AND (Same LWS Set Name ) [ OR (Diff GroupID)]

How are you setting the OR here? If a pod has multiple pod anti-affinity expressions, won't they be treated as AND?

Oops, you found a typo, @Edwinhr716.

I tried to keep the original GroupIDKey code (the LabelSelectorOpNotIn on the group key) to keep the changes minimal:

so my PR just adds a LabelSelectorOpIn on 'lwsName',
to achieve the intended final effect of (Same Namespace) AND (Same LWS Set name) AND (Different GroupID).

Let me explain the PR's result again (below is a boolean-logic derivation):

Anti-Affinity
= (Same LWS Set name) AND (Different GroupID-Key)         <--- current PR code

// because GroupID-Key = NS + LWSName + GroupID, the derivation goes:

= (Same LWS Set name) AND [(Diff namespace) OR (Diff LWS Set name) OR (Diff GroupID)]

To make it shorter, let A = (Same LWS Set name), B = (Diff namespace), C = (Diff GroupID), and note that (Diff LWS Set name) = !A:

A AND (B OR !A OR C) 
= (A AND B) OR (A AND !A) OR (A AND C)
= (A AND B) OR   False    OR (A AND C)
= (A AND B) OR (A AND C)

So the result is

= [ (Same LWS Set name) AND (Diff namespace) ] OR   [ (Same LWS Set name) AND (Diff GroupID) ]
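As a sanity check on this simplification, here is a tiny brute-force truth table (plain Go, nothing LWS-specific) that compares both sides for every assignment:

    package main

    import "fmt"

    func main() {
        // Check A && (B || !A || C) == (A && B) || (A && C) for every truth assignment.
        for _, a := range []bool{false, true} {
            for _, b := range []bool{false, true} {
                for _, c := range []bool{false, true} {
                    lhs := a && (b || !a || c)
                    rhs := (a && b) || (a && c)
                    fmt.Printf("A=%-5v B=%-5v C=%-5v  lhs=%-5v rhs=%-5v equal=%v\n", a, b, c, lhs, rhs, lhs == rhs)
                }
            }
        }
    }

All eight rows print equal=true.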

Aha... with the GroupIDKey, everything becomes quite obscure T_T

That's why I asked for your expert opinion: can we remove the GroupIDKey and just plainly use the LWSName & GroupIndex as the condition?

But I think changing it to the code below would be clearer. What do you think?

  		{
  			Key:      leaderworkerset.GroupIndexLabelKey,
  			Operator: metav1.LabelSelectorOpNotIn,
  			Values:   []string{pod.Labels[leaderworkerset.GroupIndexLabelKey]},
  		},

@panpan0000 (Contributor, Author) commented Jun 3, 2025

= (A AND B) OR (A AND C)

So the result is

= [ (Same LWS Set name) AND (Diff namespace) ] OR   [ (Same LWS Set name) AND (Diff GroupID) ]

Aha... with the GroupIDKey, everything becomes quite obscure T_T

In short

The code logic can work, but it is very obscure.

If we abandon the GroupIDKey, it becomes much easier to understand, as below. What do you think about abandoning the GroupKey and changing the code like this? If yes, I could rework this PR.

    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: leaderworkerset.sigs.k8s.io/name
            operator: In
            values:
            - $LWS_NAME
          - key: leaderworkerset.sigs.k8s.io/group-index
            operator: Exists
          - key: leaderworkerset.sigs.k8s.io/group-index       <-------- change from GroupKEY to Group ID
            operator: NotIn
            values:
            - $MY_GROUP_ID
        namespaces:                 <-------- add namespace restriction 
        - $MY_NAMESPACE
        topologyKey: $MY_TOPOLOGY_KEY              <-------- the annotation value, e.g. topology.kubernetes.io/zone
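And here is a runnable check of this proposed selector (label keys taken from the YAML above; the pod labels are made-up examples), showing that it repels sibling groups of the same LWS but ignores other LWS instances:

    package main

    import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/labels"
    )

    func main() {
        // Proposed selector for a leader of LWS "lws-a", group 0, following the YAML above.
        proposed := &metav1.LabelSelector{
            MatchExpressions: []metav1.LabelSelectorRequirement{
                {Key: "leaderworkerset.sigs.k8s.io/name", Operator: metav1.LabelSelectorOpIn, Values: []string{"lws-a"}},
                {Key: "leaderworkerset.sigs.k8s.io/group-index", Operator: metav1.LabelSelectorOpExists},
                {Key: "leaderworkerset.sigs.k8s.io/group-index", Operator: metav1.LabelSelectorOpNotIn, Values: []string{"0"}},
            },
        }
        sel, err := metav1.LabelSelectorAsSelector(proposed)
        if err != nil {
            panic(err)
        }

        lwsAGroup1 := labels.Set{
            "leaderworkerset.sigs.k8s.io/name":        "lws-a",
            "leaderworkerset.sigs.k8s.io/group-index": "1",
        }
        lwsBGroup0 := labels.Set{
            "leaderworkerset.sigs.k8s.io/name":        "lws-b",
            "leaderworkerset.sigs.k8s.io/group-index": "0",
        }

        fmt.Println("repels lws-a group 1:", sel.Matches(lwsAGroup1)) // true  -> exclusive within the same LWS
        fmt.Println("repels lws-b group 0:", sel.Matches(lwsBGroup0)) // false -> another LWS can co-locate
    }

The namespaces restriction lives on the PodAffinityTerm itself rather than in the label selector, so it is not part of this check.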

If you agree, I have another PR for this second fix approach:
#551

@Edwinhr716 (Contributor):

Thanks for the thorough explanation; it now makes sense to me what happens if they have different group IDs.

One last question: what happens if LWS A and LWS B both request the same type of resource? Exclusive placement won't be respected in that case, correct?

@panpan0000 (Contributor, Author):

Thanks for the thorough explanation; it now makes sense to me what happens if they have different group IDs.

One last question: what happens if LWS A and LWS B both request the same type of resource? Exclusive placement won't be respected in that case, correct?

Good point, @Edwinhr716.

Let's assume LWS A and B both request GPUs, and we have 4 nodes with 8 GPUs each (32 GPUs total); a per-pod resource sketch follows the two cases below.

Case #1 (no conflict)

  • LWS A: size 4, each leader & worker requests 4 GPUs. LWS B is the same.
  • Then LWS A and B will be co-located on the 4 nodes; A uses 50% of the GPUs and B uses the other 50%.

Case #2 (with conflict)

  • LWS A: size 4, each leader & worker requests 8 GPUs. LWS B is the same.
  • If A comes first, A will occupy all 32 GPUs and B will not be scheduled. If some pod in A restarts, B's pods may preempt it (this requires gang scheduling, maybe with the help of Kueue?).
  • If A and B come at the same time, it also requires gang scheduling (already on the LWS roadmap); otherwise, some pods in A or B will stay pending and cause a deadlock.

So, do you think this is acceptable?

@Edwinhr716 (Contributor):

So in case #2, we are no longer guaranteeing exclusive placement, even if the flag is set.

@panpan0000 (Contributor, Author) commented Jun 4, 2025

So in case #2, we are no longer guaranteeing exclusive placement, even if the flag is set.

Hmm... yes. Do you think there is any way to satisfy both case 1 and case 2? Or should case 2 wait until gang scheduling is implemented?

@Edwinhr716 (Contributor):

I don't think we should break existing functionality for the purpose of adding a new one. @kerthcet @ahg-g @ardaguclu @yankay thoughts?

@panpan0000 (Contributor, Author):

@Edwinhr716 Or should we add another policy field as a flag?

@ahg-g (Contributor) commented Jun 10, 2025

We can do that with a parameter, perhaps called exclusive-topology-scope, that takes values like the following (rough Go sketch after the list):

  • "Global": across all LWS instances
  • "WithinSameLWS": within a single LWS instance

@Edwinhr716 (Contributor):

Adding a new field sounds good to me, though we should add a disclaimer that using the feature can cause a deadlock

@kerthcet (Contributor):

I was on holiday before; quoting my comment from #539 (comment) here:

However, I think `You will see B will be pending due to A already live in those nodes (but B may use diff resource than A, they should not compete and A should not stop B from running )` 
exactly reflects the meaning of exclusive. And I don't think we should change the semantic.

I'm not objecting to the idea, but I would defer this until we see other requests.

though we should add a disclaimer that using the feature can cause a deadlock

What do you mean by deadlock? If so, I would say we should avoid it.

@Edwinhr716 (Contributor):

What do you mean by deadlock? If so, I would say we should avoid it.

See

Case #2 (with conflict)
LWS A: size 4, each leader & worker requests 8 GPUs. LWS B is the same.
If A comes first, A will occupy all 32 GPUs and B will not be scheduled. If some pod in A restarts, B's pods may preempt it (this requires gang scheduling, maybe with the help of Kueue?).
If A and B come at the same time, it also requires gang scheduling (already on the LWS roadmap); otherwise, some pods in A or B will stay pending and cause a deadlock.

@kerthcet (Contributor):

Got it. We should defer this at least until gang scheduling is supported.

@panpan0000 (Contributor, Author) commented Jun 16, 2025

#539 (comment)

Copying my reply from #539 here:

I would say the existing implementation combines topology awareness and exclusiveness together and kind of mixes them up.

But actually, it would be better to treat topology awareness as the major function and exclusiveness as a sub-function flag, like:

spec:
  ...
  topology:
    nodeSelectorKey: "xxx"       <-- adds `nodeSelector: topology.kubernetes.io/supernode: xxx` to pods, plus an affinity rule for pods in the same group
    exclusive: true              <-- adds anti-affinity against other groups

So what's your suggestion, @Edwinhr716 @ahg-g @kerthcet?

(1) Should we split the topology and exclusive semantics apart, as in the YAML above?

or (2) Should we add a flag exclusive-topology-scope, as @ahg-g said?

"Global": across all LWS instances
"WithinSameLWS": within a single LWS instance

@k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 14, 2025
@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 14, 2025
Development

Successfully merging this pull request may close these issues:

exclusive-topology antiAffinity should include matching LWS name