fix: exclusive topology only affect inside LWS instance #540

panpan0000 · 2025-05-23T07:52:14Z

What type of PR is this?

/kind bug

What this PR does / why we need it

say, we have two LWS instance on the same nodes,
LWS-1 requires CPU
LWS-2 requires GPU
they should share the same topology , and the exclusive policy should apply inside LWS-1 and LWS-2 respectively.

But now, if LWS-1 occupied the nodes, the LWS-2 will be pending

Which issue(s) this PR fixes

Fixes #539

Special notes for your reviewer

Does this PR introduce a user-facing change?

k8s-ci-robot · 2025-05-23T07:52:20Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: panpan0000
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2025-05-23T07:52:20Z

✅ Deploy Preview for kubernetes-sigs-lws canceled.

Name	Link
🔨 Latest commit	`cbb16cf`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-lws/deploys/68302ba13b3c90000813f058

k8s-ci-robot · 2025-05-23T07:52:23Z

Hi @panpan0000. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Signed-off-by: Peter Pan <[email protected]>

yankay · 2025-05-26T06:29:10Z

/ok-to-test

Edwinhr716 · 2025-05-29T00:22:51Z

/hold
Does this work if LWS "A" and LWS "B" have different names? We inject the Pod AntiAffinity of not having the GroupKey label, where GroupKey depends on the pod name, and the pod name changes based on LWS name.

panpan0000 · 2025-05-30T05:48:46Z

/hold Does this work if LWS "A" and LWS "B" have different names? We inject the Pod AntiAffinity of not having the GroupKey label, where GroupKey depends on the pod name, and the pod name changes based on LWS name.

@Edwinhr716 Thank you for your review, I wrote down the explanation carefully as below:

Sample Use case

assuming we have 2 nodes and 1 zones

Node	Topology (topology.kubernetes.io/zone:)
node1	x
node2	x

Assuming LWS-A request CPU only, and LWS-B request GPU

LWS-A as below

  annotations:
       leaderworkerset.sigs.k8s.io/exclusive-topology: topology.kubernetes.io/zone
spec:
  replicas: 3
  leaderWorkerTemplate:
    workerTemplate:
    size: 2

LWS-B as below

  annotations:
       leaderworkerset.sigs.k8s.io/exclusive-topology: topology.kubernetes.io/zone
spec:
  replicas: 2
  leaderWorkerTemplate:
    workerTemplate:
    size: 2

Current Situation

LWS Set	Group	Pod	Schedule Expectation
A	0	0	on node1
A	0	1	follow its leader on node1
A	1	0	should not in node1, so go to node2
A	1	1	follow its leader on node2
A	2	0	Pending
A	2	1	Not Created
B	0	0	Pending
B	0	1	Not Created
B	1	0	Not Created
B	1	1	Not Created

Expectation :

LWS Set	Group	Pod	Schedule Expectation
A	0	0	on node1
A	0	1	follow its leader on node1
A	1	0	should not in node1, so go to node2
A	1	1	follow its leader on node2
A	2	0	Pending
A	2	1	Not Created
B	0	0	on node1
B	0	1	follow its leader on node1
B	1	0	should not in node1, so go to node2
B	1	1	follow its leader on node2

Conclusion:

For each Leader, to be away from other group-leaders who belong to the same LWS Set instance:
the Anti-Affinity condition should be:
= (Same Namespace) AND (Same LWS Set name) AND (Different GroupID)

Previous condition is
= (Who has group-key label) AND (Whose group key varies)
= (Who has group-key label) AND [(namespace varies) OR (Leader pod name varies)]
= (Who has group-key label) AND [(namespace varies) OR (LWS set name varies) OR (group id varies)]

reference: groupUniqueKey = genGroupUniqueKey(pod.Namespace, pod.Name)

So for above previous condition, If there's already any LWS pod on zone-x who has different LWS-Set-Name, the later LWS pods will not be scheduled to zone-x.

My current PR condition is to anti-affinity with :
= (Same Namespace) AND (Same LWS Set Name ) [ (Diff namespace) OR (Diff LWS Set name) OR (Diff GroupID)]
= (Same Namespace) AND (Same LWS Set Name ) [ OR (Diff GroupID)]

(NOTE: the logic calculation will cancel out (Diff namespace) OR (Diff LWS Set name) )

currently my PR PR still retain and use below code for above condition in "[ ]" brackets：

				{
					Key:      podAffinityKey,
					Operator: metav1.LabelSelectorOpNotIn,
					Values:   []string{groupUniqueKey},
				},

But I think changing it to below will be more clear , What do. you think ?

				{
					Key:      leaderworkerset.GroupIndexLabelKey,
					Operator: metav1.LabelSelectorOpNotIn,
					pod.Labels[leaderworkerset.SubGroupIndexLabelKey],
				},

Edwinhr716 · 2025-06-02T18:30:52Z

= (Same Namespace) AND (Same LWS Set Name ) [ OR (Diff GroupID)]

How are you setting the OR here? If a pod has multiple pod anti-affinities they will be treated as AND no?

panpan0000 · 2025-06-03T01:58:20Z

= (Same Namespace) AND (Same LWS Set Name ) [ OR (Diff GroupID)]

How are you setting the OR here? If a pod has multiple pod anti-affinities they will be treated as AND no?

Oops, you found a typo @Edwinhr716

I tried to keep the original code GroupIDKey (LabelSelectorOpNotIn with SubGroupIndexLabelKey) to make minimal changes:

so my PR just adds a LabelSelectorOpIn with 'lwsName',
to achieve the final effect of expected = (Same Namespace) AND (Same LWS Set name) AND (Different GroupID)

Let me explain the The PR result again (below are mathematical logic operation calculation):

Anti Affinity 
= (Same LWS Set name) AND    (Different GroupID-Key)         <--- current PR code

// because "GroupID-Key = NS + LWSName + GroupID", so Derivation begins below:

= (Same LWS Set name) AND    [(Diff namespace) OR (Diff LWS Set name) OR (Diff GroupID) ]

make it shorter :

A AND (B OR !A OR C) 
= (A AND B) OR (A AND !A) OR (A AND C)
= (A AND B) OR   False    OR (A AND C)
= (A AND B) OR (A AND C)

So the result is

= [ (Same LWS Set name) AND (Diff namespace) ] OR   [ (Same LWS Set name) AND (Diff GroupID) ]

Aha... with the GroupIDKey , everything become quite obscure T_T

That's why I asked your experts opinion: if we can remove GroupIDKey, but just plainly use LWSName & GroupIndex as condotion.

But I think changing it to below will be more clear , What do. you think ?

  		{
  			Key:      leaderworkerset.GroupIndexLabelKey,
  			Operator: metav1.LabelSelectorOpNotIn,
  			pod.Labels[leaderworkerset.SubGroupIndexLabelKey],
  		},

panpan0000 · 2025-06-03T03:08:45Z

= (A AND B) OR (A AND C)
So the result is
= [ (Same LWS Set name) AND (Diff namespace) ] OR   [ (Same LWS Set name) AND (Diff GroupID) ]
Aha... with the GroupIDKey , everything become quite obscure T_T

In short

the code logic can work, but very very obscure .

When if we abandon the GroupIDKey

it will be easier to understand like below, what do you think if we abandon GroupKey and change the code like this ? If yes, I could rework this PR?

    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: leaderworkerset.sigs.k8s.io/name
            operator: In
            values:
            - $LWS_NAME
          - key: leaderworkerset.sigs.k8s.io/group-index
            operator: Exists
          - key: leaderworkerset.sigs.k8s.io/group-index       <-------- change from GroupKEY to Group ID
            operator: NotIn
            values:
            - $MY_GROUP_ID
        namespaces:                 <-------- add namespace restriction 
        - $MY_NAMESPACE
        topologyKey: topology.kubernetes.io/$MY_TOPOLOGY_KEY

If you agree, I've another PR for this fix method no.2 :
#551

Edwinhr716 · 2025-06-04T00:13:00Z

Thanks for the thorough explanation, now it makes sense to me what happens if they have diff group ID.

One last question, what happens if LWS A and LWS B both request the same type of resource? Then exclusive placement won't be respected in that case correct?

panpan0000 · 2025-06-04T02:46:23Z

Thanks for the thorough explanation, now it makes sense to me what happens if they have diff group ID.

One last question, what happens if LWS A and LWS B both request the same type of resource? Then exclusive placement won't be respected in that case correct?

Good point , @Edwinhr716 .

let's assume LWS A & B both request GPU:
Denote 4 nodes with 8 GPU each (total 32 GPUs):

Case #1 ( no conflict )

LWS A : size 4, each leader & worker requests 4 GPU. LWS B is the same.
Then LWS A and B will be co-located in the 4 nodes, A uses 50% GPU and B uses other 50%.

Case #2: ( with conflict )

LWS A : size 4, each leader & worker requests 8 GPU. LWS B is the same.
If A comes first, A will occupied all 32 GPUs, B will not be scheduled. If some pod in A restarts , B's pod may preemp (it requires gang-scheduling, maybe now with help of Kueue?)
If A & B comes at the same time, it requires gang-scheduling( already in LWS roadmap) , otherwise, some pods in A or B will be pending and cause dead-lock

So, Do you think it's acceptable ?

Edwinhr716 · 2025-06-04T03:01:40Z

So in the case of 2, we are no longer guaranteeing exclusive placement, even if the flag is set.

panpan0000 · 2025-06-04T13:22:37Z

So in the case of 2, we are no longer guaranteeing exclusive placement, even if the flag is set.

hmm...yes. Do you think is there any way to statify both case 1 and 2？ or case2 should wait gang being implemented？

Edwinhr716 · 2025-06-09T15:35:06Z

I don't think we should break existing functionality for the purpose of adding new one. @kerthcet @ahg-g @ardaguclu @yankay thoughts?

panpan0000 · 2025-06-10T05:29:39Z

@Edwinhr716 Or we add another policy field as a flag ?

ahg-g · 2025-06-10T06:03:28Z

We can do that with a parameter, perhaps call it exclusive-topology-scope that takes values like

"Global": means across all LWS instances
"WithinSameLWS" means within an LWS instance

Edwinhr716 · 2025-06-12T17:23:07Z

Adding a new field sounds good to me, though we should add a disclaimer that using the feature can cause a deadlock

kerthcet · 2025-06-13T02:45:18Z

I was on holiday before, quote my comments here #539 (comment):

However, I think `You will see B will be pending due to A already live in those nodes (but B may use diff resource than A, they should not compete and A should not stop B from running )` 
exactly reflects the meaning of exclusive. And I don't think we should change the semantic.

Not objecting the idea but I would defer this until we see other requests.

though we should add a disclaimer that using the feature can cause a deadlock

What do you mean deadlock, if so, I would say we should avoid it.

Edwinhr716 · 2025-06-13T02:49:15Z

What do you mean deadlock, if so, I would say we should avoid it.

See

Case #2: ( with conflict )
LWS A : size 4, each leader & worker requests 8 GPU. LWS B is the same.
If A comes first, A will occupied all 32 GPUs, B will not be scheduled. If some pod in A restarts , B's pod may preemp (it requires gang-scheduling, maybe now with help of Kueue?)
If A & B comes at the same time, it requires gang-scheduling( already in LWS roadmap) , otherwise, some pods in A or B will be pending and cause dead-lock

kerthcet · 2025-06-13T04:05:47Z

Got it, we should defer this until gang supports at least.

panpan0000 · 2025-06-16T04:04:44Z

#539 (comment)

copy my reply from #539 to here :

I would say the existing implementation combines topology-aware with exclusive together and kind of mixing up.

But actually, it would be better that topology-aware as a major function, and exclusive as a flag of sub-function, like

spec:
 ...
 topology:
       nodeSelectorKey: "xxx"          <-- it will add `nodeSelector:           topology.kubernetes.io/supernode: xxx` for pods, and affinity rule for same groups
       exclusive:  true                <-- it will add antiAffinity to other groups

So what's your suggestion ? @Edwinhr716 @ahg-g @kerthcet ?

(1) should we split the topology and exclusive semantic apart , like above yaml ?

or (2) adding a flag exclusive-topology-scope like @ahg-g said ?

"Global": means across all LWS instances
"WithinSameLWS" means within an LWS instance

k8s-triage-robot · 2025-09-14T04:24:37Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2025-10-14T04:58:45Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 23, 2025

k8s-ci-robot requested a review from ardaguclu May 23, 2025 07:52

k8s-ci-robot requested a review from kerthcet May 23, 2025 07:52

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 23, 2025

fix: exclusive topolgy only affect inside LWS instance

cbb16cf

Signed-off-by: Peter Pan <[email protected]>

panpan0000 force-pushed the fix_affinity branch from 073d99f to cbb16cf Compare May 23, 2025 08:02

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 26, 2025

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2025

panpan0000 mentioned this pull request Jun 3, 2025

fix: exclusive topology only affect inside LWS instance #551

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 14, 2025

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 14, 2025

fix: exclusive topology only affect inside LWS instance #540

Are you sure you want to change the base?

fix: exclusive topology only affect inside LWS instance #540

Conversation

panpan0000 commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it

Which issue(s) this PR fixes

Special notes for your reviewer

Does this PR introduce a user-facing change?

Uh oh!

k8s-ci-robot commented May 23, 2025

Uh oh!

netlify bot commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for kubernetes-sigs-lws canceled.

Uh oh!

k8s-ci-robot commented May 23, 2025

Uh oh!

yankay commented May 26, 2025

Uh oh!

Edwinhr716 commented May 29, 2025

Uh oh!

panpan0000 commented May 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Sample Use case

Current Situation

Expectation :

Conclusion:

Uh oh!

Edwinhr716 commented Jun 2, 2025

Uh oh!

panpan0000 commented Jun 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

panpan0000 commented Jun 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

In short

When if we abandon the GroupIDKey

Uh oh!

Edwinhr716 commented Jun 4, 2025

Uh oh!

panpan0000 commented Jun 4, 2025

Uh oh!

Edwinhr716 commented Jun 4, 2025

Uh oh!

panpan0000 commented Jun 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Edwinhr716 commented Jun 9, 2025

Uh oh!

panpan0000 commented Jun 10, 2025

Uh oh!

ahg-g commented Jun 10, 2025

Uh oh!

Edwinhr716 commented Jun 12, 2025

Uh oh!

kerthcet commented Jun 13, 2025

Uh oh!

Edwinhr716 commented Jun 13, 2025

Uh oh!

kerthcet commented Jun 13, 2025

Uh oh!

panpan0000 commented Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-triage-robot commented Sep 14, 2025

Uh oh!

k8s-triage-robot commented Oct 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

panpan0000 commented May 23, 2025 •

edited

Loading

netlify bot commented May 23, 2025 •

edited

Loading

panpan0000 commented May 30, 2025 •

edited

Loading

panpan0000 commented Jun 3, 2025 •

edited

Loading

panpan0000 commented Jun 3, 2025 •

edited

Loading

panpan0000 commented Jun 4, 2025 •

edited

Loading

panpan0000 commented Jun 16, 2025 •

edited

Loading