Conversation

@ShubhamRwt (Contributor)

This PR aims to introduce the self-healing feature in Strimzi. This proposal incorporates all the comments and suggestions left on the old proposal #145. It aims to utilize Strimzi's auto-rebalancing feature to introduce self-healing.

@tomncooper left a comment:

I did a first pass. I think this is a better proposal which is more in line with how Strimzi currently uses CC.

I think you need more detail on the interaction with the current auto-rebalancing and also a clearer description of the FSM states and their transitions. I found it hard to follow the sequence you are proposing.

For the notifier, I actually think we should stop users from using custom notifiers (we could make it conditional on the full mode being set or not). As we are creating K8s resources in response to detected anomalies, users can create alerting based on that if they need it. If users do need that, then we could provide implementations of the various notifiers which extend our notifier rather than the CC one.

@scholzj (Member) left a comment:

I think this is going in the right direction. But I think it needs to go a bit deeper:

  • We need to establish our own terminology and not take over the Cruise Control one. There is not really any self-healing and most of the anomalies are not really anomalies.
  • If I read this proposal right, you want to focus on when the cluster is out-of-balance. That is a great start. But perhaps that should not be called mode: full? Calling it full seems confusing - does it mean that full includes scale-up / scale-down? Also, I guess in the future we would add some actual self-healing to handle the broken disks or brokers. That might create additional modes probably. So maybe mode: rebalance or mode: skew or something like that would make more sense?

@ppatierno (Member)

@scholzj good to know that you like the track we are on now :-)

Regarding the "full" related naming, we were just reusing the underlying mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
This is similar to the usage of add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
That said, we can find a better mode name at the higher level but still use the "full" mode at the KafkaRebalance level.
Not sure about mode "rebalance" as suggested because it would be weird within an "autoRebalance" field. The "skew" suggestion could sound better. But also, what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now? Anyway, naming is difficult so let's see what others think as well.

@scholzj (Member) commented Jul 17, 2025

> Regarding the "full" related naming, we were just reusing the underlying mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
> This is similar to the usage of add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
> That said, we can find a better mode name at the higher level but still use the "full" mode at the KafkaRebalance level.
> Not sure about mode "rebalance" as suggested because it would be weird within an "autoRebalance" field. The "skew" suggestion could sound better. But also, what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now? Anyway, naming is difficult so let's see what others think as well.

I do not think this works here. KafkaRebalance is essentially an imperative API (although implemented through a declarative resource). You are sending a command to the CO to do a full rebalance.

The autoRebalance section in the Kafka CR is a declarative API. You are declaring how the CO should automatically react to some situations. add-brokers and remove-brokers work well in both, as each is a command as well as an event description. full IMHO does not work that well in the declarative mode because, as I said, it can be easily interpreted as full == all available options (i.e. including scale-up or scale-down). That is where the idea of skew comes from, as from my understanding in this proposal we are reacting to skew -> the skew can be a CPU imbalance, disk imbalance etc.

goal-violation sounds reasonable ... but I wonder if it is too generic. I assume that the future modes ... e.g. CC's suggestions to scale up, scale down, bad distribution across racks, broken disks or brokers ... those are also goal violations, aren't they? But you cannot solve these by creating a KafkaRebalance. So they will need their own modes as well. That is kind of the context in which I'm trying to see the mode names.

@ShubhamRwt ShubhamRwt changed the title Added proposal for self-healing feature in operator Added proposal for auto-rebalance on imbalanced cluster feature in operator Jul 24, 2025
@tomncooper left a comment:

Ok I did another pass. I have a few questions:

  • How are you going to distinguish anomaly CMs from different Kafka clusters in the same namespace? I know it is not recommended, but users do deploy multiple Kafka clusters in the same NS.
  • You need to deal with GC'ing all these anomaly CMs in the case where a rebalance is ongoing. Do you delete them? Do you have some kind of timeout based on the detection interval?
  • It is not clear what you mean by scale up/down auto-rebalances being queued up. I assume you mean generated KafkaRebalance CRs, but it is not clear.

@nickgarvey

Chiming in as an end user - glad to see this proposal! We have been debating internally if we want to have a cronjob to issue rebalances, and this is a lot better. In particular, the model of using Cruise Control's anomaly detection while issuing the rebalances through KafkaRebalance CRs seems like it will fit perfectly into our workflows.

I see discussion on how to represent the anomalies. Any solution here is fine for us, I envision we will mostly be interacting with the KafkaRebalance CR and not much with anything else.

An area that could be made more explicit is the right way to stop all rebalances and not issue any more. Rebalance operations often saturate bandwidth, either disk or network, and cause major latency during producing. We often find ourselves needing to cancel them as we scale and learn our limits. It looks like we might be able to delete mode: skew on the CruiseControl CR to stop automatic rebalances, but it could be clearer.

Thanks for putting this together, excited to see this.

@ppatierno (Member) commented Aug 19, 2025

@nickgarvey Thanks for the feedback! Usually you are able to stop the current rebalancing by applying the stop annotation on the KafkaRebalance (of course the current batch has to finish first). With auto-rebalancing, the KafkaRebalance is owned by the operator and not by the user. That's good feedback anyway, because there is no clear way for the user to stop an auto-rebalancing in progress. I think you could apply the stop annotation on the KafkaRebalance resource, but you can't delete it due to a finalizer. Then you should delete the corresponding mode within the spec.cruiseControl.autoRebalance.mode field to avoid re-triggering. It's something to think about.
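For context, a minimal sketch of stopping an in-flight rebalance with the existing `strimzi.io/rebalance` annotation (equivalent to `kubectl annotate kafkarebalance my-rebalance strimzi.io/rebalance=stop`; the resource name here is hypothetical):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance                 # hypothetical name; auto-rebalance resources are operator-owned
  annotations:
    strimzi.io/rebalance: stop       # asks the operator to stop the ongoing rebalance; the current batch still finishes
```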

@scholzj (Member) left a comment:

I left some comments. Some are nits, some are questions, etc. I feel like it would be great to have more clarifications on:

  • How are the anomalies removed from the CM
  • How exactly do we prevent repeated imbalances which cannot be fixed.

> * from **RebalanceOnScaleDown** to:
>   * **RebalanceOnScaleDown**: if a rebalancing on scale down is still running or another one was requested while the first one ended.
>   * **RebalanceOnScaleUp**: if a scale down operation was requested together with a scale up and, because they run sequentially, the rebalance on scale down had the precedence, was executed first and completed successfully. We can now move on with rebalancing for the scale up.
>   * **Idle**: if a scale down operation was requested, it was executed and completed successfully/failed or a full rebalance was asked due to an anomaly but since the scale-down rebalance is done, we can ignore the anomalies assuming they are fixed by the rebalance. In case, they are not fixed, Cruise Control will detect them again and a new rebalance would be requested.
A reviewer (Member):

why is the state about "imbalance" missing here?

@ShubhamRwt (Contributor, Author) commented Oct 29, 2025:

The reason behind this is that if RebalanceOnScaleDown happened, we are assuming that the anomalies were also fixed by the rebalance that happened, therefore we will move to the `Idle` state. I have mentioned this in the text below.

@tomncooper:

But you have included the RebalanceOnImbalance in the valid transitions for RebalanceOnScaleUp below? Why haven't you included it here?

A reviewer (Member):

I re-opened this because I had the same thought as Tom. The RebalanceOnImbalance is in the other diagram below from RebalanceOnScaleUp because a ConfigMap is detected, so it should also be here after a scale down rebalance.

@tomncooper left a comment:

I have had another pass, mostly just suggestions around language and grammar.

However, you need a much clearer description of what happens when anomalies are detected while a rebalance is ongoing (either auto or manually triggered). Why is that an issue, what scenarios could arise, and how does having the timestamps help avoid that? Also, what exactly happens to the anomaly list: when is it updated, when is it cleared, etc.?

@ShubhamRwt (Contributor, Author)

Hello @scholzj @ppatierno @tinaselenge @kyguy @tomncooper @katheris @im-konge @see-quick @Frawless. I have addressed and resolved all the comments on the proposal. The proposal is ready for review again. It would be great if you could find some time to have another pass.

@tomncooper left a comment:

The section on the StrimziCruiseControlNotifier, particularly the CM format, needs review. It is not clear what the unfixable anomalies/goals list is actually going to contain. You also need to make clear what happens if all the detected anomalies are unfixable.


> This generated ConfigMap will contain the list of goal violation anomalies in the section named `anomalyList`.
> The `anomalyList` will refer to the anomalies that were detected by the anomaly detector.
> The `anomalyList` will be populated by the notifier everytime it detects an anomaly

@tomncooper suggested change:
  (old) The `anomalyList` will be populated by the notifier everytime it detects an anomaly
  (new) The `anomalyList` will be populated by the notifier everytime it detects an anomaly.

> This generated ConfigMap will contain the list of goal violation anomalies in the section named `anomalyList`.
> The `anomalyList` will refer to the anomalies that were detected by the anomaly detector.
> The `anomalyList` will be populated by the notifier everytime it detects an anomaly
> Whenever a rebalance is about to start by the operator(either it be in imbalance mode, remove-broker mode, add-broker mode) the `anomalyList` would be set to empty.

@tomncooper suggested change:
  (old) Whenever a rebalance is about to start by the operator(either it be in imbalance mode, remove-broker mode, add-broker mode) the `anomalyList` would be set to empty.
  (new) Whenever a rebalance is about to be triggered by the operator (either in imbalance mode, remove-broker mode or add-broker mode) the `anomalyList` would be cleared.

@tomncooper:
I assume this also applies to manually triggered rebalances?

A reviewer (Member):
Why? A manual rebalance is triggered by the creation of a KafkaRebalance resource by a user. There is no ConfigMap with anomalies, so no anomalyList to be cleared.

@tomncooper:

What if someone has auto-rebalancing on imbalance configured and the notifier creates the CM when it detects an anomaly, but the user posts a manual KafkaRebalance before the reconciliation loop triggers the operator to look at the anomaly CM?

Or is the operator able to watch for a specific CM resource, so it can trigger as soon as the anomaly CM is created/updated?

A reviewer (Member):

I get what you mean here; I would assume that if the manual rebalance triggers first then yes, the anomalyList will be cleared.

> Or is the operator able to watch for a specific CM resource, so it can trigger as soon as the anomaly CM is created/updated?

Well, the operator has a watch on both KafkaRebalance and ConfigMap changes, so it depends on which one the Kube API notifies it about first.

> The `anomalyList` will refer to the anomalies that were detected by the anomaly detector.
> The `anomalyList` will be populated by the notifier everytime it detects an anomaly
> Whenever a rebalance is about to start by the operator(either it be in imbalance mode, remove-broker mode, add-broker mode) the `anomalyList` would be set to empty.
> We empty the list earlier to make it easier to track the anomalies that were caught in between the rebalance.

@tomncooper suggested change:
  (old) We empty the list earlier to make it easier to track the anomalies that were caught in between the rebalance.
  (new) We empty the list in order to make it easier to track the anomalies that were detected between rebalances.

> The `anomalyList` will be populated by the notifier everytime it detects an anomaly
> Whenever a rebalance is about to start by the operator(either it be in imbalance mode, remove-broker mode, add-broker mode) the `anomalyList` would be set to empty.
> We empty the list earlier to make it easier to track the anomalies that were caught in between the rebalance.
> If the `anomalyList` isn't empty after the rebalance that means that there were anomalies detected in between the rebalance, and we need to make decision on whether we run another rebalance for these detected anomalies or not.

@tomncooper suggested change:
  (old) If the `anomalyList` isn't empty after the rebalance that means that there were anomalies detected in between the rebalance, and we need to make decision on whether we run another rebalance for these detected anomalies or not.
  (new) If the `anomalyList` isn't empty after a rebalance completes, that means that there were anomalies detected whilst the rebalance was running. We then need to make a decision on whether we run another rebalance for these detected anomalies or not.

> If the `anomalyList` isn't empty after the rebalance that means that there were anomalies detected in between the rebalance, and we need to make decision on whether we run another rebalance for these detected anomalies or not.
> Refer to this [section](./106-auto-rebalance-on-imbalanced-clusters.md#what-happens-if-anomaly-is-detected-while-the-auto-rebalancemanual-rebalance-is-happening) for more details.
> In cases where the rebalance is happening for `remove-broker` and `add-broker` mode, we will put a check before the rebalance starts, that if there are some anomalies detected, we should make the `anomalyList` empty.
> This is because its very much possible that the rebalance that will happen can fix all the detected anomalies.

@tomncooper suggested change:
  (old) This is because its very much possible that the rebalance that will happen can fix all the detected anomalies.
  (new) This is because it is possible that the rebalance will address some or all of the detected anomalies.


> Cruise Control has the ability to figure out if a goal violation can be fixed by the generated optimization proposal or not.
> The goal optimizer class does this job by doing analysis/dry-run if we can fix the anomaly by the generated proposal or not.
> If there is some goal, for example the `DiskDistributionUsage` goal is violated, and we cannot fix it since the all the disks are already completely populated then it is marked as unfixable.

@tomncooper suggested change:
  (old) If there is some goal, for example the `DiskDistributionUsage` goal is violated, and we cannot fix it since the all the disks are already completely populated then it is marked as unfixable.
  (new) If a goal such as the `DiskDistributionUsage` goal is violated, but we cannot fix it since all the disks are already completely populated, then it is marked as unfixable.

> The goal optimizer class does this job by doing analysis/dry-run if we can fix the anomaly by the generated proposal or not.
> If there is some goal, for example the `DiskDistributionUsage` goal is violated, and we cannot fix it since the all the disks are already completely populated then it is marked as unfixable.
> We can retrieve those unfixable goals by using the `hasUnfixableGoals` utility provided by the `AnomalyDetectorUtils` class.
> The notifier based on list returned by this utility checks if there are unfixable goals detected or not. If the list is empty then the notifier will set the action on the anomaly as `IGNORE` and that anomaly will be ignored.

@tomncooper:
I don't know what these two sentences mean? If there are no unfixable goals we ignore the anomaly? That doesn't sound right?

> #### Metrics for tracking the rebalance requests
>
> If the users want to track when the auto-rebalances happen, they can access the Strimzi [metrics](https://github.com/strimzi/strimzi-kafka-operator/blob/main/examples/metrics/grafana-dashboards/strimzi-operators.json#L712) for the `KafkaRebalance` custom resources, this includes when they were visible/created.
> These metrics also cover the `KafkaRebalance`(s) which were created automatically, so the users can utilize them to understand when an auto-rebalance were triggered in their cluster.

@tomncooper:
So do the metrics show which type of auto-rebalance was created? If not, I think we should add that label as part of this proposal.

> If the user runs a `manual` rebalance, we will empty the `anomalyList` in those cases too.
> If some other anomalies are detected again after the rebalance then they will be updated in the ConfigMap by the notifier and another rebalance will take place
>
> The ConfigMap will also have an option field called `unfixedAnomaly` list which would be created only when an unfixable anomaly is detected.

@tomncooper:
What if all the detected anomalies are unfixable? Is an unfixable anomaly included in the list?

If all the anomalies are unfixable we shouldn't trigger a rebalance; since rebalances are triggered based on the presence of anomalies in the list, it follows that we shouldn't add them to the list if they are unfixable.

@ShubhamRwt (Contributor, Author) commented Nov 7, 2025:

The name of the list would be `unfixableGoalList`. I will fix the name. If all the anomalies are unfixable, they will be added to the `unfixableGoalList`. Our code will check the `anomalyList` when triggering the rebalance, so no rebalance will happen if all the detected anomalies are unfixable.
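To make the discussed format concrete, here is an illustrative sketch of the anomaly ConfigMap, assuming the field names agreed in this thread (`anomalyList`, `unfixableGoalList`) and the `goal-violation-map` name mentioned later; the label and all values are placeholders:

```yaml
# Illustrative sketch only; field names follow this discussion, values are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: goal-violation-map            # name referenced later in this thread
  labels:
    strimzi.io/cluster: my-cluster    # hypothetical label tying the CM to a single Kafka cluster
data:
  anomalyList: |
    - <anomaly-id-1>
    - <anomaly-id-2>
  unfixableGoalList: |
    - DiskDistributionUsage           # example of a goal the proposal marks as unfixable
```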

> (mermaid flowchart excerpt)
> flowchart TB
> A[KafkaClusterCreator] --creates--> B[KafkaCluster]
> B -- calls --> D{Reconcilation in Progress}

@tomncooper:
Do you mean rebalance in progress?

A reviewer (Member):

I think Shubham is really referring to the operator reconciliation, but of course I could be wrong :-)

But I have the following comments:

  • I think the diagram should not show the KafkaClusterCreator and KafkaCluster, which are part of the general flow and not important in describing the auto-rebalancing.
  • I can't map what you say in point 2 to the arrows "no" and "yes" going out of the "Reconciliation in Progress" box.
  • isn't "fix the anomalies" the same as "Trigger auto-rebalance"?

Sorry, I am confused; I can't understand the goal of the diagram and how it maps to the description below.


> The advantages of using a single ConfigMap for all the detected anomalies versus one per anomaly are:
> 1. It provides a single point of reference for all the anomalies
> 2. When dealing multiple Kafka clusters, the ConfigMap created will be equal to number of clusters deployed
A reviewer (Member) suggested change:
  (old) 2. When dealing multiple Kafka clusters, the ConfigMap created will be equal to number of clusters deployed
  (new) 2. When dealing with multiple Kafka clusters, the number of ConfigMap(s) created will be equal to the number of clusters deployed

> This is because its very much possible that the rebalance that will happen can fix all the detected anomalies.
> The above statement holds true for the cases of `manual` rebalances too.
> If the user runs a `manual` rebalance, we will empty the `anomalyList` in those cases too.
> If some other anomalies are detected again after the rebalance then they will be updated in the ConfigMap by the notifier and another rebalance will take place
A reviewer (Member) suggested change:
  (old) If some other anomalies are detected again after the rebalance then they will be updated in the ConfigMap by the notifier and another rebalance will take place
  (new) If some other anomalies are detected again after the rebalance then they will be updated in the ConfigMap by the notifier and another rebalance will take place.

> The rebalance can be either auto-rebalance or manual rebalance.
>
> The created ConfigMap will persist when self-healing is enabled and will be deleted if the user decides to disable self-healing.
> This is done to keep track of the timestamps which will help us to avoid unnecessary rebalances.
A reviewer (Member):
I don't get the meaning of the two sentences above.
The first doubt is about what we mean by "self-healing" here, now that we don't use the CC feature anymore.
Also, what does it mean that the ConfigMap persists when self-healing is enabled? Where?
Finally, the second sentence is really not clear to me.

@ShubhamRwt (Contributor, Author) commented Nov 17, 2025:

By "the ConfigMap will persist" I meant that the generated ConfigMap will not be deleted until the user decides to disable the imbalance mode of auto-rebalance. This is because the ConfigMap is keeping track of the timestamps, which are important for us to tackle some of the edge cases.

I have now rephrased the paragraph to say something like:

> The created ConfigMap will persist while the `imbalance` mode of `auto-rebalance` is enabled and will be deleted if the user decides to disable the `imbalance` mode.
> This is done to be sure that we don't lose the timestamps data which will help us to avoid unnecessary rebalances.

A reviewer (Member):

Rather than "will persist", maybe you mean "will exist"?

> anomalyList: |
>   - <anomaly-id-1>
>   - <anomaly-id-2>
> unfixableGoals: |
A reviewer (Member):

+1 even later ... I feel there is confusion between unfixable goal and unfixable anomaly, shouldn't it be one or the other?

> (mermaid sequence diagram excerpt)
> RebalanceOnScaleDown-->>RebalanceOnScaleUp: then scale up
> RebalanceOnScaleUp-->>RebalanceOnImbalance: then full rebalance
> end
> RebalanceOnScaleDown->>Idle: scale down complete with no queued items
A reviewer (Member):

Sorry to come so late with such a comment, but I think that a sequence diagram is not the right choice to describe/show an FSM. You should just use an FSM diagram with transitions between states, with the triggers (and maybe outputs) on the arrows. Something like the first diagram.
In a sequence diagram, the boxes at the top are actors (humans, systems, or even class instances) which have a life (the vertical line) and interact with each other via method calls, messages and so on. For example, in your diagram, the RebalanceOnScaleDown looks like an actor calling "requested scale down" on itself and then calling "request scale up" on the RebalanceOnScaleUp. It's not the right diagram choice imho. wdyt @tomncooper?

A reviewer (Contributor):

I find these diagrams quite confusing. I think as @ppatierno suggested, simpler diagrams with states/triggers/arrows would be better.

@ShubhamRwt (Contributor, Author):

I am working on this
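In the meantime, a minimal mermaid state diagram of the FSM, as far as it can be reconstructed from the transition lists quoted in this thread, might look like the following (the exits from `RebalanceOnImbalance` are an assumption based on the described end states):

```mermaid
stateDiagram-v2
    %% Sketch reconstructed from the quoted transition lists; not the proposal's final diagram.
    [*] --> Idle
    Idle --> RebalanceOnScaleDown: scale down requested (any scale up is queued)
    Idle --> RebalanceOnScaleUp: only scale up requested
    Idle --> RebalanceOnImbalance: anomaly ConfigMap with non-empty anomalyList
    RebalanceOnScaleDown --> RebalanceOnScaleDown: rebalance still running or new scale down
    RebalanceOnScaleDown --> RebalanceOnScaleUp: queued scale up runs next
    RebalanceOnScaleDown --> Idle: completed or failed, nothing queued
    RebalanceOnScaleUp --> RebalanceOnImbalance: anomaly ConfigMap detected
    RebalanceOnScaleUp --> Idle: completed or failed
    RebalanceOnImbalance --> Idle: completed or failed (assumed)
```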



@tinaselenge (Contributor) left a comment:

Thank you for the update Shubham. It's been a while since the last time I read the proposal, and I can see quite a lot has changed. I like the proposed solution to use the auto-rebalancing and let the operator handle the healing.

Overall I'm happy with the proposal, but I think there are a few parts that need to be ironed out with more clarity.


> Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).
> All the `self.healing` prefixed properties are currently disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
> If a user somehow tries to set the
@tinaselenge (Contributor):

This sentence seems to be incomplete.

> Whenever a rebalance is about to start by the operator(either it be in imbalance mode, remove-broker mode, add-broker mode) the `anomalyList` would be set to empty.
> We empty the list earlier to make it easier to track the anomalies that were caught in between the rebalance.
> If the `anomalyList` isn't empty after the rebalance that means that there were anomalies detected in between the rebalance, and we need to make decision on whether we run another rebalance for these detected anomalies or not.
> Refer to this [section](./106-auto-rebalance-on-imbalanced-clusters.md#what-happens-if-anomaly-is-detected-while-the-auto-rebalancemanual-rebalance-is-happening) for more details.
@tinaselenge (Contributor):

This link to the other section does not seem to work.

> The rebalance can be either auto-rebalance or manual rebalance.
>
> The created ConfigMap will persist when self-healing is enabled and will be deleted if the user decides to disable self-healing.
> This is done to keep track of the timestamps which will help us to avoid unnecessary rebalances.
@tinaselenge (Contributor) suggested change:
  (old) This is done to keep track of the timestamps which will help us to avoid unnecessary rebalances.
  (new) This is done to keep track of the timestamps which will help us to avoid unnecessary rebalances. This will be explained more later.


> To tackle this problem we can make use of the `last_anomaly_detection_timestamp` and `rebalance_finished_timestamp` timestamps
> Everytime an anomaly of the same type is detected we will treat its detection time as the `last_anomaly_detection_time` and update the value of it in the `last_anomaly_detection_time` field of the ConfigMap.
> Now if the anomaly detection happened during the rebalance, the `ConfigMap` will be updated with the anomalies as well as the `last_anomaly_detection_time` will also be updated with last anomaly detection time.
@tinaselenge (Contributor):

I find these two sentences confusing. So whenever an anomaly is detected and it is already in the anomaly list, we will update the timestamp; if it's not in the list, we also add it to the list. Is that what it's trying to say?
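A hedged reading of the mechanism, shown as an illustrative fragment of the anomaly ConfigMap (the quoted text uses both `last_anomaly_detection_timestamp` and `last_anomaly_detection_time` for what appears to be the same field; the values here are made up):

```yaml
# Illustrative fragment only; field names follow the quoted proposal text.
data:
  anomalyList: |
    - <anomaly-id-1>
  last_anomaly_detection_time: "2025-11-17T10:42:00Z"    # updated on every detection of the same anomaly type
  rebalance_finished_timestamp: "2025-11-17T10:30:00Z"   # detections newer than this happened after the rebalance
```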

> The imbalance mode is configured by creating a new object in the `spec.cruiseControl.autoRebalance` list with its `mode` field set to `imbalance` and the corresponding rebalancing configuration is defined as a reference to a template `KafkaRebalance` custom resource, by using the `spec.cruiseControl.autoRebalance.template` field as a [LocalObjectReference](https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/local-object-reference/).
> The `template` field is optional and if not specified, the auto-rebalancing runs with the default Cruise Control configuration (i.e. the same used for unmodified manual `KafkaRebalance` invocations).
> To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to use whether it be `add-brokers`, `remove-brokers`, or `imbalance`.
> When the auto-rebalance configuration is set with `imbalance` mode enabled, the operator will trigger a partition rebalance whenever a goal violation is detected by the anomaly detector.
@tinaselenge (Contributor):

Can we add a Kafka CR example for this?

@ShubhamRwt (Contributor, Author) commented Nov 21, 2025:

I have shared the Kafka CR example a few lines further down in the proposal (line 121).
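For readers of this thread, a hedged sketch of what such a snippet might look like, based on the existing `spec.cruiseControl.autoRebalance` API with the proposed `imbalance` mode added (the template name is hypothetical and the rest of the `Kafka` spec is omitted):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  # ... kafka, node pools, entityOperator, etc. omitted ...
  cruiseControl:
    autoRebalance:
      - mode: add-brokers                    # existing mode
        template:
          name: my-rebalance-template        # hypothetical KafkaRebalance template resource
      - mode: imbalance                      # proposed mode: react to detected goal violations
        template:
          name: my-rebalance-template
```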

> If there is some goal, for example the `DiskDistributionUsage` goal is violated, and we cannot fix it since the all the disks are already completely populated then it is marked as unfixable.
> We can retrieve those unfixable goals by using the `hasUnfixableGoals` utility provided by the `AnomalyDetectorUtils` class.
> The notifier based on list returned by this utility checks if there are unfixable goals detected or not. If the list is empty then the notifier will set the action on the anomaly as `IGNORE` and that anomaly will be ignored.
> The notifier will also update the ConfigMap with a `unfixableGoals` list and put the goal which cannot be fixed by rebalancing there so that the operator can then set a warning condition based on the unfixable anomaly in the status of the `Kafka` CR
@tinaselenge (Contributor) suggested change:
  (old) The notifier will also update the ConfigMap with a `unfixableGoals` list and put the goal which cannot be fixed by rebalancing there so that the operator can then set a warning condition based on the unfixable anomaly in the status of the `Kafka` CR
  (new) The notifier will also update the ConfigMap with a `unfixableGoals` list and put the goal which cannot be fixed by rebalancing there so that the operator can then set a warning condition based on the unfixable anomaly in the status of the `Kafka` CR:

> * from **Idle** to:
>   * **RebalanceOnScaleDown**: if a scale down operation was requested. This transition happens even if a scale up was requested at the same time but the rebalancing on scaling down has the precedence. The rebalancing on scale up is queued. They will run sequentially.
>   * **RebalanceOnScaleUp**: if only a scale up operation was requested. There was no scale down operation requested.
>   * **RebalanceOnImbalance**: if a ConfigMap related to goal violation was detected, and the `anomalyList` is not empty
@tinaselenge (Contributor):

I think this sentence needs to cover similar information to the other two above, in terms of which operation has precedence if other operations are requested at the same time. I understand the diagram tries to explain it, but without a little explanation here it's not easy to understand.



> 1. The `KafkaClusterCreator` creates the `KafkaCluster` instance.
> 2. The `KafkaAutoRebalancingReconciler.reconcile()` will then check if there was any ConfigMap created with name `goal-violation-map` and whether the `anomalyList` list is empty or not, then the `full` rebalance(imbalance mode) would be performed if the list is not empty.
> 3. If a rebalance is already ongoing and more anomalies are detected, then the operator will just ignore the new anomalies based on the timestamps as discussed earlier in the proposal and delete all the anomalies from the `anomalyList` in the ConfigMap.
@tinaselenge (Contributor):

I don't think this description matches the diagram very well. I understand the diagram is high level and the low-level descriptions are here; maybe that is what makes it confusing.

> It is also the end state of a previous successfully completed or failed auto-rebalancing.
> In case of successful completion, once the rebalance moves to `Ready` state, we will delete the `KafkaRebalance` and update the `auto-rebalance` state to `Idle`.
> In case of failed auto-rebalancing, once the rebalance moves to `NotReady` state, we will follow the same procedure we used in successful completion.
> In this state, the operator removes the finalizer and deletes the corresponding "actual" `KafkaRebalance` custom resource.
@tinaselenge (Contributor):

So for the failed case, we follow the same procedure as for successful completion, which is to delete the KafkaRebalance and update the state. So what do we mean here by "actual"? Is it just that the order of deleting and transitioning state differs between failed and successful completion?
