Conversation

@ShubhamRwt (Contributor)

This PR aims to introduce the self-healing feature in Strimzi. This proposal incorporates all the comments and suggestions left on the old proposal #145. It aims to utilize Strimzi's auto-rebalancing feature to introduce self-healing.

@tomncooper left a comment:

I did a first pass. I think this is a better proposal which is more in line with how Strimzi currently uses CC.

I think you need more detail on the interaction with the current auto-rebalancing and also a clearer description of the FSM states and their transitions. I found it hard to follow the sequence you are proposing.

For the notifier, I actually think we should stop users from using custom notifiers (we could make it conditional on the full mode being set or not). As we are creating K8s resources in response to detected anomalies, users can create alerting based on that if they need it. If users do need that, then we could provide implementations of the various notifiers which extend our notifier rather than the CC one.

@scholzj (Member) left a comment:

I think this is going in the right direction. But I think it needs to go a bit deeper:

  • We need to establish our own terminology and not take over the Cruise Control one. There is not really any self-healing and most of the anomalies are not really anomalies.
  • If I read this proposal right, you want to focus on when the cluster is out-of-balance. That is a great start. But perhaps that should not be called mode: full? Calling it full seems confusing - does it mean that full includes scale-up / scale-down? Also, I guess in the future we would add some actual self-healing to handle the broken disks or brokers. That might create additional modes probably. So maybe mode: rebalance or mode: skew or something like that would make more sense?

@ppatierno (Member)

@scholzj good to know that you like the track we are on now :-)

Regarding the "full" related naming, we were just reusing the underlying mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
This is similar to the usage of add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
That said, we can find a better mode name at the higher level but still use the "full" mode at the KafkaRebalance level.
Not sure about mode "rebalance" as suggested because it would be weird within an "autoRebalance" field. The "skew" suggestion could sound better. But also, what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now? Anyway, naming is difficult so let's see what others think as well.

@scholzj (Member) commented Jul 17, 2025

> Regarding the "full" related naming, we were just reusing the underlying mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
> This is similar to the usage of add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
> That said, we can find a better mode name at the higher level but still use the "full" mode at the KafkaRebalance level.
> Not sure about mode "rebalance" as suggested because it would be weird within an "autoRebalance" field. The "skew" suggestion could sound better. But also, what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now? Anyway, naming is difficult so let's see what others think as well.

I do not think this works here. KafkaRebalance is essentially an imperative API (although implemented through a declarative resource). You are sending a command to the CO to do a full rebalance.

The autoRebalance section in the Kafka CR is a declarative API. You are declaring how the CO should automatically react to some situations. add-brokers and remove-brokers work well in both, as each is a command as well as an event description. full IMHO does not work that well in the declarative mode because, as I said, it can be easily interpreted as full == all available options (i.e. including scale-up or scale-down). That is where the idea of skew comes from, as from my understanding in this proposal we are reacting to skew -> the skew can be a CPU imbalance, disk imbalance etc.

goal-violation sounds reasonable ... but I wonder if it is too generic. I assume that the future modes ... e.g. CC's suggestions to scale up, scale down, bad distribution across racks, broken disks or brokers ... those are also goal violations, aren't they? But you cannot solve these by creating a KafkaRebalance. So they will need their own modes as well. That is kind of the context in which I'm trying to see the mode names.

@ShubhamRwt ShubhamRwt changed the title Added proposal for self-healing feature in operator Added proposal for auto-rebalance on imbalanced cluster feature in operator Jul 24, 2025
@tomncooper left a comment:

Ok I did another pass. I have a few questions:

  • How are you going to distinguish anomaly CMs from different Kafka clusters in the same namespace? I know it is not recommended, but users do deploy multiple Kafka clusters in the same NS.
  • You need to deal with GC'ing all these anomaly CMs in the case where a rebalance is ongoing. Do you delete them? Do you have some kind of timeout based on the detection interval?
  • It is not clear what you mean by scale up/down auto-rebalances being queued up. I assume you mean generated KafkaRebalance CRs, but it is not clear.

@nickgarvey

Chiming in as an end user - glad to see this proposal! We have been debating internally if we want to have a cronjob to issue rebalances, and this is a lot better. In particular, the model of using Cruise Control's anomaly detection while issuing the rebalances through KafkaRebalance CRs seems like it will fit perfectly into our workflows.

I see discussion on how to represent the anomalies. Any solution here is fine for us, I envision we will mostly be interacting with the KafkaRebalance CR and not much with anything else.

An area that could be made more explicit is the right way to stop all rebalances and not issue any more. Rebalance operations often saturate bandwidth, either disk or network, and cause major latency during producing. We often find ourselves needing to cancel them as we scale and learn our limits. It looks like we might be able to delete mode: skew on the CruiseControl CR to stop automatic rebalances, but it could be clearer.

Thanks for putting this together, excited to see this.

@ppatierno (Member) commented Aug 19, 2025

@nickgarvey Thanks for the feedback! Usually you are able to stop the current rebalancing by applying the stop annotation on the KafkaRebalance (of course the current batch has to finish first). With auto-rebalancing, the KafkaRebalance is owned by the operator and not by the user. That's good feedback anyway, because there is no clear way for the user to stop an auto-rebalancing in progress. I think you could apply the stop annotation on the KafkaRebalance resource, but you can't delete it due to a finalizer. Then you should delete the corresponding mode within the spec.cruiseControl.autoRebalance.mode field to avoid re-triggering. It's something to think about.
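For context, a minimal sketch of stopping an in-flight rebalance with the existing `strimzi.io/rebalance` annotation (equivalent to `kubectl annotate kafkarebalance my-rebalance strimzi.io/rebalance=stop`; the resource name here is hypothetical):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance                 # hypothetical name; auto-rebalance resources are operator-owned
  annotations:
    strimzi.io/rebalance: stop       # asks the operator to stop the ongoing rebalance; the current batch still finishes
```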

@scholzj (Member) left a comment:

I left some comments. Some are nits, some are questions, etc. I feel like it would be great to have more clarifications on:

  • How are the anomalies removed from the CM
  • How exactly do we prevent repeated imbalances which cannot be fixed.

> * from **RebalanceOnScaleDown** to:
>   * **RebalanceOnScaleDown**: if a rebalancing on scale down is still running or another one was requested while the first one ended.
>   * **RebalanceOnScaleUp**: if a scale down operation was requested together with a scale up and, because they run sequentially, the rebalance on scale down had the precedence, was executed first and completed successfully. We can now move on with rebalancing for the scale up.
>   * **Idle**: if a scale down operation was requested, it was executed and completed successfully/failed or a full rebalance was asked due to an anomaly but since the scale-down rebalance is done, we can ignore the anomalies assuming they are fixed by the rebalance. In case, they are not fixed, Cruise Control will detect them again and a new rebalance would be requested.
A reviewer (Member):

why is the state about "imbalance" missing here?

@ShubhamRwt (Contributor, Author) commented Oct 29, 2025:

The reason behind this is that if RebalanceOnScaleDown happened, we are assuming that the anomalies were also fixed by the rebalance that happened, therefore we will move to the `Idle` state. I have mentioned this in the text below.

@tomncooper:

But you have included the RebalanceOnImbalance in the valid transitions for RebalanceOnScaleUp below? Why haven't you included it here?

A reviewer (Member):

I re-opened this because I had the same thought as Tom. The RebalanceOnImbalance is in the other diagram below from RebalanceOnScaleUp because a ConfigMap is detected, so it should also be here after a scale down rebalance.

@tomncooper left a comment:

I have had another pass, mostly just suggestions around language and grammar.

However, you need a much clearer description of what happens when anomalies are detected while a rebalance is ongoing (either auto or manually triggered). Why is that an issue, what scenarios could arise, and how does having the timestamps help avoid that? Also, what exactly happens to the anomaly list: when is it updated, when is it cleared, etc.?

@ShubhamRwt (Contributor, Author)

Hello @scholzj @ppatierno @tinaselenge @kyguy @tomncooper @katheris @im-konge @see-quick @Frawless. I have addressed and resolved all the comments on the proposal. The proposal is ready for review again. It would be great if you could find some time to have another pass.

@tomncooper left a comment:

The section on the StrimziCruiseControlNotifier, particularly the CM format, needs review. It is not clear what the unfixable anomalies/goals list is actually going to contain. You also need to make clear what happens if all the detected anomalies are unfixable.


> This generated ConfigMap will contain the list of goal violation anomalies in the section named `anomalyList`.
> The `anomalyList` will refer to the anomalies that were detected by the anomaly detector.
> The `anomalyList` will be populated by the notifier everytime it detects an anomaly

@tomncooper suggested change:
  (old) The `anomalyList` will be populated by the notifier everytime it detects an anomaly
  (new) The `anomalyList` will be populated by the notifier everytime it detects an anomaly.

> This generated ConfigMap will contain the list of goal violation anomalies in the section named `anomalyList`.
> The `anomalyList` will refer to the anomalies that were detected by the anomaly detector.
> The `anomalyList` will be populated by the notifier everytime it detects an anomaly
> Whenever a rebalance is about to start by the operator(either it be in imbalance mode, remove-broker mode, add-broker mode) the `anomalyList` would be set to empty.

@tomncooper suggested change:
  (old) Whenever a rebalance is about to start by the operator(either it be in imbalance mode, remove-broker mode, add-broker mode) the `anomalyList` would be set to empty.
  (new) Whenever a rebalance is about to be triggered by the operator (either in imbalance mode, remove-broker mode or add-broker mode) the `anomalyList` would be cleared.

@tomncooper:
I assume this also applies to manually triggered rebalances?

A reviewer (Member):
Why? A manual rebalance is triggered by the creation of a KafkaRebalance resource by a user. There is no ConfigMap with anomalies, so no anomalyList to be cleared.

@tomncooper:

What if someone has auto-rebalancing on imbalance configured and the notifier creates the CM when it detects an anomaly, but the user posts a manual KafkaRebalance before the reconciliation loop triggers the operator to look at the anomaly CM?

Or is the operator able to watch for a specific CM resource, so it can trigger as soon as the anomaly CM is created/updated?

A reviewer (Member):

I get what you mean here; I would assume that if the manual rebalance triggers first then yes, the anomalyList will be cleared.

> Or is the operator able to watch for a specific CM resource, so it can trigger as soon as the anomaly CM is created/updated?

Well, the operator has a watch on both KafkaRebalance and ConfigMap changes, so it depends on which one the Kube API notifies it about first.

> The `anomalyList` will refer to the anomalies that were detected by the anomaly detector.
> The `anomalyList` will be populated by the notifier everytime it detects an anomaly
> Whenever a rebalance is about to start by the operator(either it be in imbalance mode, remove-broker mode, add-broker mode) the `anomalyList` would be set to empty.
> We empty the list earlier to make it easier to track the anomalies that were caught in between the rebalance.

@tomncooper suggested change:
  (old) We empty the list earlier to make it easier to track the anomalies that were caught in between the rebalance.
  (new) We empty the list in order to make it easier to track the anomalies that were detected between rebalances.

> The `anomalyList` will be populated by the notifier everytime it detects an anomaly
> Whenever a rebalance is about to start by the operator(either it be in imbalance mode, remove-broker mode, add-broker mode) the `anomalyList` would be set to empty.
> We empty the list earlier to make it easier to track the anomalies that were caught in between the rebalance.
> If the `anomalyList` isn't empty after the rebalance that means that there were anomalies detected in between the rebalance, and we need to make decision on whether we run another rebalance for these detected anomalies or not.

@tomncooper suggested change:
  (old) If the `anomalyList` isn't empty after the rebalance that means that there were anomalies detected in between the rebalance, and we need to make decision on whether we run another rebalance for these detected anomalies or not.
  (new) If the `anomalyList` isn't empty after a rebalance completes, that means that there were anomalies detected whilst the rebalance was running. We then need to make a decision on whether we run another rebalance for these detected anomalies or not.

> If the `anomalyList` isn't empty after the rebalance that means that there were anomalies detected in between the rebalance, and we need to make decision on whether we run another rebalance for these detected anomalies or not.
> Refer to this [section](./106-auto-rebalance-on-imbalanced-clusters.md#what-happens-if-anomaly-is-detected-while-the-auto-rebalancemanual-rebalance-is-happening) for more details.
> In cases where the rebalance is happening for `remove-broker` and `add-broker` mode, we will put a check before the rebalance starts, that if there are some anomalies detected, we should make the `anomalyList` empty.
> This is because its very much possible that the rebalance that will happen can fix all the detected anomalies.

@tomncooper suggested change:
  (old) This is because its very much possible that the rebalance that will happen can fix all the detected anomalies.
  (new) This is because it is possible that the rebalance will address some or all of the detected anomalies.


> Cruise Control has the ability to figure out if a goal violation can be fixed by the generated optimization proposal or not.
> The goal optimizer class does this job by doing analysis/dry-run if we can fix the anomaly by the generated proposal or not.
> If there is some goal, for example the `DiskDistributionUsage` goal is violated, and we cannot fix it since the all the disks are already completely populated then it is marked as unfixable.

@tomncooper suggested change:
  (old) If there is some goal, for example the `DiskDistributionUsage` goal is violated, and we cannot fix it since the all the disks are already completely populated then it is marked as unfixable.
  (new) If a goal such as the `DiskDistributionUsage` goal is violated, but we cannot fix it since all the disks are already completely populated, then it is marked as unfixable.

> The goal optimizer class does this job by doing analysis/dry-run if we can fix the anomaly by the generated proposal or not.
> If there is some goal, for example the `DiskDistributionUsage` goal is violated, and we cannot fix it since the all the disks are already completely populated then it is marked as unfixable.
> We can retrieve those unfixable goals by using the `hasUnfixableGoals` utility provided by the `AnomalyDetectorUtils` class.
> The notifier based on list returned by this utility checks if there are unfixable goals detected or not. If the list is empty then the notifier will set the action on the anomaly as `IGNORE` and that anomaly will be ignored.

@tomncooper:
I don't know what these two sentences mean? If there are no unfixable goals we ignore the anomaly? That doesn't sound right?

> #### Metrics for tracking the rebalance requests
>
> If the users want to track when the auto-rebalances happen, they can access the Strimzi [metrics](https://github.com/strimzi/strimzi-kafka-operator/blob/main/examples/metrics/grafana-dashboards/strimzi-operators.json#L712) for the `KafkaRebalance` custom resources, this includes when they were visible/created.
> These metrics also cover the `KafkaRebalance`(s) which were created automatically, so the users can utilize them to understand when an auto-rebalance were triggered in their cluster.

@tomncooper:
So do the metrics show which type of auto-rebalance was created? If not, I think we should add that label as part of this proposal.

> If the user runs a `manual` rebalance, we will empty the `anomalyList` in those cases too.
> If some other anomalies are detected again after the rebalance then they will be updated in the ConfigMap by the notifier and another rebalance will take place
>
> The ConfigMap will also have an option field called `unfixedAnomaly` list which would be created only when an unfixable anomaly is detected.

@tomncooper:
What if all the detected anomalies are unfixable? Is an unfixable anomaly included in the list?

If all the anomalies are unfixable we shouldn't trigger a rebalance; since rebalances are triggered based on the presence of anomalies in the list, it follows that we shouldn't add them to the list if they are unfixable.

@ShubhamRwt (Contributor, Author) commented Nov 7, 2025:

The name of the list would be `unfixableGoalList`. I will fix the name. If all the anomalies are unfixable, they will be added to the `unfixableGoalList`. Our code will check the `anomalyList` when triggering the rebalance, so no rebalance will happen if all the detected anomalies are unfixable.
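To make the discussed format concrete, here is an illustrative sketch of the anomaly ConfigMap, assuming the field names agreed in this thread (`anomalyList`, `unfixableGoalList`) and the `goal-violation-map` name mentioned later; the label and all values are placeholders:

```yaml
# Illustrative sketch only; field names follow this discussion, values are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: goal-violation-map            # name referenced later in this thread
  labels:
    strimzi.io/cluster: my-cluster    # hypothetical label tying the CM to a single Kafka cluster
data:
  anomalyList: |
    - <anomaly-id-1>
    - <anomaly-id-2>
  unfixableGoalList: |
    - DiskDistributionUsage           # example of a goal the proposal marks as unfixable
```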

> (mermaid flowchart excerpt)
> flowchart TB
> A[KafkaClusterCreator] --creates--> B[KafkaCluster]
> B -- calls --> D{Reconcilation in Progress}

@tomncooper:
Do you mean rebalance in progress?

A reviewer (Member):

I think Shubham is really referring to the operator reconciliation, but of course I could be wrong :-)

But I have the following comments:

  • I think the diagram should not show the KafkaClusterCreator and KafkaCluster, which are part of the general flow and not important in describing the auto-rebalancing.
  • I can't map what you say in point 2 to the arrows "no" and "yes" going out of the "Reconciliation in Progress" box.
  • isn't "fix the anomalies" the same as "Trigger auto-rebalance"?

Sorry, I am confused; I can't understand the goal of the diagram and how it maps to the description below.


> The advantages of using a single ConfigMap for all the detected anomalies versus one per anomaly are:
> 1. It provides a single point of reference for all the anomalies
> 2. When dealing multiple Kafka clusters, the ConfigMap created will be equal to number of clusters deployed
A reviewer (Member) suggested change:
  (old) 2. When dealing multiple Kafka clusters, the ConfigMap created will be equal to number of clusters deployed
  (new) 2. When dealing with multiple Kafka clusters, the number of ConfigMap(s) created will be equal to the number of clusters deployed

> This is because its very much possible that the rebalance that will happen can fix all the detected anomalies.
> The above statement holds true for the cases of `manual` rebalances too.
> If the user runs a `manual` rebalance, we will empty the `anomalyList` in those cases too.
> If some other anomalies are detected again after the rebalance then they will be updated in the ConfigMap by the notifier and another rebalance will take place
A reviewer (Member) suggested change:
  (old) If some other anomalies are detected again after the rebalance then they will be updated in the ConfigMap by the notifier and another rebalance will take place
  (new) If some other anomalies are detected again after the rebalance then they will be updated in the ConfigMap by the notifier and another rebalance will take place.

> The rebalance can be either auto-rebalance or manual rebalance.
>
> The created ConfigMap will persist when self-healing is enabled and will be deleted if the user decides to disable self-healing.
> This is done to keep track of the timestamps which will help us to avoid unnecessary rebalances.
A reviewer (Member):
I don't get the meaning of the two sentences above.
The first doubt is about what we mean by "self-healing" here, now that we don't use the CC feature anymore.
Also, what does it mean that the ConfigMap persists when self-healing is enabled? Where?
Finally, the second sentence is really not clear to me.

@ShubhamRwt (Contributor, Author) commented Nov 17, 2025:

By "the ConfigMap will persist" I meant that the generated ConfigMap will not be deleted until the user decides to disable the imbalance mode of auto-rebalance. This is because the ConfigMap is keeping track of the timestamps, which are important for us to tackle some of the edge cases.

I have now rephrased the paragraph to say something like:

> The created ConfigMap will persist while the `imbalance` mode of `auto-rebalance` is enabled and will be deleted if the user decides to disable the `imbalance` mode.
> This is done to be sure that we don't lose the timestamps data which will help us to avoid unnecessary rebalances.

A reviewer (Member):

Rather than "will persist", maybe you mean "will exist"?

> anomalyList: |
>   - <anomaly-id-1>
>   - <anomaly-id-2>
> unfixableGoals: |
A reviewer (Member):

+1 even later ... I feel there is confusion between unfixable goal and unfixable anomaly, shouldn't it be one or the other?

> (mermaid sequence diagram excerpt)
> RebalanceOnScaleDown-->>RebalanceOnScaleUp: then scale up
> RebalanceOnScaleUp-->>RebalanceOnImbalance: then full rebalance
> end
> RebalanceOnScaleDown->>Idle: scale down complete with no queued items
A reviewer (Member):

Sorry to come so late with such a comment, but I think that a sequence diagram is not the right choice to describe/show an FSM. You should just use an FSM diagram with transitions between states, with the triggers (and maybe outputs) on the arrows. Something like the first diagram.
In a sequence diagram, the boxes at the top are actors (humans, systems, or even class instances) which have a life (the vertical line) and interact with each other via method calls, messages and so on. For example, in your diagram, the RebalanceOnScaleDown looks like an actor calling "requested scale down" on itself and then calling "request scale up" on the RebalanceOnScaleUp. It's not the right diagram choice imho. wdyt @tomncooper?

A reviewer (Contributor):

I find these diagrams quite confusing. I think as @ppatierno suggested, simpler diagrams with states/triggers/arrows would be better.

@ShubhamRwt (Contributor, Author):

I am working on this
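In the meantime, a minimal mermaid state diagram of the FSM, as far as it can be reconstructed from the transition lists quoted in this thread, might look like the following (the exits from `RebalanceOnImbalance` are an assumption based on the described end states):

```mermaid
stateDiagram-v2
    %% Sketch reconstructed from the quoted transition lists; not the proposal's final diagram.
    [*] --> Idle
    Idle --> RebalanceOnScaleDown: scale down requested (any scale up is queued)
    Idle --> RebalanceOnScaleUp: only scale up requested
    Idle --> RebalanceOnImbalance: anomaly ConfigMap with non-empty anomalyList
    RebalanceOnScaleDown --> RebalanceOnScaleDown: rebalance still running or new scale down
    RebalanceOnScaleDown --> RebalanceOnScaleUp: queued scale up runs next
    RebalanceOnScaleDown --> Idle: completed or failed, nothing queued
    RebalanceOnScaleUp --> RebalanceOnImbalance: anomaly ConfigMap detected
    RebalanceOnScaleUp --> Idle: completed or failed
    RebalanceOnImbalance --> Idle: completed or failed (assumed)
```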



@tinaselenge (Contributor) left a comment:

Thank you for the update Shubham. It's been a while since the last time I read the proposal, and I can see quite a lot has changed. I like the proposed solution to use the auto-rebalancing and let the operator handle the healing.

Overall I'm happy with the proposal, but I think there are a few parts that need to be ironed out with more clarity.


> Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).
> All the `self.healing` prefixed properties are currently disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
> If a user somehow tries to set the
@tinaselenge (Contributor):

This sentence seems to be incomplete.

> Whenever a rebalance is about to start by the operator(either it be in imbalance mode, remove-broker mode, add-broker mode) the `anomalyList` would be set to empty.
> We empty the list earlier to make it easier to track the anomalies that were caught in between the rebalance.
> If the `anomalyList` isn't empty after the rebalance that means that there were anomalies detected in between the rebalance, and we need to make decision on whether we run another rebalance for these detected anomalies or not.
> Refer to this [section](./106-auto-rebalance-on-imbalanced-clusters.md#what-happens-if-anomaly-is-detected-while-the-auto-rebalancemanual-rebalance-is-happening) for more details.
@tinaselenge (Contributor):

This link to the other section does not seem to work.

> The rebalance can be either auto-rebalance or manual rebalance.
>
> The created ConfigMap will persist when self-healing is enabled and will be deleted if the user decides to disable self-healing.
> This is done to keep track of the timestamps which will help us to avoid unnecessary rebalances.
@tinaselenge (Contributor) suggested change:
  (old) This is done to keep track of the timestamps which will help us to avoid unnecessary rebalances.
  (new) This is done to keep track of the timestamps which will help us to avoid unnecessary rebalances. This will be explained more later.


> To tackle this problem we can make use of the `last_anomaly_detection_timestamp` and `rebalance_finished_timestamp` timestamps
> Everytime an anomaly of the same type is detected we will treat its detection time as the `last_anomaly_detection_time` and update the value of it in the `last_anomaly_detection_time` field of the ConfigMap.
> Now if the anomaly detection happened during the rebalance, the `ConfigMap` will be updated with the anomalies as well as the `last_anomaly_detection_time` will also be updated with last anomaly detection time.
@tinaselenge (Contributor):

I find these two sentences confusing. So whenever an anomaly is detected and it is already in the anomaly list, we will update the timestamp; if it's not in the list, we also add it to the list. Is that what it's trying to say?
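A hedged reading of the mechanism, shown as an illustrative fragment of the anomaly ConfigMap (the quoted text uses both `last_anomaly_detection_timestamp` and `last_anomaly_detection_time` for what appears to be the same field; the values here are made up):

```yaml
# Illustrative fragment only; field names follow the quoted proposal text.
data:
  anomalyList: |
    - <anomaly-id-1>
  last_anomaly_detection_time: "2025-11-17T10:42:00Z"    # updated on every detection of the same anomaly type
  rebalance_finished_timestamp: "2025-11-17T10:30:00Z"   # detections newer than this happened after the rebalance
```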

> The imbalance mode is configured by creating a new object in the `spec.cruiseControl.autoRebalance` list with its `mode` field set to `imbalance` and the corresponding rebalancing configuration is defined as a reference to a template `KafkaRebalance` custom resource, by using the `spec.cruiseControl.autoRebalance.template` field as a [LocalObjectReference](https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/local-object-reference/).
> The `template` field is optional and if not specified, the auto-rebalancing runs with the default Cruise Control configuration (i.e. the same used for unmodified manual `KafkaRebalance` invocations).
> To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to use whether it be `add-brokers`, `remove-brokers`, or `imbalance`.
> When the auto-rebalance configuration is set with `imbalance` mode enabled, the operator will trigger a partition rebalance whenever a goal violation is detected by the anomaly detector.
@tinaselenge (Contributor):

Can we add a Kafka CR example for this?

@ShubhamRwt (Contributor, Author) commented Nov 21, 2025:

I have shared the Kafka CR example a few lines further down in the proposal (line 121).
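For readers of this thread, a hedged sketch of what such a snippet might look like, based on the existing `spec.cruiseControl.autoRebalance` API with the proposed `imbalance` mode added (the template name is hypothetical and the rest of the `Kafka` spec is omitted):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  # ... kafka, node pools, entityOperator, etc. omitted ...
  cruiseControl:
    autoRebalance:
      - mode: add-brokers                    # existing mode
        template:
          name: my-rebalance-template        # hypothetical KafkaRebalance template resource
      - mode: imbalance                      # proposed mode: react to detected goal violations
        template:
          name: my-rebalance-template
```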

> If there is some goal, for example the `DiskDistributionUsage` goal is violated, and we cannot fix it since the all the disks are already completely populated then it is marked as unfixable.
> We can retrieve those unfixable goals by using the `hasUnfixableGoals` utility provided by the `AnomalyDetectorUtils` class.
> The notifier based on list returned by this utility checks if there are unfixable goals detected or not. If the list is empty then the notifier will set the action on the anomaly as `IGNORE` and that anomaly will be ignored.
> The notifier will also update the ConfigMap with a `unfixableGoals` list and put the goal which cannot be fixed by rebalancing there so that the operator can then set a warning condition based on the unfixable anomaly in the status of the `Kafka` CR
@tinaselenge (Contributor) suggested change:
  (old) The notifier will also update the ConfigMap with a `unfixableGoals` list and put the goal which cannot be fixed by rebalancing there so that the operator can then set a warning condition based on the unfixable anomaly in the status of the `Kafka` CR
  (new) The notifier will also update the ConfigMap with a `unfixableGoals` list and put the goal which cannot be fixed by rebalancing there so that the operator can then set a warning condition based on the unfixable anomaly in the status of the `Kafka` CR:

> * from **Idle** to:
>   * **RebalanceOnScaleDown**: if a scale down operation was requested. This transition happens even if a scale up was requested at the same time but the rebalancing on scaling down has the precedence. The rebalancing on scale up is queued. They will run sequentially.
>   * **RebalanceOnScaleUp**: if only a scale up operation was requested. There was no scale down operation requested.
>   * **RebalanceOnImbalance**: if a ConfigMap related to goal violation was detected, and the `anomalyList` is not empty
@tinaselenge (Contributor):

I think this sentence needs to cover similar information to the other two above, in terms of which operation has precedence if other operations are requested at the same time. I understand the diagram tries to explain it, but without a little explanation here it's not easy to understand.



> 1. The `KafkaClusterCreator` creates the `KafkaCluster` instance.
> 2. The `KafkaAutoRebalancingReconciler.reconcile()` will then check if there was any ConfigMap created with name `goal-violation-map` and whether the `anomalyList` list is empty or not, then the `full` rebalance(imbalance mode) would be performed if the list is not empty.
> 3. If a rebalance is already ongoing and more anomalies are detected, then the operator will just ignore the new anomalies based on the timestamps as discussed earlier in the proposal and delete all the anomalies from the `anomalyList` in the ConfigMap.
@tinaselenge (Contributor):

I don't think this description matches the diagram very well. I understand the diagram is high level and the low-level descriptions are here; maybe that is what makes it confusing.

> It is also the end state of a previous successfully completed or failed auto-rebalancing.
> In case of successful completion, once the rebalance moves to `Ready` state, we will delete the `KafkaRebalance` and update the `auto-rebalance` state to `Idle`.
> In case of failed auto-rebalancing, once the rebalance moves to `NotReady` state, we will follow the same procedure we used in successful completion.
> In this state, the operator removes the finalizer and deletes the corresponding "actual" `KafkaRebalance` custom resource.
@tinaselenge (Contributor):

So for the failed case, we follow the same procedure as for successful completion, which is to delete the KafkaRebalance and update the state. So what do we mean here by "actual"? Is it just that the order of deleting and transitioning state differs between failed and successful completion?
