
Conversation

@il-kyun (Contributor) commented Aug 17, 2025

Summary

  1. Why: Per-batch replication throttle set/clear causes excessive AdminClient calls when many batches involve the same brokers, significantly slowing rebalances. This PR proposes a bulk mode that sets the replication bytes throttle once before inter-broker movements and clears it once after, reducing Admin overhead and improving execution time.
  2. What:
    • Add bulk.replication.throttle.enabled (default: true) to ExecutorConfig.
    • In Executor, when enabled (see the flow sketch after this list):
      • Set throttles once before inter-broker partition movements begin (filtering to existing topics).
      • Clear throttles once after all inter-broker partition movements complete, with summary logging (completed/aborted/dead/in-progress/aborting).
    • Keep existing per-batch behavior when disabled.
    • Update tests to cover both bulk and non-bulk paths.
    • Follow-ups: Improvements to the Kafka calls inside ReplicationThrottleHelper are being addressed in a separate PR; this PR focuses solely on bulk set/clear semantics.
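To make the proposed control flow concrete, here is a minimal sketch of bulk vs. per-batch throttling, as referenced in the list above. The generic callbacks stand in for Cruise Control's actual Executor and ReplicationThrottleHelper code; this does not claim to reproduce their real signatures.

```java
import java.util.List;
import java.util.function.Consumer;

// Minimal sketch of bulk vs. per-batch throttle handling (assumed shape,
// not the PR's actual code). The callbacks stand in for the real
// ReplicationThrottleHelper set/clear calls and batch execution.
public final class BulkThrottleFlowSketch {
  public static <T> void run(boolean bulkEnabled,
                             List<List<T>> batches,
                             List<T> allTasks,
                             Consumer<List<T>> setThrottles,
                             Consumer<List<T>> clearThrottles,
                             Consumer<List<T>> executeBatch) {
    if (bulkEnabled) {
      setThrottles.accept(allTasks);     // one Admin mutation before any moves
    }
    try {
      for (List<T> batch : batches) {
        if (!bulkEnabled) {
          setThrottles.accept(batch);    // legacy: set per batch
        }
        executeBatch.accept(batch);
        if (!bulkEnabled) {
          clearThrottles.accept(batch);  // legacy: clear per batch
        }
      }
    } finally {
      if (bulkEnabled) {
        clearThrottles.accept(allTasks); // one clear after all moves complete
      }
    }
  }
}
```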

Expected Behavior

  • With bulk.replication.throttle.enabled=true (see the declaration sketch after this list):
    • Throttles are applied once at the start of inter-broker moves and cleared once at the end.
    • Fewer AdminClient set/clear calls → faster rebalances.
    • No functional change to movement semantics; only fewer config mutations.
  • With bulk.replication.throttle.enabled=false:
    • Current per-batch set/clear behavior remains unchanged.

Actual Behavior

  • Throttles are currently set before each batch and cleared after each batch if no other tasks are active on the same broker.
  • When many batches affect the same brokers, the same throttle is repeatedly set/cleared.
  • Each Admin request can take tens of seconds (~20s observed locally), causing noticeable slowdowns; the example after this list shows the Admin round-trip behind each set/clear.
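To make the per-request cost concrete: Kafka exposes the replication throttle rate as a dynamic broker config, so each set or clear translates to an incrementalAlterConfigs round-trip per affected broker. A standalone example of a single set follows; the bootstrap address, broker id, and rate are arbitrary placeholders.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// One throttle "set" for one broker: a single incrementalAlterConfigs call
// on the broker's dynamic replication-throttle configs.
public class ThrottleSetExample {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    try (Admin admin = Admin.create(props)) {
      ConfigResource broker0 = new ConfigResource(ConfigResource.Type.BROKER, "0");
      AlterConfigOp setLeaderRate = new AlterConfigOp(
          new ConfigEntry("leader.replication.throttled.rate", "10485760"),
          AlterConfigOp.OpType.SET);
      AlterConfigOp setFollowerRate = new AlterConfigOp(
          new ConfigEntry("follower.replication.throttled.rate", "10485760"),
          AlterConfigOp.OpType.SET);
      admin.incrementalAlterConfigs(
              Map.of(broker0, List.of(setLeaderRate, setFollowerRate)))
          .all().get(); // one synchronous round-trip per config mutation
    }
  }
}
```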

Steps to Reproduce

  1. Prepare a rebalance plan with many inter-broker replica movements that repeatedly involve the same brokers.
  2. Run the rebalance with current behavior (bulk disabled) and observe repeated set/clear throttle Admin calls per batch (the watcher sketched after this list can help surface this churn).
  3. Measure end-to-end inter-broker phase duration.
  4. Enable bulk.replication.throttle.enabled=true and re-run to observe reduced Admin calls and improved completion time.
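One way to observe the churn in step 2 is to poll a broker's dynamic config and print every change to the throttle rate. A hypothetical watcher for a local test cluster (bootstrap address and broker id are assumptions):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Objects;
import java.util.Properties;

// Polls one broker's dynamic config and prints every change to the
// leader-side replication throttle rate, making repeated set/clear visible.
public class ThrottleWatcher {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    try (Admin admin = Admin.create(props)) {
      ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
      String previous = "unset";
      while (true) {
        Config config = admin.describeConfigs(List.of(broker)).all().get().get(broker);
        ConfigEntry entry = config.get("leader.replication.throttled.rate");
        String current = (entry == null || entry.value() == null) ? "unset" : entry.value();
        if (!Objects.equals(previous, current)) {
          System.out.printf("%d leader.replication.throttled.rate=%s%n",
                            System.currentTimeMillis(), current);
          previous = current;
        }
        Thread.sleep(1_000L);
      }
    }
  }
}
```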

Additional evidence

  1. Environment (test settings):
    • concurrency.adjuster.max.partition.movements.per.broker=12
    • default.replica.movement.strategies=com.linkedin.kafka.cruisecontrol.executor.strategy.PrioritizeMinIsrWithOfflineReplicasStrategy,com.linkedin.kafka.cruisecontrol.executor.strategy.PrioritizeOneAboveMinIsrWithOfflineReplicasStrategy,com.linkedin.kafka.cruisecontrol.executor.strategy.PrioritizeSmallReplicaMovementStrategy,com.linkedin.kafka.cruisecontrol.executor.strategy.BaseReplicaMovementStrategy
  2. Scenario and results:
    • 900 partitions to rebalance (800 small, 100 large).
    • Before (bulk disabled): moving the 800 small partitions took over 1 hour 30 minutes.
    • After enabling bulk.replication.throttle.enabled: the 800 small partitions completed within a few minutes.
  3. Logs: added summary logs when clearing bulk throttles (Completed/Aborted/Dead/InProgress/Aborting counts); a sketch of the log shape follows this list.
  4. Follow-ups:
    • We are improving the Kafka call patterns inside ReplicationThrottleHelper in a separate PR; this PR intentionally limits scope to the bulk set/clear behavior.
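The summary logging mentioned in item 3 could look roughly like the following SLF4J sketch; the PR's exact message format and counters may differ.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative shape of the clear-time summary log (wording assumed).
public final class ThrottleClearSummarySketch {
  private static final Logger LOG = LoggerFactory.getLogger(ThrottleClearSummarySketch.class);

  static void logClearSummary(int completed, int aborted, int dead,
                              int inProgress, int aborting) {
    LOG.info("Cleared bulk replication throttles. Tasks: completed={}, aborted={}, "
        + "dead={}, inProgress={}, aborting={}",
        completed, aborted, dead, inProgress, aborting);
  }
}
```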

Categorization

  • documentation
  • bugfix
  • new feature
  • refactor
  • security/CVE
  • other

This PR resolves #1972

@il-kyun (Contributor, Author) commented Aug 31, 2025

👋 @mhratson @CCisGG
I’ve submitted a PR for Cruise Control but haven’t received any reviews yet.
Would you mind taking a look when you get a chance? 🙏

@kyguy (Contributor) left a comment

Looks good @il-kyun! I made a quick pass to help get this in front of the maintainers.

Is there any specific reason why the bulk replication throttle mode shouldn't be enabled by default? It seems like it would be helpful for most rebalances.

@il-kyun (Contributor, Author) commented Sep 16, 2025

@kyguy Thanks for starting the review on my PR, I really appreciate it!

Is there any specific reason why the bulk replication throttle mode shouldn't be enabled by default? It seems like it would be helpful for most rebalances.

I initially set bulk.replication.throttle.enabled to false to avoid surprising changes in existing deployments and to preserve backward compatibility.
I agree: switching the default to true is better.

@il-kyun requested a review from @kyguy on September 16, 2025 16:02
@kyguy (Contributor) left a comment

Thanks for the updates! Just left some minor comments.

Similar to the related PR here: #2305, I wonder if it would be better to simply update the existing non-batching logic to this batching implementation, instead of making it configurable, to spare us the extra code complexity. I can't think of a reason why users would not want to batch requests like this. Anyway, I'll defer to the maintainers on that!

@il-kyun (Contributor, Author) commented Sep 19, 2025

Thanks for the updates! Just left some minor comments.

Similar to the related PR here: #2305, I wonder if it would be better to simply update the existing non-batching logic to this batching implementation, instead of making it configurable, to spare us the extra code complexity. I can't think of a reason why users would not want to batch requests like this. Anyway, I'll defer to the maintainers on that!

I also think this is useful.
The reason I kept it configurable for now is that I wasn't 100% sure whether replacing the existing non-batching logic directly might have side effects. If the maintainers and contributors are comfortable with it, I'm happy to update the PR so that batching simply replaces the old logic; that would indeed make the code simpler. #2304 #2305
