WIP feat(bb_scheduler): allow cancellation timeouts to be configurable #173

tyler-french · 2025-06-12T14:52:01Z

We deal with patterns of widespread cancellations. In these situations, it helps to be able to have the executions cancel quickly after the builds are canceled, and generally 60s is too long for this.

This PR simply adds these timeout fields to be configurable, so that when the Bazel client is killed, the execution backend is also called.

EdSchouten · 2025-06-12T17:43:12Z

Even though I sympathize, I don't think that we should make a change like this as it stands. Operations can get abandoned for (at least) two different reasons:

Build failures, that cause the client to terminate eagerly.
Network connectivity issues.

For the former you think that the current value of 60 seconds is too high, which I can understand. But do we think that this value is also too high for the latter? I don't think so. Giving clients a minute to reconnect doesn't seem too bad. However, this change adds a single configuration knob to control both these cases. I think the core of the problem is that Bazel isn't clear about it intentions. Why can't it inform the scheduler that it is no longer interested in any of the operations it kicked off?

REv2 briefly mentions CancelOperation which is part of the LongRunning API. However that API is problematic, because it assumes a client "owns" a given operation, which may not necessarily be the case of in-flight deduplication. There's no guarantee that canceling an operation doesn't affect other builds. Life would have been easier if Execute() was a bidirectionally streaming RPC, allowing the client to abandon an operation by sending a message in band.

Maybe the best course of action is to work with the Remote API working group to see if the protocol can somehow be extended to properly facilitate this?

tyler-french · 2025-06-12T18:52:26Z

For the former you think that the current value of 60 seconds is too high, which I can understand. But do we think that this value is also too high for the latter?

Our situation is a cancellation at the Buildkite level, which calls a SIGINT to the process and then a SIGKILL, so Bazel probably wouldn't have time even to send a CancelOperation. Assuming 50% of our builds get canceled (it's probably a little more), we waste a lot of worker CPU time and concurrency with these. I agree that 60s is probably a good level for a network issue, but even setting this to 30s would probably cover most of the network interruptions.

We can probably keep it at 60s for now without much impact

tyler-french · 2025-06-18T16:10:23Z

Even if the RE2 was able to separate these two timeout settings, wouldn't we still want to make them configurable anyway? That way, we don't need to rebuild a container if we need to change it.

EdSchouten · 2025-06-18T16:14:44Z

If the protocol was extended to permit clients to cancel operations that are abandoned, there would be nothing to configure, as it would take effect immediately.

feat(bb_scheduler): allow cancellation timouts to be configurable

8a99b33

tyler-french changed the title ~~WIP feat(bb_scheduler): allow cancellation timouts to be configurable~~ WIP feat(bb_scheduler): allow cancellation timeouts to be configurable Jun 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WIP feat(bb_scheduler): allow cancellation timeouts to be configurable #173

WIP feat(bb_scheduler): allow cancellation timeouts to be configurable #173

Uh oh!

tyler-french commented Jun 12, 2025

Uh oh!

EdSchouten commented Jun 12, 2025

Uh oh!

tyler-french commented Jun 12, 2025

Uh oh!

tyler-french commented Jun 18, 2025

Uh oh!

EdSchouten commented Jun 18, 2025

Uh oh!

Uh oh!

WIP feat(bb_scheduler): allow cancellation timeouts to be configurable #173

Are you sure you want to change the base?

WIP feat(bb_scheduler): allow cancellation timeouts to be configurable #173

Uh oh!

Conversation

tyler-french commented Jun 12, 2025

Uh oh!

EdSchouten commented Jun 12, 2025

Uh oh!

tyler-french commented Jun 12, 2025

Uh oh!

tyler-french commented Jun 18, 2025

Uh oh!

EdSchouten commented Jun 18, 2025

Uh oh!

Uh oh!