Skip to content

Conversation

tyler-french
Copy link
Contributor

We deal with patterns of widespread cancellations. In these situations, it helps to be able to have the executions cancel quickly after the builds are canceled, and generally 60s is too long for this.

This PR simply adds these timeout fields to be configurable, so that when the Bazel client is killed, the execution backend is also called.

@tyler-french tyler-french changed the title WIP feat(bb_scheduler): allow cancellation timouts to be configurable WIP feat(bb_scheduler): allow cancellation timeouts to be configurable Jun 12, 2025
@EdSchouten
Copy link
Member

Even though I sympathize, I don't think that we should make a change like this as it stands. Operations can get abandoned for (at least) two different reasons:

  • Build failures, that cause the client to terminate eagerly.
  • Network connectivity issues.

For the former you think that the current value of 60 seconds is too high, which I can understand. But do we think that this value is also too high for the latter? I don't think so. Giving clients a minute to reconnect doesn't seem too bad. However, this change adds a single configuration knob to control both these cases. I think the core of the problem is that Bazel isn't clear about it intentions. Why can't it inform the scheduler that it is no longer interested in any of the operations it kicked off?

REv2 briefly mentions CancelOperation which is part of the LongRunning API. However that API is problematic, because it assumes a client "owns" a given operation, which may not necessarily be the case of in-flight deduplication. There's no guarantee that canceling an operation doesn't affect other builds. Life would have been easier if Execute() was a bidirectionally streaming RPC, allowing the client to abandon an operation by sending a message in band.

Maybe the best course of action is to work with the Remote API working group to see if the protocol can somehow be extended to properly facilitate this?

@tyler-french
Copy link
Contributor Author

For the former you think that the current value of 60 seconds is too high, which I can understand. But do we think that this value is also too high for the latter?

Our situation is a cancellation at the Buildkite level, which calls a SIGINT to the process and then a SIGKILL, so Bazel probably wouldn't have time even to send a CancelOperation. Assuming 50% of our builds get canceled (it's probably a little more), we waste a lot of worker CPU time and concurrency with these. I agree that 60s is probably a good level for a network issue, but even setting this to 30s would probably cover most of the network interruptions.

We can probably keep it at 60s for now without much impact

@tyler-french
Copy link
Contributor Author

Even if the RE2 was able to separate these two timeout settings, wouldn't we still want to make them configurable anyway? That way, we don't need to rebuild a container if we need to change it.

@EdSchouten
Copy link
Member

If the protocol was extended to permit clients to cancel operations that are abandoned, there would be nothing to configure, as it would take effect immediately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants