Skip to content

Conversation

Chiro11
Copy link
Contributor

@Chiro11 Chiro11 commented Sep 16, 2025

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

We must register new outputs when creating an actor; otherwise, we will meet a deadlock.

  • For the new created table, we must create new executors, then create its corresponding dispatcher for downstream, then we can do execute().
  • Creating the new dispatcher requires waiting for the downstream job to register the new output.
  • The Merge operator in downstream job must receive a barrier from the old table's fragment-graph before it can obtain the mutation, create a new upstream, and register new output.
  • This means that the old job must process the barrier and send it to downstream normally before the new job is created.
  • This means that the upstream job's dispatcher must also be able to send the barrier normally.
  • The upstream job's dispatcher, when receiving the barrier, must wait for the new downstream table to register its output before sending the barrier to both the old and new downstream tables.

So, this becomes:

create-new-job -> downstream-receive-barrier -> old-job-process-barrier -> upstream-dispatch-barrier -> create-new-job

which will cause deadlocks, so we must register the upstream output when creating the actor.

Close #23064

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • I have checked the Release Timeline and Currently Supported Versions to determine which release branches I need to cherry-pick this PR into.

Documentation

  • My PR needs documentation updates.
Release note

@github-actions github-actions bot added the type/fix Type: Bug fix. Only for pull requests. label Sep 16, 2025
@Chiro11 Chiro11 changed the title fix(sink): Fix dead lock in alter-table with both upstream and downstream fix(sink): Fix dead lock in replace-job with both upstream and downstream Sep 16, 2025
@BugenZhao
Copy link
Member

BugenZhao commented Sep 17, 2025

I wonder which step in your explanation of deadlock is the newly introduced behavior from UpstreamSinkUnion? Because it seems not reproducible in v2.5.1 release.


impl UpstreamSinkUnionExecutor {
pub fn new(
pub async fn new(
Copy link
Member

@BugenZhao BugenZhao Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some comments to show that when this function might be blocking (i.e. not immediately ready)? Seems like the origin is RemoteInput::new.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, here we need to wait for establishing stream-rpc with the upstream actor.

@Chiro11
Copy link
Contributor Author

Chiro11 commented Sep 17, 2025

I wonder which step in your explanation of deadlock is the newly introduced behavior from UpstreamSinkUnion? Because it seems not reproducible in v2.5.1 release.

Only on the main branch. UpstreamSinkUnion operator has been temporarily banned on the release branch.

@wenym1
Copy link
Contributor

wenym1 commented Sep 18, 2025

Discussed offline.

The dependency of upstream-dispatch-barrier -> create-new-job is avoidable. This dependency is due to the current problematic implementation of Merge, rather than the high-level design. The dependency comes from the current implementation that, even though the barrier with update mutation has been sent from local barrier manager, we still have to wait for upstream barrier before we apply the update mutation and create inputs to new upstream.

To solve this, we can change to, while polling upstreaming messages, we can concurrently poll new barrier with mutation from barrier_rx, and then create new inputs when needed, so that the two won't have any dependency, and then the mentioned deadlock can be avoided.

@Chiro11
Copy link
Contributor Author

Chiro11 commented Sep 23, 2025

Merged to #23273

@Chiro11 Chiro11 closed this Sep 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/fix Type: Bug fix. Only for pull requests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to add column to table that has both upstream sink and downstream dependency
3 participants