Skip to content

Should Nursery.start() be error-interruptible by cancellation? #2258

Open
@goodboy

Description

@goodboy

Synopsis

A somewhat convoluted/complex case I discovered while refining a "service manager" style system in tractor is where a so called "service nursery" experiences the following situation:

  • a first task is requested to start async in the background via Nursery.start_soon()
  • that same nursery's .start() is called to start another 2nd task which can execute (in theory async) up to its required .started() where it unblocks the caller but, between the parent:.start() -> child:.started() sequence may need synchronization with the previous first bg task (the syncing implemented with a trio.Event for example).
  • the first task errors and normally would be processed on Nursery exit

For further background on what the real-world multi-actor daemon service system looks like see this test. The TL;DR is that having a service nursery in the global state of a process (in tractor terms we call this actor local) is fairly useful for spawning long lived daemon service-actors which can be started and stopped from some (other) remote program (not in the process tree) where the lifetime of the daemons (async tasks running in processes) are still lifetime-managed using persistent trio nurseries.


Example of the issue

Here is a small pytest ready module which demonstrates how this scenario causes a hang:

import pytest
import trio
from trio_typing import TaskStatus


def test_stashed_child_nursery():

    _child_nursery = None

    async def waits_on_signal(
        ev: trio.Event(),
        task_status: TaskStatus[trio.Nursery] = trio.TASK_STATUS_IGNORED,
    ):
        '''
        Do some stuf, then signal other tasks, then yield back to "starter".

        '''
        await ev.wait()
        task_status.started()

    async def mk_child_nursery(
        task_status: TaskStatus = trio.TASK_STATUS_IGNORED,
    ):
        '''
        Allocate a child sub-nursery and stash it as a global.

        '''
        nonlocal _child_nursery

        async with trio.open_nursery() as cn:
            _child_nursery = cn
            task_status.started(cn)

            # block until cancelled by parent.
            await trio.sleep_forever()

    async def sleep_and_err(ev: trio.Event):
        await trio.sleep(0.5)
        doggy()
        ev.set()

    async def main():

        async with (
            trio.open_nursery() as pn,
        ):
            cn = await pn.start(mk_child_nursery)
            assert cn

            ev = trio.Event()
            cn.start_soon(sleep_and_err, ev)

            # this causes inf hang
            await cn.start(waits_on_signal, ev)

            # this does not.
            # cn.start_soon(waits_on_signal, ev)

    with pytest.raises(NameError):
        trio.run(main)

Analysis and discussion

@njsmith mentioned in chat

so what ends up happening is: sleep_and_err crashes with an error -> cn cancels everything inside it and tries to propagate the NameError -> but cn can't exit, because waits_on_signal is in a weird state where it's still executing in the parent, but cn knows that a start call is in progress. So cn waits to see what waits_on_signal is going to do -- if it exited, or if it called started, then either way cn can continue and propagate the NameError. But because start does neither, cn gets stuck forever, so the NameError can't propagate, so it can't reach pn and cause the start call to be cancelled.

I'm not sure if this suggests a problem in trio or not!

right now started always succeeds -- if start makes sure that the nursery is open and that it will stay open until the child calls started.

what I could imagine doing instead, is to say that the nursery closes while start is running, then started raises a RuntimeError, same as if you do start_soon into a closed nursery

The main question being which of the following 2 approaches should trio take:

  • keep .start() -> .started() never nursery-error-interruptible
  • the opposite such that this example test would not hang (my personal preference)

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions