Conversation

@UncleGrumpy
Collaborator

These changes fix several open supervisor issues, as well as a few small bugs discovered in the supervisor and test_supervisor.erl.

  • Implements the missing one_for_all restart strategy that is documented in the module.
  • Supervisors now obey the intensity and period options (see the sketch after this list).
  • Children that fail to restart are retried until the maximum restart intensity limit is reached.
  • Fixes an edge-case bug when terminating, as well as several typos.
  • Fixes some bugs in the test suite that left unreceived messages in the test environment, where they would be received by additional tests added to the end of the list in test_supervisor:test/0.
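
For illustration, a minimal sketch of a supervisor init/1 that exercises the behavior described above, assuming the OTP-style map form for the flags and child spec; the module name and option values here are made up, not taken from this PR:

%% one_for_all with an explicit restart budget: at most 3 restarts in any
%% 5-second window, after which the supervisor itself shuts down.
init(_Args) ->
    SupFlags = #{strategy => one_for_all, intensity => 3, period => 5},
    ChildSpecs = [
        #{id => my_worker,
          start => {my_worker, start_link, []},
          restart => permanent,
          shutdown => 5000,
          type => worker}
    ],
    {ok, {SupFlags, ChildSpecs}}.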

Closes #1855
Closes #1915
Closes #1957

These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later

Fix a missing edge case in loop_terminate/2 that could lead to loop_wait_termination/1 getting
stuck in an endless receive, preventing shutdown, if a later child crashes before loop_terminate/2
has iterated through the children to the now-dead one.

Fixes a typo in the init function error return where the atom `mod` was returned rather than the
module's name (`Mod`).

Fixes a typo in the child record `type` field and adds the missing parenthesis on the `child_type/0` type.

Signed-off-by: Winford <[email protected]>
Fixes some tests to catch the messages from child ping_pong_servers so they will not be received by
later tests. Changes the supervisor started in test_count_children to use a generic supervisor
without a child, rather than using the module's start_link, which always starts a child
ping_pong_server that sends messages back to the test environment. Fixes the test_ping_ping final
`receive` to match on a new `Pid4` (which would indicate a restart) rather than `Pid3`, for which
the monitor just caught the 'DOWN' message.
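
The cleanup amounts to draining anything a child ping_pong_server sent to the test process before the next test runs; a generic sketch of such a helper (not the PR's actual test code):

%% Drain the test process mailbox so messages sent by a child of one test
%% cannot be matched by a later test's receive.
flush_mailbox() ->
    receive
        _Any -> flush_mailbox()
    after 0 ->
        ok
    end.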

Signed-off-by: Winford <[email protected]>
Adds support for handling restarts for the `one_for_all` strategy that was documented, but lacked
an implementation.

Closes atomvm#1855

Signed-off-by: Winford <[email protected]>
Adds support for honoring the intensity and period supervisor options, preventing endless
crash-and-restart loops caused by misbehaving children.

Adds a test to ensure the settings work as expected on AtomVM and match OTP.
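
A hedged sketch of what such a test can look like; test_sup and its start_link/1 argument are hypothetical, and it assumes supervisor:which_children/1 is available:

%% With intensity => 1 and period => 5, the second kill within the period
%% exceeds the budget and the supervisor must terminate with reason shutdown.
test_max_intensity() ->
    process_flag(trap_exit, true),   % the supervisor is linked and will exit with shutdown
    {ok, Sup} = test_sup:start_link(#{intensity => 1, period => 5}),
    Ref = erlang:monitor(process, Sup),
    kill_child(Sup),
    kill_child(Sup),
    receive
        {'DOWN', Ref, process, Sup, shutdown} -> ok
    after 5000 ->
        erlang:error(supervisor_did_not_shut_down)
    end.

kill_child(Sup) ->
    case supervisor:which_children(Sup) of
        [{_Id, Pid, _Type, _Mods}] when is_pid(Pid) ->
            case is_process_alive(Pid) of
                true ->
                    exit(Pid, kill);
                false ->
                    %% stale pid: the restart has not completed yet
                    receive after 10 -> ok end,
                    kill_child(Sup)
            end;
        _Restarting ->
            receive after 10 -> ok end,
            kill_child(Sup)
    end.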

Closes atomvm#1915

Signed-off-by: Winford <[email protected]>
Fixes a function clause exception that would crash the supervisor if a child fails to restart.
Matches OTP's behavior: the supervisor continues trying to restart the child until it succeeds, or
the maximum `intensity` is reached within the allowed `period`.
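
For reference, the intensity/period accounting can be sketched as a list of restart timestamps pruned to the period window; this is a generic illustration of the OTP rule (more than Intensity restarts within Period seconds means shutdown), not the PR's actual code:

%% Now is a timestamp in seconds (e.g. erlang:monotonic_time(second)),
%% Restarts0 the previously recorded restart times, newest first.
add_restart(Now, Restarts0, Intensity, Period) ->
    %% keep only restarts that happened within the last Period seconds
    Restarts = [T || T <- [Now | Restarts0], Now - T =< Period],
    case length(Restarts) =< Intensity of
        true -> {ok, Restarts};          % still within budget: restart the child
        false -> {shutdown, Restarts}    % budget exceeded: terminate the supervisor
    end.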

Closes atomvm#1957

Signed-off-by: Winford <[email protected]>
@UncleGrumpy force-pushed the supervisor-one_for_all-main branch from e0f3afd to 7b2435d on November 4, 2025 18:02
Collaborator

@pguyot left a comment

Thank you for this work!

I looked at OTP's implementation and we're doing things differently, which may or may not be a good thing :)

OTP's terminate_child/2 is synchronous and I like the asynchronous implementation we currently have. Yet, it introduces some complexity.

I wonder if we should move further and make this entirely state based, i.e. avoid send_after/3 entirely and rely on the timeout feature of gen_server instead for the general case, or just a plain receive of exit messages for the terminate case. It may make this easier to test, or more pure. Also, be careful about processing messages scheduled with send_after, because we don't know if the child is still there.

{ok, NewChildren} ->
{ok, State1#state{children = NewChildren}};
{ok, NewChildren, RetryID} ->
erlang:send_after(50, self(), {try_again_restart, RetryID}),
Collaborator

It's different logic, but I would favor a timeout return for the gen_server over send_after/3, with a queue of restarts we need to tackle.
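
A sketch of that direction using the plain integer timeout that gen_server already supports; the restart_queue state field and the try_restart/2 helper are hypothetical names, not part of the PR:

%% When there are queued restarts, return {noreply, State, 0} so that
%% handle_info(timeout, State) fires as soon as the mailbox is idle, and
%% work through the queue one child at a time.
handle_info(timeout, #state{restart_queue = [Id | Rest]} = State0) ->
    State1 = try_restart(Id, State0#state{restart_queue = Rest}),
    maybe_continue(State1);
handle_info(timeout, #state{restart_queue = []} = State) ->
    {noreply, State}.

maybe_continue(#state{restart_queue = []} = State) -> {noreply, State};
maybe_continue(State) -> {noreply, State, 0}.   % process the next queued restart

Note that every other callback clause would also need to re-arm the timeout via maybe_continue/1 while the queue is non-empty, since an unrelated incoming message cancels a pending gen_server timeout.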

Collaborator Author

A timeout does make more sense here; that is closer to what I originally envisioned. I did end up implementing most of my changes differently internally than OTP does, mostly for the sake of simplicity and/or a smaller memory footprint; I just wanted the behavior to be the same for the user.

OTP does not use any delay between restart attempts. Part of the reason I used a send_after was to give a slight delay that might allow a GC or other external change to improve the chances of a successful restart of the child, but maybe we should just immediately (re)queue all restart attempts.

Collaborator Author

Sadly, our current gen_server implementation does not support {ok, State, {timeout, Time, Message}} returns from callbacks. I was thinking that adding hibernation support might be nice (especially for the benefit of supervisors), but I think supporting timeout actions in returns is even more important.

Collaborator

I didn't know about the timeout tuple; it didn't exist when I learnt about gen_server. We should indeed implement it, and meanwhile we could use send_after in this PR unless the timeout tuple is merged first.

Collaborator Author

> I didn't know about the timeout tuple; it didn't exist when I learnt about gen_server. We should indeed implement it, and meanwhile we could use send_after in this PR unless the timeout tuple is merged first.

It looks perfect for this situation, but I am struggling to get a test to pass on OTP using it. I did all of this work with test-driven development, writing the test for OTP first and then making sure my implementation passes as expected. It looks like a fairly simple improvement, once I understand how it’s supposed to work ;-)

Collaborator Author

@UncleGrumpy Nov 8, 2025

This is an OTP 28 feature; see #1961. Using it to process multiple restarts and update their true state individually as they are restarted is likely the best solution to the discussion below about the use of is_process_alive/1.

With one_for_all, when a child exits immediately before the supervisor gets an exit signal for itself, the supervisor never has a chance to exit because it is stuck looping, trying to kill a process that is already gone. My test fails every time if I remove the is_process_alive/1 check.

This happens because too much time is spent shutting down and then restarting all of the children before each child's pid gets updated in its record. I need to break this up into steps and update the state at each step of the way, triggering the next step with timeout tuples. Steps that process many children should update the state for each child immediately, before triggering the processing of the next one.

There would still be some races at shutdown if a child crashes right before the exit signal for the supervisor is received, but odds are much better if the processing is shorter and the true state is updated sooner.
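
A sketch of that stepwise shape with the classic integer timeout (the OTP 28 timeout tuple would carry the step as a message instead); the phase field and the terminate_one/2, restart_one/2, child_ids/1 helpers are hypothetical:

%% one_for_all handled one child per step: each step records the child's new
%% pid (or undefined) in the state before arming a zero timeout for the next
%% step, so an exit signal arriving in between sees an up-to-date state.
handle_info(timeout, #state{phase = {stopping, [Id | Rest]}} = State0) ->
    State1 = terminate_one(Id, State0),
    {noreply, State1#state{phase = {stopping, Rest}}, 0};
handle_info(timeout, #state{phase = {stopping, []}} = State) ->
    {noreply, State#state{phase = {starting, child_ids(State)}}, 0};
handle_info(timeout, #state{phase = {starting, [Id | Rest]}} = State0) ->
    State1 = restart_one(Id, State0),
    {noreply, State1#state{phase = {starting, Rest}}, 0};
handle_info(timeout, #state{phase = {starting, []}} = State) ->
    {noreply, State#state{phase = idle}}.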

{noreply, State}
end;
handle_info({try_again_restart, Id}, State) ->
Child = lists:keyfind(Id, #child.id, State#state.children),
Collaborator

The child can be gone here.
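
A minimal guard for that case; lists:keyfind/3 returns false when the child has since been deleted, and do_restart/2 here stands in for whatever the actual restart path is called:

handle_info({try_again_restart, Id}, #state{children = Children} = State) ->
    case lists:keyfind(Id, #child.id, Children) of
        false ->
            %% the child was removed (or the supervisor is shutting down)
            %% after the retry was scheduled: drop the stale message
            {noreply, State};
        #child{} = Child ->
            do_restart(Child, State)
    end.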


loop_terminate([#child{pid = undefined} | Tail], AccRemaining) ->
loop_terminate(Tail, AccRemaining);
loop_terminate([#child{pid = {restarting, _}} | Tail], AccRemaining) ->
Collaborator

If we're using `restarting` in pid, we could also put `terminating` there as well (?). Also, we should type this.

Collaborator Author

I did not see anywhere that OTP was using {terminating, pid()}, but for our implementation we might want that for internal tracking when children are using timeout shutdown. It wouldn't make sense for `brutal_kill`, since those should have pid set to undefined (or the child removed entirely) immediately.

Yes, it looks like I forgot to type the pid in the child record, and a few of the added supervisor record fields too.
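
For the missing types, something along these lines (field names beyond pid are elided here; the exact union is whatever the implementation settles on):

-type child_pid() :: undefined | pid() | {restarting, pid() | undefined}.

-record(child, {
    pid = undefined :: child_pid(),
    id :: term()
    %% start, restart, shutdown, type, modules unchanged
}).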

Collaborator Author

> we might want that for internal tracking when children are using timeout shutdown.

... actually this wouldn't make sense either; we are already tracking this in #state.restart.

loop_terminate([#child{pid = Pid} = Child | Tail], AccRemaining) when is_pid(Pid) ->
do_terminate(Child),
loop_terminate(Tail, [Pid | AccRemaining]);
case is_process_alive(Pid) of
Collaborator

I'm not sure about this optimization. The process could be gone between here and the exit message, and it's not worth it unless we need to ensure it's gone.

Collaborator Author

> I'm not sure about this optimization. The process could be gone between here and the exit message, and it's not worth it unless we need to ensure it's gone.

At some point we do need to ensure it's gone, or we can get stuck indefinitely in loop_wait_termination/1. I encountered this while testing after adding intensity/period support. It is possible to get stuck in loop_wait_termination, preventing the supervisor from shutting down after the maximum intensity has been reached.

Collaborator Author

@UncleGrumpy Nov 7, 2025

It's a little expensive, but because this only happens during shutdown it shouldn't matter. It would not be a terrible idea to iterate through the pids and make sure they are still alive before each tail call of loop_wait_termination, as sketched below.
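
A sketch of that idea, assuming the supervisor traps exits and is waiting on 'EXIT' messages (substitute the 'DOWN' shape if monitors are used); whether the extra process-table checks are needed is what the following reply questions:

loop_wait_termination([]) ->
    ok;
loop_wait_termination(Pids0) ->
    %% drop children that are already gone, so a missed exit message cannot
    %% leave us waiting forever
    case [P || P <- Pids0, is_process_alive(P)] of
        [] ->
            ok;
        Pids ->
            receive
                {'EXIT', Pid, _Reason} ->
                    loop_wait_termination(lists:delete(Pid, Pids))
            after 1000 ->
                loop_wait_termination(Pids)
            end
    end.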

Collaborator

What I meant is that if the child is gone, we must have received the DOWN message, so we don't need to query the process table to see if it's still there.
