
Conversation

@phargogh (Member) commented Jul 8, 2025

This PR monitors changes to the PIDs of workers in the TaskGraph's multiprocessing.Pool, so that if a worker is terminated for any reason, the whole graph is terminated and we log an error message. This avoids a deadlock resulting from events never being triggered that would allow task callables to execute.
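
In sketch form, the idea is something like the following (a simplified, illustrative sketch with hypothetical names, not the literal diff):

```python
# Simplified sketch of the monitoring idea (illustrative only, not the
# actual PR diff): poll the private ``Pool._pool`` list of worker
# processes and bail out if any worker's PID changes.
import multiprocessing
import threading
import time


def _monitor_worker_pids(pool, stop_event, on_change):
    """Poll worker PIDs and call ``on_change`` if a worker is replaced."""
    known_pids = {proc.pid for proc in pool._pool}  # private attribute!
    while not stop_event.is_set():
        current_pids = {proc.pid for proc in pool._pool}
        if current_pids != known_pids:
            on_change(known_pids, current_pids)
            return
        time.sleep(0.5)


if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2)
    stop_event = threading.Event()

    def _handle_change(old_pids, new_pids):
        # In taskgraph, this is where the whole graph would be torn down
        # and an error message logged.
        print(f'Worker change detected: {old_pids} -> {new_pids}')
        pool.terminate()

    monitor = threading.Thread(
        target=_monitor_worker_pids,
        args=(pool, stop_event, _handle_change),
        daemon=True)
    monitor.start()

    # ... submit tasks to the pool here ...

    stop_event.set()
    monitor.join()
    pool.terminate()
    pool.join()
```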

phargogh added 9 commits July 8, 2025 14:35

- Psutil is so much easier to use than the Python stdlib equivalents relating to getting the PIDs of child processes. This should not be a problem to install from conda-forge or even from PyPI these days. RE: natcap#109
- This allows for easier access to the state of the Graph. RE: natcap#109
- It turns out that psutil was not needed for this solution. RE: natcap#109
- It turned out that my modifications to this were not needed. RE: natcap#109
@phargogh self-assigned this Jul 8, 2025
@phargogh requested review from dcdenu4 and richpsharp July 8, 2025 23:19
@phargogh marked this pull request as ready for review July 8, 2025 23:19
@richpsharp (Collaborator)

Hey @phargogh, this seems straightforward, but I want to double-check why you need to do this and confirm it's the right solution. Thinking on a long arc, it is technically possible for process pools to shut down and restart processes when max_tasks_per_child is non-None, which would change the number of processes and their PIDs. I get that this shouldn't happen when max_tasks_per_child is None, but the point is that worker replacement is well-defined behavior, whereas accessing ._pool isn't necessarily. I also see there is a BrokenProcessPool exception that should be raised if a process dies unexpectedly (https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.process.BrokenProcessPool), and that should terminate the whole pool... so I just want to make sure this is solving a bug that you actually have. Can you say a little about that?
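
For concreteness, here's a toy POSIX-only example of the stdlib behavior I mean (plain concurrent.futures, not taskgraph code):

```python
# Toy demonstration (POSIX-only) of stdlib behavior: when a worker in a
# ProcessPoolExecutor dies abruptly, the executor raises
# BrokenProcessPool instead of silently replacing the worker.
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def _kill_self():
    # Simulate the kernel's OOM killer terminating the worker.
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(_kill_self)
        try:
            future.result()
        except BrokenProcessPool:
            # The executor is now unusable; further submits also fail.
            print('Worker died unexpectedly; the pool is broken.')
```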

@phargogh (Member, Author) commented Jul 9, 2025

Thanks for taking a look at this, @richpsharp!

So, the main issue here is that I'm working around the lack of BrokenProcessPool: it's a feature of concurrent.futures, not multiprocessing, and the current version of taskgraph uses multiprocessing.Pool as its worker backend. Migrating to concurrent.futures is absolutely something I'd like to look into for a future version of the library (#112).

Because I can't catch BrokenProcessPool in this case, the work here simply accommodates the known behavior of multiprocessing.Pool. I realize that accessing _pool or any other private attribute isn't kosher or even guaranteed, but it is also currently the only way to reach the underlying processes and detect that a worker has been replaced.

> just want to make sure this is solving a bug that you actually have

Yes. The situation arises when I'm operating in a memory-constrained environment like Sherlock, where I have to request a fixed amount of memory for my job. If one of my multiprocessing.Pool tasks exceeds the allowed memory, the kernel swoops in and kills the underlying process. When this happens, multiprocessing.Pool notices the missing process and silently creates a new one, and the graph deadlocks waiting on a result that will never arrive. Because I cannot handle this event any other way without either 1) implementing my own pool that kills the graph instead of creating a replacement process or 2) refactoring to use concurrent.futures to take advantage of BrokenProcessPool, the changes I'm proposing here are a simple, straightforward way to work within the current behavior of multiprocessing.Pool and shut down the graph when a change in worker PIDs is detected.
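
A rough POSIX-only repro of that replacement behavior (toy code, not part of this PR):

```python
# Rough POSIX-only repro: SIGKILL a multiprocessing.Pool worker (standing
# in for the kernel's OOM killer) and watch the pool silently spawn a
# replacement instead of raising. Any in-flight task's result would
# simply never arrive, which is where the deadlock comes from.
import multiprocessing
import os
import signal
import time


def _noop(_):
    time.sleep(0.1)


if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=1)
    pool.map(_noop, range(1))  # make sure the worker is up and running
    old_pid = pool._pool[0].pid  # private attribute, as discussed above

    os.kill(old_pid, signal.SIGKILL)
    time.sleep(1)  # give the pool's maintenance thread time to respawn

    new_pid = pool._pool[0].pid
    print(f'Worker silently replaced: {old_pid} -> {new_pid}')

    pool.terminate()
    pool.join()
```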

dcdenu4 previously approved these changes Jul 11, 2025

@dcdenu4 (Member) left a comment

Thanks @phargogh. This workaround makes sense to me, given what you laid out in response to @richpsharp's good questions. I'll hold off on merging so Rich can take another look.

@phargogh (Member, Author)

I'm currently evaluating whether a move to concurrent.futures would be an acceptable alternative, so I'm moving this to draft for now.

@phargogh marked this pull request as draft July 15, 2025 23:45