
Conversation

@phargogh (Member) commented Jul 8, 2025

This PR monitors changes to the PIDs of workers in the TaskGraph's multiprocessing.Pool, so that if a worker is terminated for any reason, the whole graph is terminated and we log an error message. This avoids a deadlock resulting from events never being triggered that would allow task callables to execute.
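
In sketch form, the idea is something like the following (a simplified, illustrative sketch with hypothetical names, not the literal diff):

```python
# Simplified sketch of the monitoring idea (illustrative only, not the
# actual PR diff): poll the private ``Pool._pool`` list of worker
# processes and bail out if any worker's PID changes.
import multiprocessing
import threading
import time


def _monitor_worker_pids(pool, stop_event, on_change):
    """Poll worker PIDs and call ``on_change`` if a worker is replaced."""
    known_pids = {proc.pid for proc in pool._pool}  # private attribute!
    while not stop_event.is_set():
        current_pids = {proc.pid for proc in pool._pool}
        if current_pids != known_pids:
            on_change(known_pids, current_pids)
            return
        time.sleep(0.5)


if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2)
    stop_event = threading.Event()

    def _handle_change(old_pids, new_pids):
        # In taskgraph, this is where the whole graph would be torn down
        # and an error message logged.
        print(f'Worker change detected: {old_pids} -> {new_pids}')
        pool.terminate()

    monitor = threading.Thread(
        target=_monitor_worker_pids,
        args=(pool, stop_event, _handle_change),
        daemon=True)
    monitor.start()

    # ... submit tasks to the pool here ...

    stop_event.set()
    monitor.join()
    pool.terminate()
    pool.join()
```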

phargogh added 9 commits July 8, 2025 14:35

- Psutil is so much easier to use than the Python stdlib equivalents relating to getting the PIDs of child processes. This should not be a problem to install from conda-forge or even from PyPI these days. RE: natcap#109
- This allows for easier access to the state of the Graph. RE: natcap#109
- It turns out that psutil was not needed for this solution. RE: natcap#109
- It turned out that my modifications to this were not needed. RE: natcap#109
@phargogh self-assigned this Jul 8, 2025
@phargogh requested review from dcdenu4 and richpsharp July 8, 2025 23:19
@phargogh marked this pull request as ready for review July 8, 2025 23:19
@richpsharp (Collaborator)

Hey @phargogh, this seems straightforward, but I want to double-check why you need to do this and confirm it's the right solution. Thinking on a long arc, it is technically possible for process pools to shut down and restart processes when max_tasks_per_child is non-None, which would change the number of processes and their PIDs. I get that this shouldn't happen when max_tasks_per_child is None, but the point is that worker replacement is well-defined behavior, whereas accessing ._pool isn't necessarily. I also see there is a BrokenProcessPool exception that should be raised if a process dies unexpectedly (https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.process.BrokenProcessPool), and that should terminate the whole pool... so I just want to make sure this is solving a bug that you actually have. Can you say a little about that?
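
For concreteness, here's a toy POSIX-only example of the stdlib behavior I mean (plain concurrent.futures, not taskgraph code):

```python
# Toy demonstration (POSIX-only) of stdlib behavior: when a worker in a
# ProcessPoolExecutor dies abruptly, the executor raises
# BrokenProcessPool instead of silently replacing the worker.
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def _kill_self():
    # Simulate the kernel's OOM killer terminating the worker.
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(_kill_self)
        try:
            future.result()
        except BrokenProcessPool:
            # The executor is now unusable; further submits also fail.
            print('Worker died unexpectedly; the pool is broken.')
```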

@phargogh (Member, Author) commented Jul 9, 2025

Thanks for taking a look at this, @richpsharp!

So, the main issue here is that I'm working around the lack of BrokenProcessPool: it's a feature of concurrent.futures, not multiprocessing, and the current version of taskgraph uses multiprocessing.Pool as its worker backend. Migrating to concurrent.futures is absolutely something I'd like to look into for a future version of the library (#112).

Because I can't catch BrokenProcessPool in this case, the work here simply accommodates the known behavior of multiprocessing.Pool. I realize that accessing _pool or any other private attribute isn't kosher or even guaranteed, but it is also currently the only way to reach the underlying processes and detect that a worker has been replaced.

> just want to make sure this is solving a bug that you actually have

Yes. The situation arises when I'm operating in a memory-constrained environment like Sherlock, where I have to request a fixed amount of memory for my job. If one of my multiprocessing.Pool tasks exceeds the allowed memory, the kernel swoops in and kills the underlying process. When this happens, multiprocessing.Pool notices the missing process and silently creates a new one, and the graph deadlocks waiting on a result that will never arrive. Because I cannot handle this event any other way without either 1) implementing my own pool that kills the graph instead of creating a replacement process or 2) refactoring to use concurrent.futures to take advantage of BrokenProcessPool, the changes I'm proposing here are a simple, straightforward way to work within the current behavior of multiprocessing.Pool and shut down the graph when a change in worker PIDs is detected.
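
A rough POSIX-only repro of that replacement behavior (toy code, not part of this PR):

```python
# Rough POSIX-only repro: SIGKILL a multiprocessing.Pool worker (standing
# in for the kernel's OOM killer) and watch the pool silently spawn a
# replacement instead of raising. Any in-flight task's result would
# simply never arrive, which is where the deadlock comes from.
import multiprocessing
import os
import signal
import time


def _noop(_):
    time.sleep(0.1)


if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=1)
    pool.map(_noop, range(1))  # make sure the worker is up and running
    old_pid = pool._pool[0].pid  # private attribute, as discussed above

    os.kill(old_pid, signal.SIGKILL)
    time.sleep(1)  # give the pool's maintenance thread time to respawn

    new_pid = pool._pool[0].pid
    print(f'Worker silently replaced: {old_pid} -> {new_pid}')

    pool.terminate()
    pool.join()
```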

dcdenu4 previously approved these changes Jul 11, 2025

@dcdenu4 (Member) left a comment

Thanks @phargogh. This workaround makes sense to me, given what you laid out in response to @richpsharp's good questions. I'll hold off on merging so Rich can take another look.

@phargogh (Member, Author)

I'm currently evaluating whether a move to concurrent.futures would be an acceptable alternative, so I'm moving this to draft for now.

@phargogh marked this pull request as draft July 15, 2025 23:45