
Parallelization #5


Closed

zacps opened this issue Mar 11, 2021 · 16 comments
Assignees
Labels
enhancement New feature or request p-medium s-2 Approximately 20 hours of work

Comments

@zacps
Owner

zacps commented Mar 11, 2021

Each run should be executed in parallel if possible.

The library I was using to do this is joblib with the loky backend.

Initially let's only care about a single machine running a single threaded task on multiple cores. This is simple, but a good place to start.

The main complication here is that each task has to run in a separate process, otherwise we'll run afoul of contention due to the GIL.

Joblib's loky backend requires that all inputs and outputs are serialized to disk and then deserialized to pass them between processes. From memory I had to do some funky things to serialize functools.partial. We should try to hide this from the user as much as possible.

Additionally, it doesn't forward stdout/stderr to the parent process, so all print (and similar) calls get voided. Fixing this would be great; patching joblib/loky might be the best solution, but I haven't investigated in detail.
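
For reference, a minimal sketch of what running tasks through joblib with the loky backend looks like (the function and task names here are just illustrative, not part of this project):

```python
# Minimal sketch: run independent tasks in separate processes via loky.
from functools import partial

from joblib import Parallel, delayed


def run_experiment(seed, n_iterations):
    # Placeholder for a real (long-running) experiment.
    return sum(i * seed for i in range(n_iterations))


if __name__ == "__main__":
    task = partial(run_experiment, n_iterations=1_000)

    # n_jobs=-1 uses one worker process per core; loky pickles the callable
    # and its arguments to ship them between processes.
    results = Parallel(n_jobs=-1, backend="loky")(
        delayed(task)(seed) for seed in range(8)
    )
    print(results)
```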

@zacps zacps added the enhancement New feature or request label Mar 11, 2021
@ghost

ghost commented Mar 11, 2021

Are we intending to redirect the stdout/stderr streams so they're viewable in real time via the main interface, or are we more interested in just logging the output of those streams and displaying it after the task has completed?

Also something to note here: if we intend to run each individual task in its own process, a 'process pool' of some sort should be considered so we don't have to constantly start and tear down process objects (which is quite expensive).

@zacps
Owner Author

zacps commented Mar 11, 2021

Are we intending to redirect the stdout/stderr streams so they're viewable in real time via the main interface, or are we more interested in just logging the output of those streams and displaying it after the task has completed?

I'd consider a live view (probably via a callback) into metrics/logs/etc. to be an add-on feature.

Redirecting stdout/stderr to the main process has benefits even without that, however. If you're running Python from a terminal then you won't get any output from child processes (in my experience). If you're running it under a job supervisor (systemd, supervisorctl, etc.) then the output likewise won't appear in its logs.

Also something to note here: if we intend to run each individual task in its own process, a 'process pool' of some sort should be considered so we don't have to constantly start and tear down process objects (which is quite expensive).

From memory, process startup on Linux takes ~100 microseconds (can't find a source right now), so it's not a big cost compared to the experiments themselves (which could easily take multiple hours).
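
One possible workaround for the lost output (a sketch only, not something joblib/loky does for us): have the wrapped task capture its own stdout in the worker and hand it back with the result, then replay it in the parent. Note this only surfaces output after a task finishes, not in real time:

```python
# Sketch: capture a task's stdout inside the worker and re-emit it in the parent.
import io
import sys
from contextlib import redirect_stdout
from functools import wraps

from joblib import Parallel, delayed


def capture_output(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        buffer = io.StringIO()
        with redirect_stdout(buffer):
            result = fn(*args, **kwargs)
        return result, buffer.getvalue()
    return wrapper


@capture_output
def task(i):
    print(f"task {i} starting")  # would normally be lost in the child process
    return i * i


if __name__ == "__main__":
    for result, output in Parallel(n_jobs=2, backend="loky")(
        delayed(task)(i) for i in range(4)
    ):
        sys.stdout.write(output)   # replay the child's output in the parent
        print("result:", result)
```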

@zacps zacps added the p-medium label Mar 12, 2021
@zacps zacps assigned ghost Mar 12, 2021
@zacps zacps added the s-2 Approximately 20 hours of work label Mar 12, 2021
@zacps
Owner Author

zacps commented Mar 12, 2021

Also assigned: @NeedsSoySauce

@ghost ghost assigned NeedsSoySauce Mar 12, 2021
@ghost

ghost commented Mar 15, 2021

@NeedsSoySauce So I did a bit of thinking about this and I think the following is a good set of goals for the next two weeks:

  • Plan/architect an internal API that the others can consume to run tasks in parallel (in essence, the method signature)
  • Create a method to take Python code/a task and add it to a queue (potentially with a priority)
  • Create a method to spawn an external process and run the passed-in Python code
  • Create a method to take code/tasks from the queue and pass them to the method that spawns a process and runs the code
  • Configure the above method to take advantage of the parallelism available on the system
  • Configure the above method to take options like maximum parallel tasks, maximum run time, etc.
  • Redirect standard output/error streams to the host process (perhaps with an identifier so logging doesn't get mixed up)

Let me know if you think of anything that's missing or anything else we should aim for. Also, if you want to work on a specific task, let me know so we don't both end up working on the same ones.
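
To make the first bullet a bit more concrete, here's a rough sketch of what that internal API could look like (every name here is a placeholder, not an agreed design; priority handling and output redirection are left out):

```python
# Placeholder sketch of the internal API: queue tasks, then run them
# across a pool of worker processes.
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass, field
from queue import Queue
from typing import Any, Callable, Optional


@dataclass
class Task:
    func: Callable[..., Any]          # must be picklable to cross processes
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)


class TaskRunner:
    def __init__(self, max_workers: Optional[int] = None,
                 timeout: Optional[float] = None):
        # max_workers=None lets the executor use one worker per core.
        self.max_workers = max_workers
        self.timeout = timeout
        self._queue = Queue()

    def submit(self, task: Task) -> None:
        self._queue.put(task)

    def run_all(self) -> list:
        tasks = []
        while not self._queue.empty():
            tasks.append(self._queue.get())
        with ProcessPoolExecutor(max_workers=self.max_workers) as pool:
            futures = [pool.submit(t.func, *t.args, **t.kwargs) for t in tasks]
            # Wait for each task, enforcing the (optional) maximum run time.
            return [f.result(timeout=self.timeout) for f in futures]
```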

@NeedsSoySauce
Collaborator

@Dewera Looks good. For an MVP I think this is a good set of goals. The priority queue part can probably be done later (I don't think it's needed for an MVP). Regarding "Configure the above method to take advantage of the parallelism available on the system", what do you mean by this? Are you referring to something like detecting what kind of hardware/environment the user is running things on and adjusting our 'parallelization strategy' based on that?

As for what tasks I'd be keen on: any of them, to be honest. The IO redirection and process spawning parts sound cool to me, but I have little experience with either. Do we plan to use something like joblib to take care of some of these tasks for us as well?

@zacps
Owner Author

zacps commented Mar 15, 2021

Output is apparently an IPython-specific issue. Up to you if you want to tackle it.

@ghost

ghost commented Mar 15, 2021

@NeedsSoySauce I'm not familiar with how parallel code works in Python, or whether it sets that up automatically for you, so maybe that task is irrelevant. If possible, though, it would make sense to fold this into the configuration options, i.e. don't use more than x threads.

Agreed on the priority queue; we can just use a standard queue for now.

As for your query regarding spawning processes, again my knowledge of the Python ecosystem is limited, but I would say we use subprocess to spawn external processes and either pass Python code directly or wrap it in a lambda and pass that around. Really up to whatever is possible/works best.
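
Purely as an illustration of the subprocess route (not a committed design), passing code as a string to a fresh interpreter would look something like:

```python
# Illustrative only: spawn a separate Python process and hand it code as a string.
import subprocess
import sys

code = "print(sum(i * i for i in range(10)))"
completed = subprocess.run(
    [sys.executable, "-c", code],  # run the code in a fresh interpreter
    capture_output=True,
    text=True,
    check=True,
)
print("child stdout:", completed.stdout, end="")
```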

@ghost

ghost commented Mar 15, 2021

@zacps Probably good to get your input here: are we intending to work with lambdas or raw Python code when passing user intent around the different methods? From my understanding it's easy to serialise and deserialise between the two; I'd just like to keep it in mind when developing.

@zacps
Owner Author

zacps commented Mar 15, 2021

Are we intending to work with lambdas or raw Python code

I'm not sure what you mean by this. Lambdas should be supported, but they may be significantly harder than plain functions depending on your implementation. I think cloudpickle handles this for joblib; dill might also be capable of it.

Options:

Non-options:
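
As a quick illustration of why lambdas need special handling: the stdlib pickle serialises functions by reference and refuses lambdas, whereas cloudpickle serialises them by value:

```python
import pickle

import cloudpickle

square = lambda x: x * x

try:
    pickle.dumps(square)
except Exception as exc:
    # stdlib pickle looks the function up by name, which fails for a lambda.
    print("stdlib pickle failed:", exc)

payload = cloudpickle.dumps(square)   # bytes that can cross a process boundary
restored = cloudpickle.loads(payload)
print(restored(6))                    # prints 36
```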

@ghost

ghost commented Mar 15, 2021

Are we intending to work with lambdas or raw Python code

I'm not sure what you mean by this. Lambdas should be supported, but they may be significantly harder than plain functions depending on your implementation. I think cloudpickle handles this for joblib; dill might also be capable of it.

Options:

Non-options:

I mean in terms of running the actual tasks: what is being given to us? Are we receiving a shell script, a Python lambda function, Python code, etc. that we need to forward to the processes we are spawning? I don't currently know what we are working with on our end.

@zacps
Owner Author

zacps commented Mar 16, 2021

Assume an arbitrary Callable

@NeedsSoySauce
Collaborator

NeedsSoySauce commented Mar 19, 2021

@zacps @Dewera Do we want this to be synchronous (e.g. a user submits the tasks they want to run and we only return after everything has completed), or do we want this to be asynchronous? I took a look at joblib and it seems to be purely synchronous, so is that all we want? Just wondering if this could get in the way of us adding, e.g., progress reporting/notifications.

@zacps
Owner Author

zacps commented Mar 19, 2021

Do we want this to be synchronous?

Sync for now at least.

@zacps
Owner Author

zacps commented Mar 21, 2021

Extra thing to consider: What happens when an exception is raised in a task?

Options:

  • Abort tasks and re-raise
  • Cancel tasks and re-raise
  • Let other tasks exit normally then return or re-raise

For now we probably want the first behavior but at some point we'll probably also want the third.
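
A rough sketch of the first option using concurrent.futures (just to make the behavior concrete, not a committed implementation):

```python
from concurrent.futures import FIRST_EXCEPTION, ProcessPoolExecutor, wait


def run_all_or_abort(tasks):
    """Run zero-argument (picklable) callables in parallel; on the first
    failure, cancel anything that hasn't started and re-raise in the parent."""
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(fn) for fn in tasks]
        done, not_done = wait(futures, return_when=FIRST_EXCEPTION)
        for f in not_done:
            f.cancel()          # pending tasks never start; running ones finish
        for f in done:
            exc = f.exception()
            if exc is not None:
                raise exc       # surface the failure to the caller
        # No exception: every future completed; return results in submission order.
        return [f.result() for f in futures]
```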

@ghost

ghost commented Mar 21, 2021

@NeedsSoySauce and I briefly discussed this last Thursday. The thinking was to start off with the first option and then eventually let the user choose how they want it to play out (essentially an option in the configuration).

@zacps
Owner Author

zacps commented May 6, 2021

I'm going to close this; follow-ups can be their own issues.

@zacps zacps closed this as completed May 6, 2021