Interaction between workflow system and scheduler (malleable scheduling) #137

weallcock · 2020-12-04T21:43:36Z

weallcock
Dec 4, 2020

Use Case Summary

It would be helpful if there was a mechanism by which a scheduler could negotiate an increase or decrease in the number of nodes available to a job. This is essentially a request for what is sometimes referred to as malleable scheduling. This would require changes to both the scheduler and the workflow system (or any other malleable job) as well as some standard way of negotiating the changes.

Use Case Details

To support "on-demand" computing we intend to designate a subset of our resources to be eligible for preemption to make room for "on-demand" or "deadline driven" jobs. An example would be a light source that wants to take a dataset, send it to our facility for analysis, and wants the results back ASAP, certainly not after it sits in a queue for hours. We have chosen to evaluate the preemption path to try and minimize "wasted" cycles, as the nodes might otherwise sit idle for a significant fraction of the time if we just dedicated nodes to them (how much time is wasted obviously depends on the use case). To minimize the impact of preemption, it would be ideal if we preferentially scheduled small, short jobs on the resources designated for preemption so we could kill the minimum number of jobs to get the resources we need (we could kill ten one node jobs, to get 10 nodes rather than a 100 node job) and lose a minimum amount of computation (if the jobs are only minutes in length we lose at most minutes of computation) . However, in many cases, those small, short jobs are not visible to the scheduler because the workflow system has submitted a job to "provision" a larger set of resources for a longer period of time and then runs many smaller jobs within those resources (i.e. a Condor "glide-in"). This means that today when preempting, the scheduler would be forced to kill the entire job, even if it needed only a small fraction of the nodes it was using.

A use case involving an increase in the number of nodes is also potentially valuable. As a scheduler, if I know that I have a "drain window" (nodes that I expect to be empty for a period of time because I have no job in the queue that can fit until another job ends), I could offer those nodes to the workflow system to use temporarily.

If there were a mechanism that allowed a job (the workflow system) to designate itself as "malleable" to the scheduler and then a mechanism for negotiating altering the number of nodes available to the job we could potentially eliminate a lot of wasted cycles.

References

hategan · 2020-12-04T22:40:55Z

hategan
Dec 4, 2020
Maintainer

I think it would be relatively straightforward to slap some "malleable" "attribute" on a job specification that a relevant adapter could communicate to a queuing system. And, in terms of a pilot job system, it should not be too difficult to have it respond to changes in job shapes.

One thing I'm having difficulty with is the level of support that this will have across LRMs.

The other thing I am having difficulty with is the level of interest users will have for this. For AWS, GCE, etc., preemptible instances come with a reduction in price. If there isn't some equivalent incentive here, we can implement it as much as we want, but I suspect users will pick the solution that gives them the best chances of finishing their tasks faster, which I'm guessing is not to select the "malleable" option.

To be clear: I like it; I think all LRMs and pilot job systems should support it and I think users should somehow be incentivized to enable it. How far are we from that though aside from this API?

0 replies

hategan · 2020-12-08T19:30:35Z

hategan
Dec 8, 2020
Maintainer

So @yadudoc points out that a rather useful thing that could be adopted by Parsl immediately would be if the client can mark certain nodes from a job as idle such that the LRM can reclaim those nodes.

This can help by reducing the number of jobs that a pilot job system needs to submit to the LRM and improve utilization.

The difficulty is that the interface between the LRM and the job API would have to be a bit more complex. For starters, it would require that the LRM allow for some means to retrieve a set of handles to the nodes in a job as well as provide an interface for communicating desired changes in job shape (e.g., remove list_of_nodes from job and possibly add n nodes to job).

I think it might be useful to work on figuring out the details of such an interface.

0 replies

Larofeticus · 2020-12-08T21:22:20Z

Larofeticus
Dec 8, 2020

Slurm can remove nodes from a job using the scontrol command, and this function exists currently. I don't expect Slurm to ever have the ability to increase the size of a running job or merge two jobs into a single larger one.

Features in this area are about optimizing the utilization of resources, not about a user computation succeeding or failing. That makes the implementation of executors plausible: "I need 100 nodes for the first half, and then 50 nodes for the second half" can be put into the interface, and the outcome would either be "Yes, this LRM supports it, you successfully use 25% less allocation." or "LRM can't reclaim nodes, your task still completes, but you paid an extra 25% for it."

0 replies

weallcock · 2020-12-09T15:28:12Z

weallcock
Dec 9, 2020
Author

PBS also supports the ability for a job to release nodes early (pbs_release_nodes), but that is up to the job. Like Slurm, it does not currently have a way to add nodes to a job. There is also no way for the scheduler and the job to negotiate. Getting everyone to agree on a way to do that would be hard, but useful. As I noted above, I might only need 10 nodes out of a jobs 100 nodes and if it was running a batch of 1 node jobs, it might be able to just release 10 nodes and keep running. As an idea (not suggesting it the way to go), you can send an integer along with SIGUSER1. You could do something like send SIGUSER1 and the number of nodes you need. If they release the nodes you need within some deadline, you got what you need and everyone lives happily ever after, if not, you preempt the entire job.

0 replies

weallcock · 2020-12-09T15:30:52Z

weallcock
Dec 9, 2020
Author

@Larofeticus Are you suggesting that the jobs tell the scheduler up front that it intends to release nodes? If so,

Would you include that in your scheduling decisions?
is the job that releases or does the scheduler take them away?

I just wasn't clear what you were saying in your 2nd paragraph.

0 replies

weallcock · 2020-12-09T15:36:47Z

weallcock
Dec 9, 2020
Author

@hategan The issue of interest from the users is not really at issue here. That is policy and will be decided by the facilities in order to get the behavior they want, not unlike the way a business prices its products. Currently, at the ALCF, we allow you to run in backfill even if your allocation is negative. You only run if you can fit in a backfill window (the nodes would otherwise be idle) so you get extra cycles and we get better utilization. For the preemption we are planning we have not decided what the incentives will be yet, but we recognize that we will need something. Having this functionality would mean we could support on-demand jobs and have less impact on utilization and the preempted jobs.

0 replies

Larofeticus · 2020-12-09T15:53:12Z

Larofeticus
Dec 9, 2020

@weallcock No, it's not something done in advance. A job is running, a process owned by the user can run an 'scontrol' command to change the nodelist of the job to a subset of itself, then the job goes through a RESIZING state, and the removed nodes are changed to idle to be allocated elsewhere.

The second paragraph is about the JPSI interface including features which are not supported by all executors. There is a difference between a missing executor capability making a job impossible to complete, and one which makes it less billing efficient.

0 replies

weallcock · 2020-12-09T15:58:00Z

weallcock
Dec 9, 2020
Author

Got it. That makes sense. Thanks for the clarification.

0 replies

hategan · 2020-12-09T19:42:21Z

hategan
Dec 9, 2020
Maintainer

The issue of interest from the users is not really at issue here.

This adversarial relationship between facilities and their users worries me a bit.

0 replies

andre-merzky · 2020-12-15T18:47:37Z

andre-merzky
Dec 15, 2020
Maintainer

I support the proposal for malleable allocations: we do see several use cases where this could optimize utilization and reduce cost of experiments. In order to add this capability to the API, we would though need to understand how different LRMs implement that capability (if they do), otherwise it is difficult when (while QUEUED, ACTIVE, SUSPENDED?) and how (a command from the user, some action from within the allocation, predefined in the initial job request?) a resize can be triggered, what parameters need to be provided (new layout, time to resize, trigger reschedule, enforce suspend/resume?), and what are the constraints on the resize (same queue? same node type? same executable/batch script?). Is that something we can, at least roughly, collect, or is the resize capability so rare / unavailable that we can make those up according to our set of use cases?

0 replies

weallcock · 2020-12-16T23:06:35Z

weallcock
Dec 16, 2020
Author

Largely, I think no one has really done this yet, so I think we try something and iterate. So, here is what I have been thinking... A job throws an attribute in its job description that says it is malleable. What that implies is that it supports some mechanism, which we need to define, to interact with the scheduler. One good candidate for this would be workflow jobs that have a bunch of nodes, but are running a bunch of small jobs. I (the scheduler) need 10 nodes, and there are not 10 available. A running workflow job is marked malleable and has 100 nodes that fit my needs. I "tell" them I need 10 nodes. If they do not/ can not release them (pbs_release_nodes or whatever the Slurm equivalent it) in x seconds, I will preempt their entire job. Hopefully, they release just what I need. To answer some of Andre's questions, I dont see a situation where this would apply to QUEUED. Certainly ACTIVE. I am not sure about SUSPENDED, since we never use that. I dont think queues matter here, at least in my example. We are talking to a running job and I (the scheduler) have already decidedI want to use them for this job. I think we have to tell them how many nodes of what type. Since, at least right now, we can't give them back, time doesn't matter. The application giving up the nodes can give up any set they want to as long as the meet the requirements, it does not need to know or care what will be done with them. Thoughts?

0 replies

andre-merzky · 2020-12-16T23:38:04Z

andre-merzky
Dec 16, 2020
Maintainer

If they do not/ can not release them (pbs_release_nodes or whatever the Slurm equivalent it) in x seconds, I will preempt their entire job.

Ah, interesting - I was looking at this from the job (or user) perspective: while running a workflow in a malleable job allocation, I happen to obtain new tasks which require more nodes than the allocation has, so I request more nodes from the scheduler to add to the job -- and inversely when tasks are tapering off (temporarily or during draining), then the job could release nodes back to the LRM.

I "tell" them [the job] I need 10 nodes.

If the job is, for example, currently running a set of large MPI tasks, it may just not have the ability to release 10 nodes at that point. In that case, malleable is just the same as preemptable, right?

0 replies

weallcock · 2020-12-17T15:20:28Z

weallcock
Dec 17, 2020
Author

First, addressing your specific questions above. The releasing nodes as you taper off, or like the example given above, you need to do some initial preparatory computation up front, but then need fewer nodes for the rest of the run is why the pbs_release_nodes functionality was added.

You bring up an interesting point about malleable vs preemptable. I was giving a very specific use case that we have in mind and in our case, yes, that is true, but we should not assume that. It is a matter of policy. It could be the case that there is a job marked malleable and I ask them to give up the nodes. If they do, great, but if not, I actually go preempt some other job. I believe malleable and preemptable should be independent concepts and how they interact is a matter of site policy.

I see the use case for adding nodes and I think we should design that capability into whatever interface we come up with, but it is a much more difficult use case. First, since most clusters run at very high utilization rates, the odds of the scheduler having the nodes available is low and it is essentially overriding the scheduling policy (you are basically "cutting in line"). Also, I know PBS can't do it, and it was mentioned above that Slurm does not support that either.

The closest we have come to doing something along those lines was via an HTCondor glide-in. To increase our utilization we experimented with the idea of automatically running HTCondor glide-ins as the last step of the post-job script, but as far as the scheduler was concerned the nodes were idle. When the scheduler started the next job, the first step of the prejob script was a condor down fast or something to that equivalent. The use of HTCondor fell off, so it never really went anywhere. The next step would have been to essentially mimic some of our users. Some science, LQCD being an example, can be very flexible in terms of the size and wall time of their jobs. They have scripts that monitor the drain windows on our machines and then submit jobs specifically tailored to fit those windows. We could do a similar sort of thing with a workflow job if we knew it had queued work. That wouldn't really be using the interface as described here, but that is the only way I have seen trying to take advantage of increasing nodes opportunistically work.

0 replies

andre-merzky · 2020-12-17T18:06:58Z

andre-merzky
Dec 17, 2020
Maintainer

It could be the case that there is a job marked malleable and I ask them to give up the nodes. If they do, great, but if not, I actually go preempt some other job.

Thanks for that scenario, I see your point.

We could do a similar sort of thing with a workflow job if we knew it had queued work. That wouldn't really be using the interface as described here, but that is the only way I have seen trying to take advantage of increasing nodes opportunistically work.

Ack.

Still, the main problem with respect to the API defined here remains LRM support: adding a malleable attribute is no problem really - as of the last iteration the API allows for implementation specific job attributes, and malleable could be immediately supported via that route. However: any J/PSI implementation would depend on the LRM support for the functionality, and as you state, that is rare and not universally available.

Further, as you stated, one needs to define a mechanism to notify a job of resize requests. A signal would not work I guess, as it can't carry information about the request size, so one would need to augment or replace that with some other mechanism. Even though that mechanism is likely out of scope of J/PSI (at least at this stage), I am curious if you have thoughts or preliminary ideas about that point?

0 replies

weallcock · 2020-12-18T00:05:15Z

weallcock
Dec 18, 2020
Author

Actually, a signal can carry an integer along with it, so we could define that to be the number of nodes requested, which would work fine until we want to also give them nodes.

That being said, the worflow apps are going to need to be modified to support this, so we could just define an API. They would have to listen on a port, or have a REST interface or something, we would have to figure out security so another user can't mess with them (give me all your nodes), and we would have to come up with a wire protocol. Responding to a signal is pretty standard, but the API would give us more flexibility.

I don't recognize the LRM term, but I believe that means things like PBS and Slurm. So, I am just spitballing here, but...

Application (parsl, balsam, condor) has to register a signal handler, it gets the signal, checks the integer, sees if it can give up that many nodes, if it can, it does, if not it doesn't. How long do they have? With a signal we can't tell them, nor can they tell us if they are going to or not. I guess they could signal back with SIGUSR2 with the number they released. We are also assuming every node is an acceptable candidate. If they have different classes of resources, we have no way of telling them which ones are acceptable. If we defined an API, we would tell them how many, what type, and how long they have to do it. We do have a race condition. If they sent the ack right at the buzzer, we might start preempting something and then find out they released the nodes. That is suboptimal, but doesn't actually break anything.
LRM (I am thinking PBS here) - We need to play with this and see how easily we could plug this into an event or the scheduling logic. The general outline would be
- job comes in to the on-demand queue
- we look to see if there are available nodes; if yes, we are done, if not
- we look at malleable jobs and see if any of them have the resources we need, if yes, we signal one and ask them to give them up, if they do, we are done, if not,
- depending on the deadline, maybe we iterate though other malleable jobs, or
- We pick a job and preempt
  I know we could write a custom scheduler, but this will be a lot more interesting if we can figure out how to make this plug-ins that any site could load in a standard PBS installation. We will need to poke at that and see what we can do.

Thoughts?

0 replies

SteVwonder · 2021-05-01T00:22:23Z

SteVwonder
May 1, 2021
Maintainer

The J/PSI team had some scoping conversations, and we decided that this use case (while very compelling and decidedly useful) is outside the scope of our initial efforts. We think some initial proof-of-concept groundwork needs to be done before this use case is ready for standardization/specification.

We don't want to lose this discussion though, so we wanted to convert this to a GitHub discussion (as opposed to an issue) to continue the conversation there. Does that work for those involved? If there is no objection by next week, we will do the conversion to a discussion.

0 replies

weallcock · 2021-05-01T16:39:08Z

weallcock
May 1, 2021
Author

I don't know the difference between an issue and a conversation on GitHub, but it seems fine to me.

0 replies

Interaction between workflow system and scheduler (malleable scheduling) #137

Uh oh!

weallcock Dec 4, 2020

Use Case Summary

Use Case Details

References

Replies: 17 comments

Uh oh!

hategan Dec 4, 2020 Maintainer

Uh oh!

hategan Dec 8, 2020 Maintainer

Uh oh!

Larofeticus Dec 8, 2020

Uh oh!

weallcock Dec 9, 2020 Author

Uh oh!

weallcock Dec 9, 2020 Author

Uh oh!

weallcock Dec 9, 2020 Author

Uh oh!

Larofeticus Dec 9, 2020

Uh oh!

weallcock Dec 9, 2020 Author

Uh oh!

hategan Dec 9, 2020 Maintainer

Uh oh!

andre-merzky Dec 15, 2020 Maintainer

Uh oh!

weallcock Dec 16, 2020 Author

Uh oh!

andre-merzky Dec 16, 2020 Maintainer

Uh oh!

weallcock Dec 17, 2020 Author

Uh oh!

andre-merzky Dec 17, 2020 Maintainer

Uh oh!

weallcock Dec 18, 2020 Author

Uh oh!

SteVwonder May 1, 2021 Maintainer

Uh oh!

weallcock May 1, 2021 Author

weallcock
Dec 4, 2020

hategan
Dec 4, 2020
Maintainer

hategan
Dec 8, 2020
Maintainer

Larofeticus
Dec 8, 2020

weallcock
Dec 9, 2020
Author

weallcock
Dec 9, 2020
Author

weallcock
Dec 9, 2020
Author

Larofeticus
Dec 9, 2020

weallcock
Dec 9, 2020
Author

hategan
Dec 9, 2020
Maintainer

andre-merzky
Dec 15, 2020
Maintainer

weallcock
Dec 16, 2020
Author

andre-merzky
Dec 16, 2020
Maintainer

weallcock
Dec 17, 2020
Author

andre-merzky
Dec 17, 2020
Maintainer

weallcock
Dec 18, 2020
Author

SteVwonder
May 1, 2021
Maintainer

weallcock
May 1, 2021
Author