Replies: 17 comments
-
I think it would be relatively straightforward to slap some "malleable" attribute on a job specification that a relevant adapter could communicate to a queuing system. And, in terms of a pilot job system, it should not be too difficult to have it respond to changes in job shapes. One thing I'm having difficulty with is the level of support that this will have across LRMs. The other thing I am having difficulty with is the level of interest users will have in this. For AWS, GCE, etc., preemptible instances come with a reduction in price. If there isn't some equivalent incentive here, we can implement it as much as we want, but I suspect users will pick the solution that gives them the best chance of finishing their tasks faster, which I'm guessing is not to select the "malleable" option. To be clear: I like it; I think all LRMs and pilot job systems should support it, and I think users should somehow be incentivized to enable it. How far are we from that, though, aside from this API?
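As a rough illustration of the "attribute on a job specification" idea: assuming an API along the lines of what the psij-python library later exposed, the hint could ride along in a custom attribute dictionary. The `malleable` key, its semantics, and the executable path are hypothetical; nothing here is part of any current specification.

```python
# Hypothetical sketch: marking a job as malleable via a custom attribute.
# The "malleable" key and how an adapter would forward it to the LRM are
# invented here for illustration only.
from psij import Job, JobAttributes, JobExecutor, JobSpec, ResourceSpecV1

spec = JobSpec(
    executable="/path/to/pilot_agent",          # placeholder pilot/workflow agent
    resources=ResourceSpecV1(node_count=100),
    attributes=JobAttributes(
        custom_attributes={"malleable": True}   # adapter-specific hint, not standardized
    ),
)

executor = JobExecutor.get_instance("slurm")
job = Job(spec)
executor.submit(job)
```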
-
So @yadudoc points out that a rather useful thing that Parsl could adopt immediately would be if the client could mark certain nodes from a job as idle so that the LRM can reclaim those nodes. This can help by reducing the number of jobs that a pilot job system needs to submit to the LRM and improve utilization. The difficulty is that the interface between the LRM and the job API would have to be a bit more complex. For starters, it would require that the LRM allow some means to retrieve a set of handles to the nodes in a job, as well as provide an interface for communicating desired changes in job shape (e.g., remove list_of_nodes from the job and possibly add n nodes to it). I think it might be useful to work on figuring out the details of such an interface.
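One possible shape for such an interface, purely as a sketch: the node-handle and resize calls below do not exist in any current API, and all names are invented to make the idea concrete.

```python
# Hypothetical interface sketch only: neither the node-handle nor the
# resize call exists in PSI/J or any LRM adapter today.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeHandle:
    """Opaque handle to a node currently allocated to a job."""
    name: str            # e.g., the hostname as known to the LRM
    job_native_id: str   # the LRM job this node belongs to


class MalleableJob:
    def nodes(self) -> List[NodeHandle]:
        """Return handles to the nodes currently allocated to this job."""
        raise NotImplementedError

    def resize(self,
               release: Optional[List[NodeHandle]] = None,
               add_count: int = 0) -> None:
        """Ask the LRM to remove `release` from the job and, if supported,
        add `add_count` new nodes. May raise if the LRM cannot comply."""
        raise NotImplementedError
```

Whether the LRM itself or a pilot agent ends up satisfying a resize request is exactly the kind of detail such an interface would have to pin down.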
-
Slurm can remove nodes from a running job using the scontrol command; this functionality exists today. I don't expect Slurm to ever have the ability to increase the size of a running job or merge two jobs into a single larger one. Features in this area are about optimizing resource utilization, not about whether a user's computation succeeds or fails. That makes optional support by executors plausible: "I need 100 nodes for the first half, and then 50 nodes for the second half" can be put into the interface, and the outcome would either be "Yes, this LRM supports it, you successfully use 25% less allocation." or "This LRM can't reclaim nodes; your task still completes, but you paid for 25% of the allocation without using it."
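For concreteness, shrinking a running Slurm job from a user-owned process looks roughly like the sketch below; the exact scontrol invocation and any follow-up resize script depend on the Slurm version, so treat this as an assumption to verify against site documentation rather than a recipe.

```python
# Sketch: asking Slurm to shrink a running job from within a pilot/workflow
# process. Slurm's documented shrink path goes through "scontrol update" on
# the job; exact behavior (and the generated resize script) varies by
# version, so verify against your site's documentation.
import os
import subprocess

job_id = os.environ["SLURM_JOB_ID"]
new_node_count = 50  # keep 50 of the originally allocated nodes

subprocess.run(
    ["scontrol", "update", f"JobId={job_id}", f"NumNodes={new_node_count}"],
    check=True,
)
# After this, the job passes through a RESIZING state and the released
# nodes become idle and schedulable for other jobs.
```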
-
PBS also supports the ability for a job to release nodes early (pbs_release_nodes), but that is up to the job. Like Slurm, it does not currently have a way to add nodes to a job. There is also no way for the scheduler and the job to negotiate. Getting everyone to agree on a way to do that would be hard, but useful. As I noted above, I might only need 10 nodes out of a job's 100 nodes, and if it were running a batch of 1-node jobs, it might be able to just release 10 nodes and keep running. As an idea (not suggesting it is the way to go), you can send an integer along with SIGUSR1. You could do something like send SIGUSR1 and the number of nodes you need. If they release the nodes you need within some deadline, you got what you need and everyone lives happily ever after; if not, you preempt the entire job.
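A minimal sketch of the job-side reaction under PBS, assuming the pilot tracks which of its nodes are idle and that the request for N nodes has already arrived by some means (signal, file, socket); the helper function and the surrounding logic are invented for illustration, while pbs_release_nodes is the actual PBS command mentioned above.

```python
# Sketch of the job-side reaction to a "give me N nodes" request under PBS.
# How the request arrives, and how the pilot decides which nodes are idle,
# is left abstract here.
import os
import subprocess
from typing import List


def release_idle_nodes(idle_nodes: List[str], needed: int) -> bool:
    """Release up to `needed` idle nodes back to the scheduler.

    Returns True if the request could be satisfied, False if the pilot
    does not have enough idle nodes (in which case the scheduler may
    preempt the whole job once its deadline expires).
    """
    if len(idle_nodes) < needed:
        return False
    to_release = idle_nodes[:needed]
    job_id = os.environ["PBS_JOBID"]
    subprocess.run(
        ["pbs_release_nodes", "-j", job_id, *to_release],
        check=True,
    )
    return True
```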
-
@Larofeticus Are you suggesting that the job tells the scheduler up front that it intends to release nodes? I just wasn't clear what you were saying in your 2nd paragraph.
-
@hategan The question of user interest is not really at issue here. That is policy and will be decided by the facilities in order to get the behavior they want, not unlike the way a business prices its products. Currently, at the ALCF, we allow you to run in backfill even if your allocation is negative: you only run if you fit in a backfill window (the nodes would otherwise be idle), so you get extra cycles and we get better utilization. For the preemption we are planning, we have not decided what the incentives will be yet, but we recognize that we will need something. Having this functionality would mean we could support on-demand jobs with less impact on utilization and on the preempted jobs.
-
@weallcock No, it's not something done in advance. A job is running; a process owned by the user can run an 'scontrol' command to change the job's node list to a subset of itself, then the job goes through a RESIZING state and the removed nodes become idle, ready to be allocated elsewhere. The second paragraph is about the J/PSI interface including features that are not supported by all executors. There is a difference between a missing executor capability that makes a job impossible to complete and one that merely makes it less billing-efficient.
-
Got it. That makes sense. Thanks for the clarification.
-
This adversarial relationship between facilities and their users worries me a bit.
-
I support the proposal for malleable allocations: we do see several use cases where this could optimize utilization and reduce the cost of experiments. In order to add this capability to the API, though, we would need to understand how the different LRMs implement that capability (if they do at all); otherwise it is difficult to design an interface that maps onto what the LRMs can actually support.
-
Largely, I think no one has really done this yet, so I think we try something and iterate. Here is what I have been thinking: a job sets an attribute in its job description that says it is malleable. What that implies is that it supports some mechanism, which we need to define, to interact with the scheduler. One good candidate for this would be workflow jobs that hold a bunch of nodes but are running many small tasks. I (the scheduler) need 10 nodes, and there are not 10 available. A running workflow job is marked malleable and has 100 nodes that fit my needs. I "tell" it I need 10 nodes. If it does not or cannot release them (via pbs_release_nodes or whatever the Slurm equivalent is) within x seconds, I preempt the entire job. Hopefully, it releases just what I need.

To answer some of Andre's questions: I don't see a situation where this would apply to QUEUED; certainly ACTIVE. I am not sure about SUSPENDED, since we never use that. I don't think queues matter here, at least in my example: we are talking to a running job, and I (the scheduler) have already decided I want to use those nodes for this job. I think we have to tell the job how many nodes of what type. Since, at least right now, we can't give nodes back, time doesn't matter. The application giving up the nodes can give up any set it wants as long as they meet the requirements; it does not need to know or care what will be done with them. Thoughts?
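A sketch of the scheduler-side "ask, wait, then preempt" negotiation described above; the three callables are placeholders for whatever the LRM actually provides, so this shows the shape of the protocol rather than an implementation.

```python
# Scheduler-side sketch of the negotiation: request N nodes, give the
# malleable job x seconds to release them, and fall back to preempting
# the whole job. All callables are placeholders.
import time
from typing import Callable, List


def reclaim_nodes(
    request_resize: Callable[[int], None],     # e.g., SIGUSR1 + node count, or an API call
    released_nodes: Callable[[], List[str]],   # nodes the LRM now sees as idle
    preempt_job: Callable[[], None],           # kill the whole job as a last resort
    needed: int,
    deadline_s: float = 60.0,
    poll_s: float = 5.0,
) -> bool:
    """Return True if the job released enough nodes, False if it was preempted."""
    request_resize(needed)
    waited = 0.0
    while waited < deadline_s:
        if len(released_nodes()) >= needed:
            return True
        time.sleep(poll_s)
        waited += poll_s
    preempt_job()
    return False
```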
-
Ah, interesting - I was looking at this from the job (or user) perspective: while running a workflow in a malleable job allocation, I happen to obtain new tasks which require more nodes than the allocation has, so I request more nodes from the scheduler to add to the job -- and, inversely, when tasks are tapering off (temporarily or during draining), the job could release nodes back to the LRM.
If the job is, for example, currently running a set of large MPI tasks, it may just not have the ability to release 10 nodes at that point. In that case, the job would need some way to decline or defer the request rather than simply being preempted.
-
First, addressing your specific questions above: releasing nodes as you taper off, or, as in the example given above, needing many nodes for some initial preparatory computation up front but fewer for the rest of the run, is exactly why the pbs_release_nodes functionality was added.

You bring up an interesting point about adding nodes. I see the use case and I think we should design that capability into whatever interface we come up with, but it is a much more difficult one. First, since most clusters run at very high utilization rates, the odds of the scheduler having the nodes available are low, and granting them essentially overrides the scheduling policy (you are basically "cutting in line"). Also, I know PBS can't do it, and it was mentioned above that Slurm does not support that either. The closest we have come to doing something along those lines was via an HTCondor glide-in. To increase our utilization we experimented with the idea of automatically running HTCondor glide-ins as the last step of the post-job script, but as far as the scheduler was concerned the nodes were idle. When the scheduler started the next job, the first step of the prejob script was a command to shut the glide-in down on those nodes.
-
Thanks for that scenario, I see your point.
Ack. Still, the main problem with respect to the API defined here remains LRM support: adding a capability that most LRMs cannot honor is of limited value. Further, as you stated, one needs to define a mechanism to notify a job of resize requests. A signal would not work, I guess, as it can't carry information about the request size, so one would need to augment or replace that with some other mechanism. Even though that mechanism is likely out of scope of J/PSI (at least at this stage), I am curious if you have thoughts or preliminary ideas about that point?
-
Actually, a signal can carry an integer along with it, so we could define that to be the number of nodes requested, which would work fine until we want to also give them nodes. That being said, the workflow apps are going to need to be modified to support this, so we could just define an API. They would have to listen on a port, or have a REST interface or something; we would have to figure out security so another user can't mess with them ("give me all your nodes"); and we would have to come up with a wire protocol. Responding to a signal is pretty standard, but the API would give us more flexibility. I don't recognize the LRM term, but I believe that means things like PBS and Slurm. So, I am just spitballing here, but...
Thoughts?
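Purely as a strawman for the wire-protocol question, a resize request could be as small as a JSON message over a local socket; every field name, the authentication scheme, and the transport below are invented for illustration, and a real design would need proper authentication so that only the scheduler can ask for nodes.

```python
# Illustrative wire-protocol sketch: one JSON request, one JSON reply,
# no framing, retries, or real authentication.
import json
import socket

RESIZE_REQUEST = {
    "type": "resize_request",
    "auth_token": "per-job secret issued at submission",  # placeholder
    "release_node_count": 10,    # how many nodes the scheduler wants back
    "node_type": "cpu",          # what kind of nodes are wanted
    "deadline_seconds": 60,      # how long before the scheduler preempts
}


def send_request(host: str, port: int, msg: dict) -> dict:
    """Send one request to the workflow job's listener and read one JSON reply."""
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall(json.dumps(msg).encode() + b"\n")
        reply = s.makefile().readline()
    return json.loads(reply)
```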
-
The J/PSI team had some scoping conversations, and we decided that this use case (while very compelling and decidedly useful) is outside the scope of our initial efforts. We think some initial proof-of-concept groundwork needs to be done before this use case is ready for standardization/specification. We don't want to lose this discussion though, so we wanted to convert this to a GitHub discussion (as opposed to an issue) to continue the conversation there. Does that work for those involved? If there is no objection by next week, we will do the conversion to a discussion.
-
I don't know the difference between an issue and a conversation on GitHub, but it seems fine to me.
-
Use Case Summary
It would be helpful if there were a mechanism by which a scheduler could negotiate an increase or decrease in the number of nodes available to a job. This is essentially a request for what is sometimes referred to as malleable scheduling. This would require changes to both the scheduler and the workflow system (or any other malleable job), as well as some standard way of negotiating the changes.
Use Case Details
To support "on-demand" computing we intend to designate a subset of our resources to be eligible for preemption to make room for "on-demand" or "deadline driven" jobs. An example would be a light source that wants to take a dataset, send it to our facility for analysis, and wants the results back ASAP, certainly not after it sits in a queue for hours. We have chosen to evaluate the preemption path to try and minimize "wasted" cycles, as the nodes might otherwise sit idle for a significant fraction of the time if we just dedicated nodes to them (how much time is wasted obviously depends on the use case). To minimize the impact of preemption, it would be ideal if we preferentially scheduled small, short jobs on the resources designated for preemption so we could kill the minimum number of jobs to get the resources we need (we could kill ten one node jobs, to get 10 nodes rather than a 100 node job) and lose a minimum amount of computation (if the jobs are only minutes in length we lose at most minutes of computation) . However, in many cases, those small, short jobs are not visible to the scheduler because the workflow system has submitted a job to "provision" a larger set of resources for a longer period of time and then runs many smaller jobs within those resources (i.e. a Condor "glide-in"). This means that today when preempting, the scheduler would be forced to kill the entire job, even if it needed only a small fraction of the nodes it was using.
A use case involving an increase in the number of nodes is also potentially valuable. As a scheduler, if I know that I have a "drain window" (nodes that I expect to be empty for a period of time because I have no job in the queue that can fit until another job ends), I could offer those nodes to the workflow system to use temporarily.
If there were a mechanism that allowed a job (the workflow system) to designate itself as "malleable" to the scheduler, and then a mechanism for negotiating changes to the number of nodes available to the job, we could potentially eliminate a lot of wasted cycles.