
Update to release 24.05.7 #71


Merged
merged 30 commits on Mar 14, 2025
Changes from all commits (30 commits)
8b2a2ea
Testsuite - Improve setup fixture to use a fresh StateSaveLocation
agilmor Feb 27, 2025
78c3e60
Testsuite - Minor logging improvement
agilmor Feb 27, 2025
48131ec
Merge branch 'cherrypick-653-24.05' into 'slurm-24.05'
agilmor Feb 27, 2025
bcf07bd
Move code to a function
MarshallGarey Feb 15, 2025
3ea0f25
Add variable rc instead of reusing variable i
MarshallGarey Feb 15, 2025
0f1b914
Prevent crash - Set reservation partition as well as part_ptr
MarshallGarey Feb 15, 2025
f99fc45
Changelog for the previous three commits
MarshallGarey Feb 15, 2025
0c1b81d
Merge branch 'cherrypick-427-24.05' into 'slurm-24.05'
mcmult Mar 3, 2025
8b56775
Testsuite - Fix test_123_1 avoiding false failures
nprisbrey Mar 1, 2025
e3dc1bb
Merge branch 'cherrypick-672-24.05' into 'slurm-24.05'
agilmor Mar 11, 2025
3ebb23a
Fix powered down node set weight
bsngardner Mar 7, 2025
5a59312
Merge branch 'cherrypick-694-24.05' into 'slurm-24.05'
gaijin03 Mar 11, 2025
40605bc
Fix memory leak in _pick_restricted_cores()
Mar 12, 2025
d064ad0
Merge branch 'cherrypick-726-24.05' into 'slurm-24.05'
gaijin03 Mar 13, 2025
f7f5fc7
Fix assoc_mgr_unlock() without a previous lock
MarshallGarey Mar 12, 2025
a70d433
Make _validate_operator_internal()
MarshallGarey Mar 12, 2025
210c6a9
Fix calling assoc_mgr_lock() recursively
MarshallGarey Mar 12, 2025
97ab595
Changelog for the prior three commits
MarshallGarey Mar 12, 2025
b46400f
Merge branch 'cherrypick-731-24.05' into 'slurm-24.05'
gaijin03 Mar 13, 2025
90ea9d7
Testsuite - Improve start_slurm() extracting a backtrace if it fails
agilmor Mar 12, 2025
c656770
Merge branch 'cherrypick-737-24.05' into 'slurm-24.05'
agilmor Mar 13, 2025
234f0ab
Testsuite - Enable slurmrestd log in the python testsuite
agilmor Mar 13, 2025
fab9f31
Testsuite - Improve the auto-generated backtraces
agilmor Mar 13, 2025
6fbe7ce
Fix validate_operator() check
gaijin03 Mar 13, 2025
f2c8468
Merge branch 'cherrypick-743-24.05' into 'slurm-24.05'
MarshallGarey Mar 13, 2025
64168a1
Docs - Update REST API reference
MarshallGarey Mar 13, 2025
fb3209a
Populate NEWS for 24.05.7
mcmult Mar 13, 2025
700aef3
Update META for 24.05.7.
mcmult Mar 13, 2025
aff7d0c
Merge branch 'cherrypick-747-24.05' into 'slurm-24.05'
agilmor Mar 14, 2025
195c68c
Merge branch 'slurm-24.05' into 24.05.7.ug
itkovian Mar 14, 2025
4 changes: 2 additions & 2 deletions META
@@ -7,8 +7,8 @@
Name: slurm
Major: 24
Minor: 05
Micro: 6
Version: 24.05.6
Micro: 7
Version: 24.05.7
Release: 1

##
11 changes: 11 additions & 0 deletions NEWS
@@ -1,6 +1,17 @@
This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 24.05.7
==========================
-- Fix slurmctld crash after updating a reservation with an empty
nodelist. The crash could occur after restarting slurmctld, or if
downing/draining a node in the reservation with the REPLACE or REPLACE_DOWN
flag.
-- Fix jobs being scheduled on higher weighted powered down
nodes.
-- Fix memory leak when RestrictedCoresPerGPU is enabled.
-- Prevent slurmctld deadlock in the assoc mgr.

* Changes in Slurm 24.05.6
==========================
-- data_parser/v0.0.40 - Prevent a segfault in the slurmrestd when
2 changes: 1 addition & 1 deletion debian/changelog
@@ -1,4 +1,4 @@
slurm-smd (24.05.6-1) UNRELEASED; urgency=medium
slurm-smd (24.05.7-1) UNRELEASED; urgency=medium

* Initial release.

2 changes: 1 addition & 1 deletion doc/html/rest_api.shtml
@@ -3,7 +3,7 @@
<div class="app-desc">API to access and control Slurm</div>
<div class="app-desc">More information: <a href="https://www.schedmd.com/">https://www.schedmd.com/</a></div>
<div class="app-desc">Contact Info: <a href="[email protected]">[email protected]</a></div>
<div class="app-desc">Version: Slurm-24.05.6&amp;openapi/slurmdbd&amp;openapi/slurmctld</div>
<div class="app-desc">Version: Slurm-24.05.7&amp;openapi/slurmdbd&amp;openapi/slurmctld</div>
<div class="app-desc">BasePath:</div>
<div class="license-info">Apache 2.0</div>
<div class="license-url">https://www.apache.org/licenses/LICENSE-2.0.html</div>
2 changes: 1 addition & 1 deletion slurm.spec
@@ -1,5 +1,5 @@
Name: slurm
Version: 24.05.6
Version: 24.05.7
%define rel 1
Release: %{rel}.%{gittag}%{?dist}%{?gpu}.ug
Summary: Slurm Workload Manager
3 changes: 2 additions & 1 deletion src/common/assoc_mgr.c
@@ -3328,7 +3328,8 @@ extern bool assoc_mgr_is_user_acct_coord(void *db_conn,
if (!is_locked)
assoc_mgr_lock(&locks);
if (!assoc_mgr_coord_list || !list_count(assoc_mgr_coord_list)) {
assoc_mgr_unlock(&locks);
if (!is_locked)
assoc_mgr_unlock(&locks);
return false;
}

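The is_locked argument in the assoc_mgr.c hunk means the caller already holds the assoc_mgr locks, so the early-return path must only unlock when the function took the lock itself. A minimal stand-alone sketch of that contract, with a plain pthread mutex standing in for the assoc_mgr locks (illustration only, not Slurm code):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t assoc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Mirrors the fixed pattern: only release a lock this function acquired. */
static bool is_user_coord(bool is_locked, bool coord_list_empty)
{
	if (!is_locked)
		pthread_mutex_lock(&assoc_lock);

	if (coord_list_empty) {
		/* The pre-fix code unlocked unconditionally here. */
		if (!is_locked)
			pthread_mutex_unlock(&assoc_lock);
		return false;
	}

	/* ... coordinator lookup would happen here ... */

	if (!is_locked)
		pthread_mutex_unlock(&assoc_lock);
	return true;
}

int main(void)
{
	pthread_mutex_lock(&assoc_lock);           /* caller holds the lock */
	printf("%d\n", is_user_coord(true, true)); /* must not drop it */
	pthread_mutex_unlock(&assoc_lock);         /* still valid afterwards */
	return 0;
}

With the old behavior, the final unlock in main() would act on a lock already released inside the call, which is the "assoc_mgr_unlock() without a previous lock" problem this commit series fixes.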
6 changes: 6 additions & 0 deletions src/plugins/select/cons_tres/gres_sock_list.c
@@ -677,6 +677,12 @@ static void _pick_restricted_cores(bitstr_t *core_bitmap,
gres_js->res_gpu_cores = xcalloc(gres_js->res_array_size,
sizeof(bitstr_t *));
}
/*
* This function can be called multiple times for the same node_i while
* a job is pending. Free any existing gres_js->res_gpu_cores[node_i]
* first.
*/
FREE_NULL_BITMAP(gres_js->res_gpu_cores[node_i]);
gres_js->res_gpu_cores[node_i] = bit_alloc(bit_size(core_bitmap));

for (int i = 0; i < gres_ns->topo_cnt; i++) {
2 changes: 1 addition & 1 deletion src/slurmctld/job_mgr.c
@@ -4742,7 +4742,7 @@ static void _apply_signal_jobs_filter(job_record_t *job_ptr,

/* Verify that the user can kill the requested job */
if ((job_ptr->user_id != auth_uid) &&
!validate_operator(auth_uid) &&
!validate_operator_locked(auth_uid) &&
!assoc_mgr_is_user_acct_coord(acct_db_conn, auth_uid,
job_ptr->account, true)) {
slurm_selected_step_t *use_id;
2 changes: 2 additions & 0 deletions src/slurmctld/node_scheduler.c
@@ -4147,6 +4147,8 @@ static int _build_node_list(job_record_t *job_ptr,
node_set_ptr[node_set_inx].node_cnt = power_cnt;
node_set_ptr[i].node_cnt -= power_cnt;
node_set_ptr[node_set_inx].flags = NODE_SET_POWER_DN;
node_set_ptr[node_set_inx].node_weight =
node_set_ptr[i].node_weight;
node_set_ptr[node_set_inx].features =
xstrdup(node_set_ptr[i].features);
node_set_ptr[node_set_inx].feature_bits =
29 changes: 24 additions & 5 deletions src/slurmctld/proc_req.c
@@ -575,18 +575,37 @@ extern bool validate_super_user(uid_t uid)
* IN uid - user to validate
* RET true if permitted to run, false otherwise
*/
extern bool validate_operator(uid_t uid)
static bool _validate_operator_internal(uid_t uid, bool locked)
{
slurmdb_admin_level_t level;

#ifndef NDEBUG
if (drop_priv)
return false;
#endif
if ((uid == 0) || (uid == slurm_conf.slurm_user_id) ||
assoc_mgr_get_admin_level(acct_db_conn, uid) >=
SLURMDB_ADMIN_OPERATOR)

if ((uid == 0) || (uid == slurm_conf.slurm_user_id))
return true;

if (locked)
level = assoc_mgr_get_admin_level_locked(acct_db_conn, uid);
else
return false;
level = assoc_mgr_get_admin_level(acct_db_conn, uid);

if (level >= SLURMDB_ADMIN_OPERATOR)
return true;

return false;
}

extern bool validate_operator(uid_t uid)
{
return _validate_operator_internal(uid, false);
}

extern bool validate_operator_locked(uid_t uid)
{
return _validate_operator_internal(uid, true);
}

extern bool validate_operator_user_rec(slurmdb_user_rec_t *user)
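The validate_operator()/validate_operator_locked() split exists because the assoc_mgr locks are not recursive: a code path that already holds them, like the job-signal filter changed in job_mgr.c above, must not go through the variant that locks again. A hedged stand-alone illustration, with a pthread mutex standing in for the assoc_mgr lock (trylock is used so the demo reports the problem instead of hanging):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t assoc_lock = PTHREAD_MUTEX_INITIALIZER;

int main(void)
{
	pthread_mutex_lock(&assoc_lock);  /* this code path already holds it */

	/* Roughly what calling the unlocked lookup from such a path did: */
	if (pthread_mutex_trylock(&assoc_lock) != 0)
		printf("re-locking from the same thread blocks: deadlock\n");
	else
		pthread_mutex_unlock(&assoc_lock);

	/*
	 * validate_operator_locked() instead performs the admin-level lookup
	 * without re-taking the lock, so the recursion never happens.
	 */
	pthread_mutex_unlock(&assoc_lock);
	return 0;
}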
37 changes: 23 additions & 14 deletions src/slurmctld/reservation.c
@@ -3343,6 +3343,19 @@ static int _validate_reservation_access_update(void *x, void *y)
return 0;
}

static int _validate_and_set_partition(part_record_t **part_ptr,
char **partition)
{
if (*part_ptr == NULL) {
*part_ptr = default_part_loc;
if (*part_ptr == NULL)
return ESLURM_DEFAULT_PARTITION_NOT_SET;
}
xfree(*partition);
*partition = xstrdup((*part_ptr)->name);
return SLURM_SUCCESS;
}

/* Update an exiting resource reservation */
extern int update_resv(resv_desc_msg_t *resv_desc_ptr, char **err_msg)
{
@@ -5016,7 +5029,7 @@ extern int validate_job_resv(job_record_t *job_ptr)
static int _resize_resv(slurmctld_resv_t *resv_ptr, uint32_t node_cnt)
{
bitstr_t *tmp2_bitmap = NULL;
int delta_node_cnt, i;
int delta_node_cnt, i, rc;
resv_desc_msg_t resv_desc;
resv_select_t resv_select = { 0 };

@@ -5061,10 +5074,9 @@ static int _resize_resv(slurmctld_resv_t *resv_ptr, uint32_t node_cnt)
}

/* Ensure if partition exists in reservation otherwise use default */
if (!resv_ptr->part_ptr) {
resv_ptr->part_ptr = default_part_loc;
if (!resv_ptr->part_ptr)
return ESLURM_DEFAULT_PARTITION_NOT_SET;
if ((rc = _validate_and_set_partition(&resv_ptr->part_ptr,
&resv_ptr->partition))) {
return rc;
}

/* Must increase node count. Make this look like new request so
@@ -5086,10 +5098,10 @@ static int _resize_resv(slurmctld_resv_t *resv_ptr, uint32_t node_cnt)
bit_and_not(resv_select.node_bitmap, resv_ptr->node_bitmap);
}

i = _select_nodes(&resv_desc, &resv_ptr->part_ptr, &resv_select);
rc = _select_nodes(&resv_desc, &resv_ptr->part_ptr, &resv_select);
xfree(resv_desc.node_list);
xfree(resv_desc.partition);
if (i == SLURM_SUCCESS) {
if (rc == SLURM_SUCCESS) {
job_record_t *job_ptr = resv_desc.job_ptr;
/*
* If the reservation was 0 node count before (ANY_NODES) this
@@ -5121,7 +5133,7 @@ static int _resize_resv(slurmctld_resv_t *resv_ptr, uint32_t node_cnt)
}
job_record_delete(resv_desc.job_ptr);

return i;
return rc;
}

static int _feature_has_node_cnt(void *x, void *key)
@@ -5317,12 +5329,9 @@ static int _select_nodes(resv_desc_msg_t *resv_desc_ptr,
list_itr_t *itr;
job_record_t *job_ptr;

if (*part_ptr == NULL) {
*part_ptr = default_part_loc;
if (*part_ptr == NULL)
return ESLURM_DEFAULT_PARTITION_NOT_SET;
xfree(resv_desc_ptr->partition); /* should be no-op */
resv_desc_ptr->partition = xstrdup((*part_ptr)->name);
if ((rc = _validate_and_set_partition(part_ptr,
&resv_desc_ptr->partition))) {
return rc;
}

xassert(resv_desc_ptr->job_ptr);
1 change: 1 addition & 0 deletions src/slurmctld/slurmctld.h
@@ -2052,6 +2052,7 @@ extern bool validate_super_user(uid_t uid);
* RET true if permitted to run, false otherwise
*/
extern bool validate_operator(uid_t uid);
extern bool validate_operator_locked(uid_t uid);

/*
* validate_operator_user_rec - validate that the user is authorized at the
23 changes: 23 additions & 0 deletions testsuite/python/conftest.py
@@ -176,6 +176,21 @@ def module_setup(request, tmp_path_factory):
)
atf.stop_slurm(quiet=True)

# Cleanup StateSaveLocation for auto-config
if atf.properties["auto-config"]:
statesaveloc = atf.get_config_parameter(
"StateSaveLocation", live=False, quiet=True
)
if os.path.exists(statesaveloc):
if os.path.exists(statesaveloc + name):
logging.warning(
f"Backup for StateSaveLocation already exists ({statesaveloc+name}). Removing it."
)
atf.run_command(f"rm -rf {statesaveloc+name}", user="root", quiet=True)
atf.run_command(
f"mv {statesaveloc} {statesaveloc+name}", user="root", quiet=True
)

yield

# Return to the folder from which pytest was executed
@@ -184,6 +199,14 @@ def module_setup(request, tmp_path_factory):
# Teardown
module_teardown()

# Restore StateSaveLocation for auto-config
if atf.properties["auto-config"]:
atf.run_command(f"rm -rf {statesaveloc}", user="root", quiet=True)
if os.path.exists(statesaveloc + name):
atf.run_command(
f"mv {statesaveloc+name} {statesaveloc}", user="root", quiet=True
)


def module_teardown():
failures = []
44 changes: 38 additions & 6 deletions testsuite/python/lib/atf.py
@@ -535,6 +535,8 @@ def start_slurmctld(clean=False, quiet=False):
if not properties["auto-config"]:
require_auto_config("wants to start slurmctld")

logging.debug("Starting slurmctld...")

if not is_slurmctld_running(quiet=quiet):
# Start slurmctld
command = f"{properties['slurm-sbin-dir']}/slurmctld"
@@ -550,7 +552,19 @@ def start_slurmctld(clean=False, quiet=False):
if not repeat_command_until(
"scontrol ping", lambda results: re.search(r"is UP", results["stdout"])
):
pytest.fail(f"Slurmctld is not running")
logging.warning(
"scontrol ping is not responding, trying to get slurmctld backtrace..."
)
pids = pids_from_exe(f"{properties['slurm-sbin-dir']}/slurmctld")
if not pids:
logging.warning("process slurmctld not found")
for pid in pids:
run_command(
f'sudo gdb -p {pid} -ex "set debuginfod enabled on" -ex "set pagination off" -ex "set confirm off" -ex "set print pretty on" -ex "set max-value-size unlimited" -ex "set print array-indexes on" -ex "set print array off" -ex "thread apply all bt full" -ex "quit"'
)
pytest.fail("Slurmctld is not running")
else:
logging.debug("Slurmctld started successfully")


def start_slurmdbd(clean=False, quiet=False):
@@ -591,7 +605,17 @@ def start_slurmdbd(clean=False, quiet=False):
if not repeat_command_until(
"sacctmgr show cluster", lambda results: results["exit_code"] == 0
):
pytest.fail(f"Slurmdbd is not running")
logging.warning(
"sacctmgr show cluster is not responding, trying to get slurmdbd backtrace..."
)
pids = pids_from_exe(f"{properties['slurm-sbin-dir']}/slurmdbd")
if not pids:
logging.warning("process slurmdbd not found")
for pid in pids:
run_command(
f'sudo gdb -p {pid} -ex "set debuginfod enabled on" -ex "set pagination off" -ex "set confirm off" -ex "set print pretty on" -ex "set max-value-size unlimited" -ex "set print array-indexes on" -ex "set print array off" -ex "thread apply all bt full" -ex "quit"'
)
pytest.fail("Slurmdbd is not running")
else:
logging.debug("Slurmdbd started successfully")

@@ -813,7 +837,7 @@ def stop_slurm(fatal=True, quiet=False):
logging.warning("Getting the bt of the still running slurmctld")
for pid in pids:
run_command(
f'sudo gdb -p {pid} -ex "set debuginfod enabled on" -ex "set pagination off" -ex "set confirm off" -ex "thread apply all bt" -ex "quit"'
f'sudo gdb -p {pid} -ex "set debuginfod enabled on" -ex "set pagination off" -ex "set confirm off" -ex "set print pretty on" -ex "set max-value-size unlimited" -ex "set print array-indexes on" -ex "set print array off" -ex "thread apply all bt full" -ex "quit"'
)

# Build list of slurmds
@@ -844,7 +868,7 @@ def stop_slurm(fatal=True, quiet=False):
failures.append(f"Some slurmds are still running ({pids})")
for pid in pids:
run_command(
f'sudo gdb -p {pid} -ex "set debuginfod enabled on" -ex "set pagination off" -ex "set confirm off" -ex "thread apply all bt" -ex "quit"'
f'sudo gdb -p {pid} -ex "set debuginfod enabled on" -ex "set pagination off" -ex "set confirm off" -ex "set print pretty on" -ex "set max-value-size unlimited" -ex "set print array-indexes on" -ex "set print array off" -ex "thread apply all bt full" -ex "quit"'
)
run_command(f"pgrep -f {properties['slurm-sbin-dir']}/slurmd -a", quiet=quiet)

@@ -855,6 +879,7 @@ def stop_slurm(fatal=True, quiet=False):
properties["slurmrestd"].wait(timeout=60)
except:
properties["slurmrestd"].kill()
properties["slurmrestd_log"].close()

if failures:
if fatal:
@@ -1869,6 +1894,13 @@ def start_slurmrestd():
port = None
attempts = 0

log_dir = os.path.dirname(
get_config_parameter("SlurmctldLogFile", live=False, quiet=True)
)
properties["slurmrestd_log"] = open(f"{log_dir}/slurmrestd.log", "w")
if not properties["slurmrestd_log"]:
pytest.fail(f"Unable to open slurmrestd log: {log_dir}/slurmrestd.log")

while not port and attempts < 15:
port = get_open_port()
attempts += 1
@@ -1888,8 +1920,8 @@ def start_slurmrestd():
properties["slurmrestd"] = subprocess.Popen(
args,
stdin=subprocess.DEVNULL,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
stdout=properties["slurmrestd_log"],
stderr=properties["slurmrestd_log"],
)
s = None

3 changes: 2 additions & 1 deletion testsuite/python/tests/test_123_1.py
@@ -39,7 +39,8 @@ def create_resv(request, node_list):
if re.search(
rf"(?:Nodes=)({node2})", atf.run_command_output("scontrol show res resv1")
):
node1, node2 = node_list.reverse()
node_list.reverse()
node1, node2 = node_list

return [node1, node2, flag]

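The test_123_1.py change avoids a Python gotcha rather than a Slurm one: list.reverse() reverses in place and returns None, so the original one-liner could never hand back two node names. A small stand-alone demonstration (node names are illustrative):

nodes = ["node1", "node2"]

# Pre-fix pattern: reverse() returns None, so the unpacking raises.
try:
    node1, node2 = nodes.reverse()
except TypeError as err:
    print(f"unpacking failed: {err}")

# Fixed pattern: reverse in place first, then unpack the list.
nodes = ["node1", "node2"]
nodes.reverse()
node1, node2 = nodes
print(node1, node2)  # node2 node1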