Replies: 1 comment 1 reply
Been experiencing the same issue. |
Hi, I have some experience with Slurm and with Optuna + joblib launchers for HP sweeps. In my current setup, however, I would like to start the main sweeping process on the login machine and have it submit jobs to the cluster in batches.

I have a training script that works fine if I just `sbatch` it to the cluster, but when I attempt to do something like this:

(I pruned a bunch of parameters)

and then:

`bash hp_sweep.sh`

the script fails after a few seconds, and there is no way to even access the stack trace because `HYDRA_FULL_ERROR` doesn't get passed (as you can see, I tried every possible way I know).

I installed the submitit launcher with

`python -m pip install 'git+https://github.com/facebookresearch/hydra.git#egg=hydra-submitit-launcher&subdirectory=plugins/hydra_submitit_launcher'`

since the latest version is not released on PyPI. Is there any way to fix the stack trace issue?
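One avenue worth trying, assuming your installed launcher version supports the `setup` field (a list of shell commands injected into the generated sbatch script) and `srun_args`, is exporting the variable from the launcher config itself rather than from the login shell. A sketch, not a verified fix:

```yaml
# Hydra config override (sketch; field names assume a recent
# hydra-submitit-launcher that exposes `setup`)
hydra:
  launcher:
    setup:
      - export HYDRA_FULL_ERROR=1
```

Equivalently on the command line: `python train.py -m 'hydra.launcher.setup=[export HYDRA_FULL_ERROR=1]'`. This only helps if the launcher actually honors `setup`, which (per the EDIT below) may be part of the same problem.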
EDIT: I got to the core of the problem. It wasn't actually an issue in my code; `submitit` for some reason can't decide which environment to use:

`RuntimeError: Could not figure out which environment the job is running in. Known environments: slurm, local, debug.`

I tried to hack around this by specifying `_TEST_CLUSTER_` in both `setup` and `srun_args`, but this boils down to the same problem as with passing `HYDRA_FULL_ERROR`: those values get ignored. There are multiple issues regarding this specific problem, but I don't see any clear resolution. @omry
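A quick way to debug this class of error is to check which scheduler variables are actually visible inside the failing job, since submitit's environment detection relies on job-side environment variables. A minimal illustrative sketch (this is not submitit's actual detection code, and the variable names other than Slurm's standard `SLURM_JOB_ID` are assumptions):

```python
import os

def guess_submitit_env(environ=None) -> str:
    """Rough sketch of the kind of check a launcher performs when deciding
    whether a process is running inside a Slurm job, a local submitit job,
    or neither.  SLURM_JOB_ID is set by Slurm inside srun/sbatch jobs;
    SUBMITIT_LOCAL_JOB_ID is an assumed marker for submitit's local executor."""
    environ = os.environ if environ is None else environ
    if "SLURM_JOB_ID" in environ:
        return "slurm"
    if "SUBMITIT_LOCAL_JOB_ID" in environ:
        return "local"
    return "unknown"

if __name__ == "__main__":
    # Run this at the top of the launched job script to see what it inherits.
    print(guess_submitit_env())
    print({k: v for k, v in os.environ.items() if k.startswith("SLURM")})
```

If the launched job prints no `SLURM*` variables at all, the generated sbatch wrapper is not propagating the environment to the training process, which would explain both the detection failure and `HYDRA_FULL_ERROR` being ignored.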