This code accompanies the paper Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration. For a brief summary of the paper, see the paper's website.
The code is built off of ExPLORe. Our diffusion policy code is adapted from IDQL. Our VAE pre-training code is adapted from Seohong's implementation of OPAL used in HILP.
Before setting up the environment, make sure that MuJoCo and the dependencies for mujoco-py are installed (https://github.com/openai/mujoco-py). Then run the create_env.sh script, which creates the conda environment, clones the code needed to run the HILP baseline, and downloads the pretrained checkpoints.
Pretrained checkpoints for all environments are downloaded by create_env.sh. Below are the commands used to generate the checkpoints.
python run_opal.py --env_name=antmaze-large-diverse-v2 --seed=1 --vision=False
Replace the env_name with antmaze-large-diverse-v2-2, antmaze-large-diverse-v2-3, or antmaze-large-diverse-v2-4 to test different goals on AntMaze Large. For AntMaze Medium, use antmaze-medium-diverse-v2(-#), and for AntMaze Ultra, use antmaze-ultra-diverse-v0(-#).
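To pretrain on all four AntMaze Large goal variants in one go, the goal suffixes can be looped over. A minimal sketch (the loop prints each command rather than launching it; replace printf with eval to actually run them):

```shell
#!/usr/bin/env bash
# Build the run_opal.py pretraining command for each AntMaze Large goal variant.
cmds=()
for suffix in "" "-2" "-3" "-4"; do
  cmds+=("python run_opal.py --env_name=antmaze-large-diverse-v2${suffix} --seed=1 --vision=False")
done
# Dry run: print the commands. Swap printf for `eval "$c"` (or submit to your
# job scheduler) to launch the pretraining runs.
printf '%s\n' "${cmds[@]}"
```

The same pattern applies to the Medium and Ultra mazes by changing the base env_name.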
python run_opal.py --env_name=kitchen-mixed-v0 --seed=1
Replace the env_name with kitchen-partial-v0 or kitchen-complete-v0 to test the other tasks.
python run_opal.py --env_name=antmaze-large-diverse-v2 --seed=1 --vision=True
Replace env_name with antmaze-large-diverse-v2-2, antmaze-large-diverse-v2-3, or antmaze-large-diverse-v2-4 to test the other goals.
python run_opal.py --env_name=cube-single-play-singletask-v0 --seed=1 --config.kl_coef=0.2
python run_opal.py --env_name=cube-double-play-singletask-v0 --seed=1 --config.kl_coef=0.2
python run_opal.py --env_name=scene-play-singletask-v0 --seed=1 --config.kl_coef=0.2
python run_opal.py --env_name=antsoccer-arena-navigate-singletask-task1-v0 --seed=1 --config.kl_coef=0.1
python run_opal.py --env_name=humanoidmaze-medium-navigate-singletask-v0 --seed=1 --config.kl_coef=0.1
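Since the OGBench pretraining runs above differ only in env_name and kl_coef (0.2 for the manipulation tasks, 0.1 for the locomotion tasks), they can be expressed as a single loop over an env-to-coefficient map. A minimal sketch that prints the commands (swap echo for eval to launch):

```shell
#!/usr/bin/env bash
# Map each OGBench environment to the KL coefficient used in the paper's commands.
declare -A KL=(
  [cube-single-play-singletask-v0]=0.2
  [cube-double-play-singletask-v0]=0.2
  [scene-play-singletask-v0]=0.2
  [antsoccer-arena-navigate-singletask-task1-v0]=0.1
  [humanoidmaze-medium-navigate-singletask-v0]=0.1
)
for env in "${!KL[@]}"; do
  # Dry run: echo each pretraining command; replace echo with eval to run it.
  echo "python run_opal.py --env_name=${env} --seed=1 --config.kl_coef=${KL[$env]}"
done
```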
python train_finetuning_supe.py --config.backup_entropy=False --env_name=antmaze-large-diverse-v2 --config.num_min_qs=1 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --seed=1
python train_finetuning_supe.py --config.backup_entropy=False --config.num_min_qs=2 --offline_relabel_type=pred --use_rnd_offline=True --use_rnd_online=True --env_name=kitchen-mixed-v0 --seed=1 --config.init_temperature=1.0
python3 train_finetuning_supe_pixels.py --config.backup_entropy=False --config.num_min_qs=2 --config.num_qs=10 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --seed=1 --env_name=antmaze-large-diverse-v2 --use_icvf=True
python3 train_finetuning_supe.py --config.backup_entropy=False --env_name=cube-single-play-singletask-task1-v0 --config.num_min_qs=2 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --seed=1 --opal_config.kl_coef=0.2 --config.discount=0.995
python3 train_finetuning_supe.py --config.backup_entropy=False --env_name=cube-double-play-singletask-task1-v0 --config.num_min_qs=2 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --seed=1 --opal_config.kl_coef=0.2 --config.discount=0.995
python3 train_finetuning_supe.py --config.backup_entropy=False --env_name=scene-play-singletask-task1-v0 --config.num_min_qs=2 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --seed=1 --opal_config.kl_coef=0.2 --config.discount=0.995
python3 train_finetuning_supe.py --config.backup_entropy=False --env_name=antsoccer-arena-navigate-singletask-task1-v0 --config.num_min_qs=1 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --seed=9 --config.discount=0.995
python3 train_finetuning_supe.py --config.backup_entropy=False --env_name=humanoidmaze-medium-navigate-singletask-task1-v0 --config.num_min_qs=1 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --seed=1 --config.discount=0.995
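Results are typically averaged over several seeds; a seed sweep over any of the commands above is a one-line loop. A minimal sketch for the AntMaze command (prints each command; replace echo with eval, or a scheduler submission, to launch — the seed count here is illustrative, not the paper's):

```shell
#!/usr/bin/env bash
# Sweep seeds for the AntMaze Large run; adjust the seed list as needed.
for seed in 1 2 3 4; do
  cmd="python train_finetuning_supe.py --config.backup_entropy=False \
--env_name=antmaze-large-diverse-v2 --config.num_min_qs=1 \
--offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True \
--seed=${seed}"
  echo "$cmd"  # dry run; use `eval "$cmd"` to launch
done
```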
To run the baseline Online w/ Trajectory Skills, use the same commands as above, but add offline_ratio=0 and set use_rnd_offline=False. For example, on AntMaze:
python train_finetuning_supe.py --config.backup_entropy=False --env_name=antmaze-large-diverse-v2 --config.num_min_qs=1 --offline_relabel_type=min --use_rnd_offline=False --use_rnd_online=True --seed=1 --offline_ratio=0
The HILP skills were pretrained using the official codebase (https://github.com/seohongpark/HILP), and the pretrained checkpoints can be downloaded using create_env.sh. To run the HILP baselines, use the train_finetuning_supe_hilp.py and train_finetuning_supe_pixels_hilp.py scripts with the same command parameters as Ours / Online w/ Trajectory Skills. For example, to benchmark on AntMaze, run the following commands:
python train_finetuning_supe_hilp.py --config.backup_entropy=False --env_name=antmaze-large-diverse-v2 --config.num_min_qs=1 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --seed=1
python train_finetuning_supe_hilp.py --config.backup_entropy=False --env_name=antmaze-large-diverse-v2 --config.num_min_qs=1 --offline_relabel_type=min --use_rnd_offline=False --use_rnd_online=True --seed=1 --offline_ratio=0
python train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=1 --project_name=explore --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --env_name=antmaze-large-diverse-v2 --seed=1 --rnd_config.coeff=2
python train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=2 --project_name=explore --offline_relabel_type=pred --use_rnd_offline=True --use_rnd_online=True --env_name=kitchen-mixed-v0 --seed=1 --rnd_config.coeff=2 --config.init_temperature=1.0
python train_finetuning_explore_pixels.py --config.backup_entropy=False --config.num_min_qs=1 --config.num_qs=10 --project_name=explore-pixels --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --seed=1 --env_name=antmaze-large-diverse-v2 --use_icvf=True --rnd_config.coeff=2
python3 train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=2 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --rnd_config.coeff=2 --config.discount=0.995 --env_name=cube-single-play-singletask-task1-v0 --seed=1
python3 train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=2 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --rnd_config.coeff=2 --config.discount=0.995 --env_name=cube-double-play-singletask-task1-v0 --seed=1
python3 train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=2 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --rnd_config.coeff=2 --config.discount=0.995 --env_name=scene-play-singletask-task1-v0 --seed=1
python3 train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=1 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --env_name=antsoccer-arena-navigate-singletask-task1-v0 --seed=1 --rnd_config.coeff=2 --config.discount=0.995
python3 train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=1 --offline_relabel_type=min --use_rnd_offline=True --use_rnd_online=True --env_name=humanoidmaze-medium-navigate-singletask-task1-v0 --seed=1 --rnd_config.coeff=2 --config.discount=0.995
To run the Online baseline, use the same commands as for ExPLORe, but add offline_ratio=0 and set use_rnd_offline=False. For example, on AntMaze:
python train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=1 --project_name=explore --offline_relabel_type=min --use_rnd_offline=False --use_rnd_online=True --env_name=antmaze-large-diverse-v2 --seed=1 --rnd_config.coeff=2 --offline_ratio=0
python train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=1 --project_name=diff_bc_jsrl --offline_relabel_type=min --use_rnd_offline=False --use_rnd_online=True --env_name=antmaze-large-diverse-v2 --seed=1 --rnd_config.coeff=2 --offline_ratio=0 --jsrl_ratio=0.9 --jsrl_discount=0.99 --config.init_temperature=1.0
python train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=2 --project_name=diff_bc_jsrl --offline_relabel_type=pred --use_rnd_offline=False --use_rnd_online=True --env_name=kitchen-mixed-v0 --seed=1 --rnd_config.coeff=2.0 --config.init_temperature=1.0 --offline_ratio=0 --jsrl_ratio=0.75
python train_finetuning_explore_pixels.py --config.backup_entropy=False --config.num_min_qs=1 --config.num_qs=10 --project_name=diff_bc_jsrl_pixels --offline_relabel_type=min --use_rnd_offline=False --use_rnd_online=True --seed=1 --env_name=antmaze-large-diverse-v2 --offline_ratio=0 --updates_per_step=2 --use_icvf=True --rnd_config.coeff=2 --jsrl_ratio=0.9
python3 train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=2 --offline_relabel_type=min --use_rnd_offline=False --use_rnd_online=True --env_name=cube-single-play-singletask-task1-v0 --seed=1 --rnd_config.coeff=2 --offline_ratio=0 --jsrl_ratio=0.5 --jsrl_discount=0.995 --config.init_temperature=1.0 --config.discount=0.995
python3 train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=2 --offline_relabel_type=min --use_rnd_offline=False --use_rnd_online=True --env_name=cube-double-play-singletask-task1-v0 --seed=1 --rnd_config.coeff=2 --offline_ratio=0 --jsrl_ratio=0.5 --jsrl_discount=0.995 --config.init_temperature=1.0 --config.discount=0.995
python3 train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=2 --offline_relabel_type=min --use_rnd_offline=False --use_rnd_online=True --env_name=scene-play-singletask-task1-v0 --seed=1 --rnd_config.coeff=2 --offline_ratio=0 --jsrl_ratio=0.5 --jsrl_discount=0.995 --config.init_temperature=1.0 --config.discount=0.995
python3 train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=1 --offline_relabel_type=min --use_rnd_offline=False --use_rnd_online=True --env_name=antsoccer-arena-navigate-singletask-task1-v0 --seed=1 --rnd_config.coeff=2 --offline_ratio=0 --jsrl_ratio=0.9 --jsrl_discount=0.995 --config.init_temperature=1.0 --config.discount=0.995
python3 train_finetuning_explore.py --config.backup_entropy=False --config.num_min_qs=1 --offline_relabel_type=min --use_rnd_offline=False --use_rnd_online=True --env_name=humanoidmaze-medium-navigate-singletask-task1-v0 --seed=1 --rnd_config.coeff=2 --offline_ratio=0 --jsrl_ratio=0.75 --jsrl_discount=0.995 --config.init_temperature=1.0 --config.discount=0.995
@inproceedings{
  wilcoxson2025leveraging,
  title={Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration},
  author={Max Wilcoxson and Qiyang Li and Kevin Frans and Sergey Levine},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025},
  url={https://arxiv.org/abs/2410.18076}
}