Project Website · Paper · Datasets · Clean Offline RLHF
This is the Uni-RLHF platform implementation of the paper Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback by Yifu Yuan, Jianye Hao, Yi Ma, Zibin Dong, Hebin Liang, Jinyi Liu, Zhixin Feng, Kai Zhao, and Yan Zheng, with contributions by Thomas Frank. Uni-RLHF aims to provide a complete workflow built on real human feedback, fostering progress in the development of RLHF in the decision-making domain. Here we develop a user-friendly annotation interface tailored to various feedback types and compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline of crowdsourced annotations, resulting in a large-scale annotated dataset (≈15 million steps). We also provide offline RLHF baselines using the collected feedback datasets and various design choices in Clean Offline RLHF.
Table of Contents
The Uni-RLHF platform consists of a Vue front-end and a Flask back-end. We also support a wide range of mainstream RL environments for annotation. The system only works on Linux.
- Clone the repo
git clone https://github.com/thomas475/Uni-RLHF.git
cd Uni-RLHF

- Setup Anaconda environment
conda create -n rlhf python==3.9
conda activate rlhf
- Install Dependencies
pip install -r requirements.txt
- Install NPM packages
cd uni_rlhf/vue_part
npm install

- Install hdf5
conda install anaconda::hdf5
- Install Redis
sudo apt-get install redis
- Configure the MySQL Database (adjust [user] and [password] accordingly)
mysql -u [user] -p

Then, in the MySQL environment enter:

CREATE DATABASE uni_rlhf;
exit

Afterwards, navigate to scripts/create_table.py and update the cfg variable:

cfg = {
    'host': 'localhost',
    'port': 3306,
    'username': [user],
    'password': [password],
    'database_name': 'uni_rlhf'
}

Then, execute the script:

cd scripts
python create_table.py

Finally, go to uni_rlhf/config.py and set app.config['SQLALCHEMY_DATABASE_URI'] to 'mysql://[user]:[password]@localhost/uni_rlhf'.
Many of the datasets use MuJoCo as their environment, so it should be installed too. See this for further details.
- Download the MuJoCo library:
wget https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz
- Create the MuJoCo folder:
mkdir ~/.mujoco

- Extract the library to the MuJoCo folder:

tar -xvf mujoco210-linux-x86_64.tar.gz -C ~/.mujoco/

- Add environment variables (run nano ~/.bashrc):

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin
export MUJOCO_GL=egl
- Reload the .bashrc file to register the changes.
source ~/.bashrc
- Install dependencies:
conda install -c conda-forge patchelf fasteners cython==0.29.37 cffi pyglfw libllvm11 imageio glew glfw mesalib
sudo apt-get install libglew-dev
- Test that the library is installed.
cd ~/.mujoco/mujoco210/bin
./simulate ../model/humanoid.xml
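In addition to the interactive simulator test, you can check the Python bindings with a short script. This is only a sketch; it assumes mujoco-py is installed (it comes with the D4RL dependencies below) and uses the install location from the steps above.

```python
# Optional check that mujoco-py can load the library installed in ~/.mujoco.
import os
import mujoco_py

model_path = os.path.expanduser("~/.mujoco/mujoco210/model/humanoid.xml")
model = mujoco_py.load_model_from_path(model_path)
sim = mujoco_py.MjSim(model)
sim.step()
print(sim.data.qpos[:3])  # should print finite numbers without raising errors
```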
Uni-RLHF supports the following classic datasets; a full list of all tasks is available here. Uni-RLHF also supports uploading customized datasets, as long as the dataset contains observations and terminals keys.
- Install D4RL dependencies. Note that we adapted the code to make the kitchen tasks work and made some small changes to the camera view for better visualisations.

cd d4rl
pip install -e .
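Once installed, you can verify that an offline dataset loads correctly. This is a minimal sketch, assuming gym is available; hopper-medium-v2 is used only as an example task.

```python
# Load a D4RL dataset and inspect the keys the platform relies on.
import gym
import d4rl  # noqa: F401  (importing registers the offline environments)

env = gym.make("hopper-medium-v2")  # example task; any supported D4RL task works
dataset = env.get_dataset()
print(dataset["observations"].shape, dataset["terminals"].shape)
```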
- Install Atari dependencies.

pip install git+https://github.com/takuseno/d4rl-atari
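As a quick check, d4rl-atari exposes the same get_dataset interface. A minimal sketch, using breakout-mixed-v0 only as an example task name:

```python
# Load an Atari offline dataset via d4rl-atari.
import gym
import d4rl_atari  # noqa: F401  (importing registers the Atari offline environments)

env = gym.make("breakout-mixed-v0")  # example task name
dataset = env.get_dataset()
print(dataset["observations"].shape, dataset["terminals"].shape)
```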
- Install V-D4RL dependencies. Note that V-D4RL provides image-based datasets; the full datasets can be found on Google Drive and must be downloaded before running the code. The expected file structure is:
uni_rlhf
└───datasets
│   └───dataset_resource
│       └───vd4rl
│       │   └───cheetah
│       │   │   └───cheetah_run_medium
│       │   │   └───cheetah_run_medium_expert
│       │   └───humanoid
│       │   │   └───humanoid_walk_medium
│       │   │   └───humanoid_walk_medium_expert
│       │   └───walker
│       │       └───walker_walk_medium
│       │       └───walker_walk_medium_expert
│       └───smarts
│           └───cruise
│           └───curin
│           └───left_c
└───vue_part
│   ...
└───controllers
│   ...
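After downloading, a small helper like the following can confirm the folders ended up where the platform expects them. This is only an illustrative sketch (not part of the repository), run from the repository root, and it checks a subset of the tree above.

```python
# Hypothetical helper: check the V-D4RL folders against the tree shown above.
from pathlib import Path

root = Path("uni_rlhf/datasets/dataset_resource/vd4rl")
expected = [
    "cheetah/cheetah_run_medium",
    "humanoid/humanoid_walk_medium",
    "walker/walker_walk_medium",
]
for rel in expected:
    print(rel, "found" if (root / rel).is_dir() else "MISSING")
```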
- Install MiniGrid dependencies. These are the same dependencies as for the D4RL datasets.
- Install SMARTS dependencies. We employed online reinforcement learning algorithms to train two agents for dataset collection, each designed specifically for its respective scenario. The first agent demonstrates medium driving proficiency, achieving a success rate between 40% and 80% in its designated scenario. In contrast, the second agent exhibits expert-level performance, attaining a success rate of 95% or higher in the same scenario. For dataset construction, 800 driving trajectories were collected using the intermediate agent, and an additional 200 were gathered with the expert agent. By integrating the two, we compiled a mixed dataset of 1,000 driving trajectories. We upload the full datasets, containing both image data (for rendering) and vector data (for training), to Google Drive. These must be downloaded before running the code. The expected file structure is the same as for the V-D4RL datasets.
- Upload customization datasets. Custom datasets must be in hdf5 format and contain observations and terminals keys:

observations: An N by observation-dimensional array of observations.
terminals: An N-dimensional array of episode termination flags.
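For reference, a compliant custom dataset can be written with h5py in a few lines. This is a minimal sketch with made-up data; only the observations and terminals keys are required by the uploader.

```python
# Write a custom dataset in the expected hdf5 format (illustrative data only).
import h5py
import numpy as np

n_steps, obs_dim = 1000, 17
observations = np.random.randn(n_steps, obs_dim).astype(np.float32)
terminals = np.zeros(n_steps, dtype=bool)
terminals[499] = terminals[-1] = True  # mark the end of each episode

with h5py.File("my_custom_dataset.hdf5", "w") as f:
    f.create_dataset("observations", data=observations)
    f.create_dataset("terminals", data=terminals)
```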
Start the MySQL server:
sudo systemctl start mysql.service
mysql -u [user] -p
Start the Redis server (first stop any instance managed by the system service, then launch the server manually):
sudo service redis-server stop
redis-server
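To verify Redis is reachable before launching the app, a one-line check from Python works. This is a sketch and assumes the redis Python package is installed in the rlhf environment.

```python
import redis

# True means the local Redis server answered the PING command.
print(redis.Redis(host="localhost", port=6379).ping())
```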
Now you can run the app from the base directory with:
conda activate rlhf
python run.py

The app is running at:

http://localhost:8503

You can kill all related processes with:

python scripts/kill_process.py

If you run the application remotely, you can access the interface through either port forwarding or public access. Make sure that ports 8502 and 8503 are open on the server.
Open the terminal on your local machine and enter the following (adjust [user] and [server-ip] accordingly):
ssh -L 8502:localhost:8502 -L 8503:localhost:8503 [user]@[server-ip]
Then you can access the interface at http://localhost:8503.
You can also make the application publicly accessible through its IP address. For this you need to modify the following variables before running the platform (adjust [server-ip] accordingly):
- in uni_rlhf/config.json set the value of baseUrl to [server-ip]
- in uni_rlhf/vue_part/src/store/index.js, in the JSON initialState, set the value of baseUrl to http://[server-ip]:8502
Then you can access the interface at http://[server-ip]:8503.
NOTE: If [server-ip] is an IPv6 address and you use it in an HTTP request, you need to put brackets around it. For example, if the IPv6 address is 0:0:0:0:0:0:0:1, the corresponding HTTP request would be http://[0:0:0:0:0:0:0:1]:8502.
- Specially tailored pipelines and tasks for reinforcement learning and decision-making problems.
- A clean pipeline designed for employer-annotator coordination.
- Supports multi-user synchronized labeling and conflict-free export.
- Supports a large number of mainstream decision-making datasets and lets you easily customize and upload your own datasets.
- Supports several mainstream feedback types for decision-making problems and provides configurable label formats that let you combine new ways of giving feedback.
We support several built-in environments and datasets. See config for the expected name formatting of full domains and tasks.
We support five common feedback types and propose a standardized feedback encoding format that specifies how annotators interact with these types and how the feedback is encoded. Additionally, we briefly outline the potential forms and applications of reinforcement learning that integrate various forms of human feedback in the Uni-RLHF paper.
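As a purely illustrative example (not the platform's actual export schema), a pairwise comparative feedback label could be encoded along these lines, where an annotator states a preference between two trajectory segments:

```python
# Hypothetical encoding of one comparative feedback label; field names are
# illustrative and do not reflect the platform's real export format.
comparative_label = {
    "segment_a": {"episode_id": 12, "start_step": 100, "end_step": 200},
    "segment_b": {"episode_id": 37, "start_step": 400, "end_step": 500},
    "preference": 0,  # 0 -> segment_a preferred, 1 -> segment_b, -1 -> equally good
}
```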
Thanks to Uni-RLHF, we establish a systematic pipeline of crowdsourced annotations, resulting in an open-source and reusable large-scale annotated dataset (≈15 million steps). We then run offline RL baselines using the collected feedback datasets; for these, we refer to the offline RLHF baselines in the sister repository. We wish to build valuable open-source platforms, datasets, and baselines to facilitate the development of more robust and reliable RLHF solutions for decision making based on realistic human feedback.
For more examples, please refer to the Documentation.
- Support auto reward model training process
- Fix online training bug
- Adapt the sampler to the new code framework
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
For any questions, please feel free to email [email protected].
If you find our work useful, please consider citing:
@inproceedings{anonymous2023unirlhf,
title={Uni-{RLHF}: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback},
author={Yuan, Yifu and Hao, Jianye and Ma, Yi and Dong, Zibin and Liang, Hebin and Liu, Jinyi and Feng, Zhixin and Zhao, Kai and Zheng, Yan},
booktitle={The Twelfth International Conference on Learning Representations, ICLR},
year={2024},
url={https://openreview.net/forum?id=WesY0H9ghM},
}



