Themis is an open-source testing and evaluation framework for Reinforcement Learning experiments using PyTorch. It supports many popular environments and can automatically configure the RL algorithm for continuous or discrete action spaces with minimal intervention from the user. Themis can also train a reward model from human preferences collected through our web-based crowdsourcing platform, making it well suited for interactive experiments. Finally, it is designed with explainability in mind and offers three ready-to-use attribution methods from Captum.
This repository contains the RLHF system. For our web-based crowdsourcing platform, visit: https://anonymous.4open.science/r/rlhf_crowdsourcing_platform-B862
- PyTorch
- Captum
- Gymnasium
- Minigrid
- Hydra
- termcolor
- moviepy
- Matplotlib
- Pandas
- MuJoCo (e.g. domain=Control, env=Humanoid-v4)
- Atari (e.g. domain=ALE, env=Breakout-v5)
- Box2d (e.g. domain=Box2d, env=BipedalWalker-v3)
- Minigrid (e.g. domain=Minigrid, env=DistShift1-v0)
- BabyAI (e.g. domain=BabyAI, env=GoToRedBallGrey-v0)
You can manually add more environments as long as they follow the Gymnasium (Gym) API; see the sketch below.
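A custom environment only needs to expose the standard Gymnasium interface (reset, step, action_space, observation_space). A minimal check using one of the supported Minigrid environments (the environment id is assumed from the Minigrid naming convention):

```python
import gymnasium as gym
import minigrid  # registers the MiniGrid-* environment ids

# Roll out a few random steps to verify the environment follows the Gymnasium API.
env = gym.make("MiniGrid-DistShift1-v0")
obs, info = env.reset(seed=0)
for _ in range(10):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```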
- Uniform Sampling (feed_type=0)
- Disagreement Sampling (feed_type=1)
- Entropy Sampling (feed_type=2)
- K Center (feed_type=3)
- K Center + Disagreement (feed_type=4)
- K Center + Entropy (feed_type=5)
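For intuition, disagreement sampling (feed_type=1) queries the clip pairs on which the reward-model ensemble disagrees the most. A minimal sketch of the idea (tensor names are illustrative, not Themis' exact implementation):

```python
import torch

def rank_by_disagreement(ensemble_probs: torch.Tensor, batch_size: int) -> torch.Tensor:
    """Select the clip pairs the reward-model ensemble is most unsure about.

    ensemble_probs: (n_members, n_pairs) tensor holding each ensemble member's
    predicted probability that the first clip of a pair is preferred.
    """
    disagreement = ensemble_probs.std(dim=0)            # high std = members disagree
    return torch.argsort(disagreement, descending=True)[:batch_size]

# Example: 3 ensemble members, 5 candidate pairs, query the 2 most disputed pairs.
probs = torch.tensor([[0.9, 0.5, 0.2, 0.6, 0.1],
                      [0.8, 0.4, 0.9, 0.6, 0.1],
                      [0.9, 0.6, 0.4, 0.5, 0.2]])
print(rank_by_disagreement(probs, batch_size=2))
```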
Experiments can be executed with the following scripts:
./themis_pretrain.sh
./themis_train.sh
Edit the files accordingly to change the experiment configuration.
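Since configuration is handled by Hydra, the scripts presumably forward Hydra-style overrides such as domain=..., env=..., and feed_type=.... A hypothetical entry point of that shape (the config name mirrors config/train_themis referenced below; the exact signature in the repository may differ):

```python
import hydra
from omegaconf import DictConfig

# Hypothetical Hydra entry point of the kind the training scripts invoke;
# cfg keys such as domain, env, and feed_type correspond to the options
# described in this README.
@hydra.main(config_path="config", config_name="train_themis", version_base=None)
def main(cfg: DictConfig) -> None:
    print(cfg.domain, cfg.env, cfg.feed_type)

if __name__ == "__main__":
    main()
```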
To run an experiment using a learned reward model, set the learn_reward flag to True; otherwise the environment reward will be used.
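Conceptually, this flag swaps the reward signal fed to the RL agent. A sketch of the idea (the reward_model object and its r_hat method are illustrative names, not necessarily Themis' API):

```python
import torch

def relabel_reward(cfg, reward_model, obs, action, env_reward):
    """Return the reward used for the policy update on a single transition."""
    if cfg.learn_reward:
        with torch.no_grad():
            # Learned reward: score the (state, action) pair with the reward model.
            return reward_model.r_hat(torch.cat([obs, action], dim=-1))
    # Otherwise fall back to the reward returned by the environment.
    return env_reward
```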
For experiments with real human feedback, be sure to also set the human_teacher flag to True.
The get_labels method in reward_model.py contains the logic to generate clips and receive input from the user. Explore the available tools in lib/human_interface.py.
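At a high level, each query renders two candidate clips, shows them to the annotator through the human interface, and records a preference. A simplified sketch (the function name, display helper, and return-value convention are all illustrative; the real interface in lib/human_interface.py is richer than a terminal prompt):

```python
def query_preference(clip_a, clip_b, show_clips):
    """Show two clips and return 0 (prefer A), 1 (prefer B), or -1 (skip)."""
    show_clips(clip_a, clip_b)                       # e.g. a viewer from lib/human_interface.py
    answer = input("Prefer [a], [b], or [s]kip? ").strip().lower()
    return {"a": 0, "b": 1}.get(answer, -1)
```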
To use the currently supported explainability techniques, set either the xplain_action or the xplain_state flag to True. Refer to lib/human_interface.py if you want to add more.
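For reference, a minimal example of the kind of Captum attribution these flags enable. Saliency is used purely for illustration, and policy / state stand in for the trained actor network and an observation (discrete action case):

```python
import torch
from captum.attr import Saliency

def explain_action(policy: torch.nn.Module, state: torch.Tensor) -> torch.Tensor:
    """Attribute the greedy action's logit to the input state features."""
    saliency = Saliency(policy)
    state = state.clone().requires_grad_(True)
    target = policy(state).argmax(dim=-1)    # index of the action being explained
    return saliency.attribute(state, target=target)
```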
Themis is based on BPref, so it incorporates the same logic for synthetic teachers. To adjust a synthetic teacher, tweak the relevant parameters in config/train_themis.py:
- teacher_beta: rationality constant of the stochastic preference model (default: -1 for a perfectly rational model)
- teacher_gamma: discount factor used to model myopic behavior (default: 1)
- teacher_eps_mistake: probability of making a mistake (default: 0)
- teacher_eps_skip: hyperparameter controlling the skip threshold (in [0, 1])
- teacher_eps_equal: hyperparameter controlling the equal threshold (in [0, 1])
Synthetic teacher examples:
- Oracle teacher: (teacher_beta=-1, teacher_gamma=1, teacher_eps_mistake=0, teacher_eps_skip=0, teacher_eps_equal=0)
- Mistake teacher: (teacher_beta=-1, teacher_gamma=1, teacher_eps_mistake=0.1, teacher_eps_skip=0, teacher_eps_equal=0)
- Noisy teacher: (teacher_beta=1, teacher_gamma=1, teacher_eps_mistake=0, teacher_eps_skip=0, teacher_eps_equal=0)
- Skip teacher: (teacher_beta=-1, teacher_gamma=1, teacher_eps_mistake=0, teacher_eps_skip=0.1, teacher_eps_equal=0)
- Myopic teacher: (teacher_beta=-1, teacher_gamma=0.9, teacher_eps_mistake=0, teacher_eps_skip=0, teacher_eps_equal=0)
- Equal teacher: (teacher_beta=-1, teacher_gamma=1, teacher_eps_mistake=0, teacher_eps_skip=0, teacher_eps_equal=0.1)
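Putting the parameters together, the synthetic teacher labels a clip pair roughly as sketched below. This follows the BPref-style preference model; exact details (in particular how the skip and equal thresholds are derived from teacher_eps_skip and teacher_eps_equal) may differ in Themis:

```python
import numpy as np

def synthetic_label(r1, r2, beta, gamma, eps_mistake, skip_thr, equal_thr):
    """Label a pair of per-step reward sequences r1, r2 (1-D numpy arrays).

    Returns 0 if the first clip is preferred, 1 if the second is,
    0.5 for an "equally good" label, or None if the query is skipped.
    """
    weights = gamma ** np.arange(len(r1))[::-1]        # discount earlier steps more (myopia)
    ret1, ret2 = float(np.sum(weights * r1)), float(np.sum(weights * r2))

    if max(ret1, ret2) < skip_thr:                     # both clips too poor: skip the query
        return None
    if abs(ret1 - ret2) < equal_thr:                   # too close to call: equal label
        return 0.5

    if beta > 0:                                       # stochastic (Bradley-Terry) preference
        p_first = 1.0 / (1.0 + np.exp(-beta * (ret1 - ret2)))
        label = 0 if np.random.rand() < p_first else 1
    else:                                              # beta = -1: perfectly rational choice
        label = 0 if ret1 > ret2 else 1

    if np.random.rand() < eps_mistake:                 # occasionally flip the answer
        label = 1 - label
    return label
```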