Tongzhou Wang, Simon S. Du, Antonio Torralba, Phillip Isola, Amy Zhang, Yuandong Tian
We provide a PyTorch implementation of Denoised MDPs: Learning World Models Better Than The World Itself, published in ICML 2022.
(We also provide a PyTorch implementation of Dreamer that is carefully written and verified to reproduce results. See here for usages.)
The raw real world is noisy. How can reinforcement learning agent successfully learn with such raw data, where signals can be strongly entangled with noises? Denoised MDP characterizes information into four distinct types, based on controllability and relation with rewards, and proposes to extract a state representation space containing only information both controllable and reward-relevant. Under this view, several prior works can be seen as insufficiently removing noisy information.
To properly extract only the useful signal, Denoised MDP considers novel factorized MDP transition structures, where signal representation and noise representation are separated into distinct latent spaces. The state abstraction (i.e., representation learning) problem is turned into a regularized model fitting problem: fitting the factorized forward model to collected trajectories, while requiring the signal latents to be minimally informative of the raw observations.
The resulting variational formulation (derivation in paper) successfully disentangles a variety of noise types (and also noiseless settings), outperforming baseline methods that often can only do well for certain particular noise types.
For environments with distinct types of noises, we visualize latent factorization idenfitied by Denoised MDP, and other baseline methods. Only Denoised MDP successfully disentangle signal from noises across all environments.
-
Task: Press green button to shift TV hue to green (a RoboDesk variant).
True Signal: Robot joint position, TV green-ness, Green light on desk.
True Noise: Lighting, Camera, TV content, Imperfect sensor.robodesk_tv_green_hue_noisy.mp4
-
Task: Make reacher robot touch the target red object.
True Signal: Robot joint position, Target location.
True Noise: Background.dmc_reacher_easy_video_background.mp4
-
Task: Make the walker robot move forward when sensor readings are noisily affected by background images.
True Signal: Robot joint position.
True Noise: Background, Imperfect sensor.dmc_walker_walk_video_background_noisy_sensor.mp4
-
Task: Move half cheetah robot forward when camera is shaky.
True Signal: Robot joint position.
True Noise: Background, Camera.dmc_cheetah_run_video_background_camera_jitter.mp4
The code has been tested on
- CUDA 11 with NVIDIA RTX Titan, NVIDIA 2080Ti, and NVIDIA Titan XP,
mujoco=2.2.0
withegl
renderer.
Software dependencices (also in requirements.txt
):
torch>=1.9.0
tqdm
numpy>=1.17.0
PIL
tensorboardX>=2.5
attrs>=21.4.0
hydra-core==1.2.0
omegaconf==2.2.1
mujoco
dm_control
The code supports the following environments:
kind |
spec |
Description |
---|---|---|
robodesk |
${TASK_NAME} or ${TASK_NAME}_noisy (e.g., tv_green_hue_noisy ) |
RoboDesk environment (96x96 resolution) with a diverse set of distractors (when using ${TASK_NAME}_noisy variant). The distractors are implemented and descriped in details at this RoboDesk fork. |
dmc |
${DOMAIN_NAME}_${TASK_NAME}_${VARIANT} with VARIANT being one of [noiseless, video_background, video_background_noisy_sensor, video_background_camera_jitter] (e.g., cheetah_run_video_background_camera_jitter ) |
DeepMind Control (DMC) environment (64x64 resolution) with four possible variants, representing different types of noises. |
In the paper, we used the following 13 environments, all run with 1000 max episode length and action repeat of 2:
kind |
spec |
---|---|
robodesk |
tv_green_hue_noisy |
dmc |
${DOMAIN_NAME}_${TASK_NAME} in [cheetah_run, walker_walk, reacher_easy] each with all 4 VARIANT options |
NOTE: All noisy environments require the driving_car
class of the Kinetics-400 training dataset. Some instructions for downloading the dataset can be found here. After downloading, you may either place it under ~kinetics/070618/400
(so that the videos are ~kinetics/070618/400/train/driving_car/*.mp4
) or specify KINETICS_DIR
environment variatble (so that the videos are ${KINETICS_DIR}/train/driving_car/*.mp4
).
env CUDA_VISIBLE_DEVICES=0 \ # GPU ID for training
EGL_DEVICE_ID=0 \ # GPU ID for rendering
KINETICS_DIR=/path/to/kinectics/ \ # Videos for noisy env
python main.py \
env.kind=robodesk \ # Env kind
env.spec=tv_green_hue_noisy \ # Env spec
learning.model_learning.kl.alpha=2 \ # alpha, weight of the KL terms
learning.model_learning.kl.beta_y=0.125 \ # beta, smaller => stronger regularization
learning.model_learning.kl.beta_z=0.125 \ # beta, smaller => stronger regularization
seed=12 \ # Seed
output_folder=subdir/for/output/ # [Optional] subdirectory under `./results`
# for storing outputs. If not given, a folder name will
# be automatically constructed with information from
# given config
Hyperparameter choice (see also Appendix A.2 for more details):
alpha
parameter is selected proportional to the size of observation. For DMC (64x64x3
observations), we usealpha=1
. For RoboDesk (96x96x3
observations), we usealpha=2
.beta
parameter (fory
andz
component) controls the regularization strength, and should be set in(0, 1)
. Noisier environments benefit from a smaller value.
Default behaviors:
- Train the Figure 2b Denoised MDP variant over
10^6
environment steps, with5000
steps prefilling the replay buffer, and then training for100
iterations for every100
steps. - Optimize policy bybackpropagation through dynamics (Dreamer-style). One can switch to Soft Actor-Critic via specifying
learning/policy_learning=sac
(note the/
rather than.
). - Evaluate for
10
episodes every10000
steps. - Visualize for
3
episodes (both full reconstruction and with noise latent fixed) every20000
steps.
We use Hydra to handle argument specification. You can use Hydra's overriding syntax to specify all sorts of config options. See config.py
for the complete set of options. Additionally, one can also check the config.yaml
file generated in output directory for all options.
To use the Figure 2c variant with three sets of latents x
, y
, and z
, set
learning.model.transition.z.belief_size={NONZERO_Z_BELIEF_SIZE} \
learning.model.transition.z.state_size={NONZERO_Z_STATE_SIZE} \
for some non-zero belief and state sizes for the z
latent component.
In the paper Appendix B.5, a variant similar to Figure 2c is compared, with z
prior not depending on y
. To reproduce those results on DeepMind Control Suite environments, set
learning.model.transition.x.belief_size=120 \
learning.model.transition.x.state_size=20 \
learning.model.transition.y.belief_size=70 \
learning.model.transition.y.state_size=10 \
learning.model.transition.z.belief_size=70 \
learning.model.transition.z.state_size=10 \
learning.model.transition.z_prior_uses_y=False \
When y
and z
latent spaces are completely turned off (i.e., empty), the code essentially is Dreamer. This can be done by setting
learning.model.transition.x.belief_size=200 \ # give `x` the dimensionality specified in Dreamer paper
learning.model.transition.x.state_size=30 \
learning.model.transition.y.belief_size=0 \
learning.model.transition.y.state_size=0 \
learning.model.transition.z.belief_size=0 \
learning.model.transition.z.state_size=0 \
To facilitate easier parsing and usage of this repository, we provide a detailed note on how our code is structured here.
Upon receiving SIGUSR1
, the provided code starts writing all necessary states (including replay buffer) into a folder under the output directory (usually taking up to 10 minutes), and exits naturally afterwards. When the code is run with the same output directory, it continues from that state (and deletes the saved state). This may be particularly useful if you are running on a shared cluster.
Tongzhou Wang, Simon S. Du, Antonio Torralba, Phillip Isola, Amy Zhang, Yuandong Tian. "Denoised MDPs: Learning World Models Better Than The World Itself" International Conference on Machine Learning. 2022.
@inproceedings{wang2022denoisedmdps,
title={Denoised MDPs: Learning World Models Better Than The World Itself},
author={Wang, Tongzhou and Du, Simon S. and Torralba, Antonio and Isola, Phillip and Zhang, Amy and Tian, Yuandong},
booktitle={International Conference on Machine Learning},
organization={PMLR},
year={2022}
}
If you find the RoboDesk distractor options (see this repository for more options and details) useful for your research, please also cite the following:
Click to show RoboDesk distractor bibtex
!
@misc{wang2022robodeskdistractor,
author = {Tongzhou Wang},
title = {RoboDesk with A Diverse Set of Distractors},
year = {2022},
howpublished = {\url{https://github.com/SsnL/robodesk}},
}
@misc{kannan2021robodesk,
author = {Harini Kannan and Danijar Hafner and Chelsea Finn and Dumitru Erhan},
title = {RoboDesk: A Multi-Task Reinforcement Learning Benchmark},
year = {2021},
howpublished = {\url{https://github.com/google-research/robodesk}},
}
For questions about the code provided in this repository, please open an GitHub issue.
For questions about the paper, please contact Tongzhou Wang (tongzhou _AT_ mit _DOT_ edu
).
This repo is under CC BY-NC 4.0. Please check LICENSE file.