MDP Playground integration into bsuite #38
base: main
* Integrating MDP Playground environments into bsuite:
  Added MDP Playground environments and experiments into bsuite
  Analysis Jupyter notebook:
  Added a spoke in the bsuite spider plot for MDP Playground environments.
  Added additional analyses cells for individual MDP Playground experiments: Delay, Transition Noise, Reward Noise, Reward Sparsity, Rewardable Sequence Length
* Removed Jupyter notebook conflicts with deepmind:master
* Removed .gitignore conflicts with deepmind:master

Co-authored-by: suresh-guttikonda <[email protected]>
Co-authored-by: guttikon <[email protected]>
…eriments in bsuite.
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info.
@googlebot I fixed it.
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info.
@googlebot I fixed it.
@@ -43,8 +43,8 @@
# algorithm
flags.DEFINE_integer('seed', 42, 'seed for random number generation')
flags.DEFINE_integer('num_hidden_layers', 2, 'number of hidden layers')
flags.DEFINE_integer('num_units', 64, 'number of units per hidden layer')
flags.DEFINE_float('learning_rate', 1e-2, 'the learning rate')
flags.DEFINE_integer('num_units', 50, 'number of units per hidden layer')
We should do any agent changes in a separate commit
@@ -0,0 +1,108 @@
# pylint: disable=g-bad-file-header
# Copyright 2019 .... All Rights Reserved.
2021 :)
from mdp_playground.envs import RLToyEnv #mdp_playground

# import collections
Please remove the commented-out import
import numpy as np
from typing import Any

# def ohe_observation(obs):
Ditto
# import collections
from bsuite.experiments.mdp_playground import sweep
from bsuite.environments import base
from bsuite.utils.gym_wrapper import DMEnvFromGym, space2spec
See https://google.github.io/styleguide/pyguide.html#22-imports
So the import would be:
from bsuite.utils import gym_wrapper
And the usage would be
gym_wrapper.DMEnvFromGym(...)
# def ohe_observation(obs):

class DM_RLToyEnv(base.Environment):
"""A wrapper to convert an RLToyEnv Gym environment from MDP Playground to a
It's sufficient to write "A dm_env wrapper for the Gym RLToyEnv." here.
self.bsuite_num_episodes = sweep.NUM_EPISODES

super(DM_RLToyEnv, self).__init__()
# Convert gym action and observation spaces to dm_env specs.
There's some more commented-out code here and below.
else:
return dm_env.transition(dm_env_step.reward, ohe_obs)

def _step(self, action: int) -> dm_env.TimeStep:
This class should implement _step() and _reset() directly, i.e. rename step() and reset() above. That way the reset-at-end-of-episode behaviour will work properly.
You can inherit from dm_env.Environment and implement the step() and reset() methods, but you then have to do the book-keeping for the episode boundaries.
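A minimal sketch of the book-keeping this comment refers to, assuming a base class shaped roughly like bsuite's base.Environment. TimeStep, BaseEnvironment and CountdownEnv are hypothetical stand-ins written only for this illustration; they are not the actual dm_env or bsuite classes:

```python
from typing import NamedTuple


class TimeStep(NamedTuple):
    """Hypothetical stand-in for dm_env.TimeStep, reduced to what the sketch needs."""
    is_last: bool
    reward: float
    observation: int


class BaseEnvironment:
    """Stand-in base class: owns step()/reset() and the episode-boundary flag."""

    def __init__(self) -> None:
        # The very first step() call has to start an episode.
        self._reset_next_step = True

    def reset(self) -> TimeStep:
        self._reset_next_step = False
        return self._reset()

    def step(self, action: int) -> TimeStep:
        # After a terminal step, the next call silently starts a new episode.
        if self._reset_next_step:
            return self.reset()
        timestep = self._step(action)
        self._reset_next_step = timestep.is_last
        return timestep

    def _reset(self) -> TimeStep:
        raise NotImplementedError

    def _step(self, action: int) -> TimeStep:
        raise NotImplementedError


class CountdownEnv(BaseEnvironment):
    """Toy subclass: implements only _reset() and _step()."""

    def __init__(self, length: int = 2) -> None:
        super().__init__()
        self._length = length
        self._t = 0

    def _reset(self) -> TimeStep:
        self._t = 0
        return TimeStep(is_last=False, reward=0.0, observation=self._t)

    def _step(self, action: int) -> TimeStep:
        self._t += 1
        done = self._t >= self._length
        return TimeStep(is_last=done, reward=1.0, observation=self._t)
```

Because the subclass only defines _step() and _reset(), it never has to track whether the previous step was terminal; the base class does that once for every environment.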
base.Environment which is a subclass of dm_env.Environment.
Based on the DMEnvFromGym in gym_wrapper.py"""

def __init__(self, max_episode_len: int = 100, **config: Any):
Could you rename config to something like gym_make_kwargs? Ideally the type would be something like Mapping[str, Any] too.
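One way the rename could look, as a sketch only. ToyEnvWrapper and make_wrapper are hypothetical names used for illustration and are not part of the PR. Note that for a `**` parameter the annotation describes each value, so Mapping[str, Any] applies when the settings are passed as a single argument:

```python
from typing import Any, Dict, Mapping


class ToyEnvWrapper:
    """Hypothetical stand-in for the wrapper under review."""

    def __init__(self, max_episode_len: int = 100, **gym_make_kwargs: Any) -> None:
        # With **kwargs, the annotation is the type of each value, hence Any.
        self.max_episode_len = max_episode_len
        self.gym_make_kwargs: Dict[str, Any] = dict(gym_make_kwargs)


def make_wrapper(
    gym_make_kwargs: Mapping[str, Any], max_episode_len: int = 100
) -> ToyEnvWrapper:
    # When the settings arrive as one argument, Mapping[str, Any] is the type.
    return ToyEnvWrapper(max_episode_len=max_episode_len, **gym_make_kwargs)
```

Callers can then pass the environment settings directly as keyword arguments, for example `ToyEnvWrapper(state_space_type="discrete")`, without building an intermediate dict.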
# ============================================================================
"""Analysis for MDP Playground."""

###TODO change to mdpp stuff below
What does this TODO mean?
regret_score = plotting.ave_regret_score(
df, baseline_regret=BASE_REGRET, episode=NUM_EPISODES)

norm_score = 1.0 * regret_score # 2.5 was heuristically chosen value to get Sonnet DQN to score approx. 0.75, so that better algorithms like Rainbow can get score close to 1. With a bigger NN this would mean an unclipped score of 1.1 for Sonnet DQN, which is fair I think. However, a2c_rnn even reached 2.0 on this scale. DQN may be not performing as well because its epsilon is not annealed to 0.
I don't understand the comment I'm afraid. Was the value here 2.5 at some other point?
norm_score = 1.0 * regret_score # 2.5 was heuristically chosen value to get Sonnet DQN to score approx. 0.75, so that better algorithms like Rainbow can get score close to 1. With a bigger NN this would mean an unclipped score of 1.1 for Sonnet DQN, which is fair I think. However, a2c_rnn even reached 2.0 on this scale. DQN may be not performing as well because its epsilon is not annealed to 0.
print("unclipped score:", norm_score)
norm_score = np.clip(norm_score, 0, 1)
nit: Prefer return np.clip(...) rather than creating a variable and returning it on the next line.
class InterfaceTest(test_utils.EnvironmentTestMixin, absltest.TestCase):

def make_object_under_test(self):
config = {}
There's no need to create config here. You can write
return mdp_playground.DM_RLToyEnv(
    state_space_type="discrete",
    action_space_type="discrete",
    # ...etc
)
# Need to have full config, including: S, A,; explicitly state all of them for backward compatibility.

config = {}
# config["seed"] = 0
Let's aim to remove all the commented-out stuff. I'll stop commenting on it here and let you have a look for other instances :)
_SETTINGS = []
delays = [0, 1, 2, 4, 8]
for i in range(5):
for delay in delays?
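A sketch of what iterating over the values directly might look like. The shape of each settings entry is an assumption here: the 'delay' key is illustrative only and may not match the dict the PR actually builds:

```python
# Hypothetical sweep definition; the 'delay' key is illustrative only.
delays = [0, 1, 2, 4, 8]

# Iterate over the values themselves instead of indexing with range(5),
# so the loop stays correct if entries are added to or removed from delays.
_SETTINGS = [{"delay": delay} for delay in delays]
SETTINGS = tuple(_SETTINGS)
```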
Hi Raghu, I've read through some of the files and left a few initial comments, just on (Google) Python style and so on. I'm unsure about adding to the "main" bsuite.py from the perspective of expanding dependencies. For example, the gym dependency did not exist before https://github.com/deepmind/bsuite#using-bsuite-in-openai-gym-format.
Let's go through Ian's thoughts via email first about the experiments themselves, then figure out where to go with this proposal.
Hi @iosband and @yotam,
Hope you're doing well!
Following our discussions, I added MDP Playground experiments into bsuite for the following dimensions of MDP Playground: Delay, Transition Noise, Reward Noise, Reward Sparsity, Rewardable Sequence Length
Here is a short summary of the changes we have made:
* Added MDP Playground environments and experiments into bsuite
* Updated analysis Jupyter notebook:
  * Added a spoke in the bsuite spider plot for MDP Playground environments
  * Added additional analysis cells for individual MDP Playground experiments
* Added a guide on how to add new experiments into bsuite to CONTRIBUTING.md
* Updated setup.py to include mdp-playground as a dependency
* Changed HPs for A2C because performance was very noisy with the old ones
* Removed Jupyter notebook conflicts with deepmind:master
* Removed .gitignore conflicts with deepmind:master
Improvements still needed:
Please let us know your inputs and feedback on what to do next!
Best regards,
Raghu Rajan.
Co-authored-by: suresh-guttikonda [email protected]
Co-authored-by: guttikon [email protected]