Skip to content
This repository has been archived by the owner on Oct 7, 2024. It is now read-only.

MDP Playground integration into bsuite #38

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

RaghuSpaceRajan
Copy link

Hi @iosband and @yotam,

Hope you're doing well!

Following our discussions, I added MDP Playground experiments into bsuite for the following dimensions of MDP Playground: Delay, Transition Noise, Reward Noise, Reward Sparsity, Rewardable Sequence Length

Here is a short summary of the changes we have made:

  • Added MDP Playground environments and experiments into bsuite
    Updated analysis Jupyter notebook:

    • Added a spoke in the bsuite spider plot for MDP Playground environments.
    • Added additional analyses cells for individual MDP Playground experiments:
      • Delay, Transition Noise, Reward Noise, Reward Sparsity, Rewardable Sequence Length
  • Added a guide on how to add new experiments into bsuite to CONTRIBUTING.md

  • Updated setup.py to include mdp-playground as a dependency

  • Changed HPs for A2C because performance was very noisy with the old ones

  • Removed Jupyter notebook conflicts with deepmind:master

  • Removed .gitignore conflicts with deepmind:master

Improvements still needed:

  • The code still needs to conform to the Python style that bsuite follows.
  • The tests probably need some improvement.

Please let us know your inputs and feedback on what to do next!

Best regards,
Raghu Rajan.

Co-authored-by: suresh-guttikonda [email protected]
Co-authored-by: guttikon [email protected]

RaghuSpaceRajan and others added 5 commits March 30, 2021 18:24
* Integrating MDP Playground environments into bsuite:
  Added MDP Playground environments and experiments into bsuite
  Analysis Jupyter notebook:
    Added a spoke in the bsuite spider plot for MDP Playground environments.
    Added additional analyses cells for individual MDP Playground experiments:
    Delay, Transition Noise, Reward Noise, Reward Sparsity, Rewardable Sequence Length

* Removed Jupyter notebook conflicts with deepmind:master

* Removed  .gitignore conflicts with deepmind:master


Co-authored-by: suresh-guttikonda <[email protected]>
Co-authored-by: guttikon <[email protected]>
@google-cla
Copy link

google-cla bot commented Mar 30, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added the cla: no label Mar 30, 2021
@sguttikon
Copy link

@googlebot I fixed it.

@google-cla
Copy link

google-cla bot commented Mar 31, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@sguttikon
Copy link

@googlebot I fixed it.

@google-cla google-cla bot added cla: yes and removed cla: no labels Mar 31, 2021
@RaghuSpaceRajan
Copy link
Author

Hi @iosband and @yotam,

Could you please restart the workflow tests? I managed to fix the failing pytype tests in the latest commit and pytest tests are also passing locally.

@@ -43,8 +43,8 @@
# algorithm
flags.DEFINE_integer('seed', 42, 'seed for random number generation')
flags.DEFINE_integer('num_hidden_layers', 2, 'number of hidden layers')
flags.DEFINE_integer('num_units', 64, 'number of units per hidden layer')
flags.DEFINE_float('learning_rate', 1e-2, 'the learning rate')
flags.DEFINE_integer('num_units', 50, 'number of units per hidden layer')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do any agent changes in a separate commit

@@ -0,0 +1,108 @@
# pylint: disable=g-bad-file-header
# Copyright 2019 .... All Rights Reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2021 :)


from mdp_playground.envs import RLToyEnv #mdp_playground

# import collections
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the commented-out import

import numpy as np
from typing import Any

# def ohe_observation(obs):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

# import collections
from bsuite.experiments.mdp_playground import sweep
from bsuite.environments import base
from bsuite.utils.gym_wrapper import DMEnvFromGym, space2spec
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://google.github.io/styleguide/pyguide.html#22-imports

So the import would be:

from bsuite.utils import gym_wrapper

And the usage would be

gym_wrapper.DMEnvFromGym(...)

# def ohe_observation(obs):

class DM_RLToyEnv(base.Environment):
"""A wrapper to convert an RLToyEnv Gym environment from MDP Playground to a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's sufficient to write "A dm_env wrapper for the Gym RLToyEnv." here.

self.bsuite_num_episodes = sweep.NUM_EPISODES

super(DM_RLToyEnv, self).__init__()
# Convert gym action and observation spaces to dm_env specs.
Copy link
Collaborator

@yotam yotam Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some more commented-out code here and below.

else:
return dm_env.transition(dm_env_step.reward, ohe_obs)

def _step(self, action: int) -> dm_env.TimeStep:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class should implement _step() and _reset() directly, i.e. rename step() and reset() above. That way the reset-at-end-of-episode behaviour will work properly.

You can inherit from dm_env.Environment and implement the step() and reset() methods, but you then have to do the book-keeping for the episode boundaries.

base.Environment which is a subclass of dm_env.Environment.
Based on the DMEnvFromGym in gym_wrapper.py"""

def __init__(self, max_episode_len: int = 100, **config: Any):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you rename config to something like gym_make_kwargs? Ideally the type would be something like Mapping[str, Any] too.

# ============================================================================
"""Analysis for MDP Playground."""

###TODO change to mdpp stuff below
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this TODO mean?

regret_score = plotting.ave_regret_score(
df, baseline_regret=BASE_REGRET, episode=NUM_EPISODES)

norm_score = 1.0 * regret_score # 2.5 was heuristically chosen value to get Sonnet DQN to score approx. 0.75, so that better algorithms like Rainbow can get score close to 1. With a bigger NN this would mean an unclipped score of 1.1 for Sonnet DQN, which is fair I think. However, a2c_rnn even reached 2.0 on this scale. DQN may be not performing as well because its epsilon is not annealed to 0.
Copy link
Collaborator

@yotam yotam Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the comment I'm afraid. Was the value here 2.5 at some other point?


norm_score = 1.0 * regret_score # 2.5 was heuristically chosen value to get Sonnet DQN to score approx. 0.75, so that better algorithms like Rainbow can get score close to 1. With a bigger NN this would mean an unclipped score of 1.1 for Sonnet DQN, which is fair I think. However, a2c_rnn even reached 2.0 on this scale. DQN may be not performing as well because its epsilon is not annealed to 0.
print("unclipped score:", norm_score)
norm_score = np.clip(norm_score, 0, 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Prefer return np.clip(...) rather than creating a variable and returning it on the next line.

class InterfaceTest(test_utils.EnvironmentTestMixin, absltest.TestCase):

def make_object_under_test(self):
config = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to create config here. You can write

return mdp_playground.DM_RLToyEnv(
    state_space_type="discrete",
    action_space_type="discrete",
    # ...etc
)

# Need to have full config, including: S, A,; explicitly state all of them for backward compatibility.

config = {}
# config["seed"] = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's aim to remove all the commented-out stuff. I'll stop commenting on it here and let you have a look for other instances :)


_SETTINGS = []
delays = [0, 1, 2, 4, 8]
for i in range(5):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for delay in delays?

Copy link
Collaborator

@yotam yotam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Raghu, I've read through some of the files and left a few initial comments, just on (Google) Python style and so on. I'm unsure about adding to the "main" bsuite.py from the perspective of expanding dependencies. For example, the gym dependency did not exist before https://github.com/deepmind/bsuite#using-bsuite-in-openai-gym-format.

Let's go through Ian's thoughts via email first about the experiments themselves, then figure out where to go with this proposal.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants