feat: replace OpenAI Gym with Gymnasium (#255)
This commit replaces the (unmaintained) [OpenAI Gym](https://github.com/openai/gym) package with the
new [Gymnasium](https://github.com/Farama-Foundation/Gymnasium) package.

BREAKING CHANGE: This package now depends on Gymnasium instead of Gym. It also requires
Gymnasium>=0.26 (see https://gymnasium.farama.org/content/migration-guide/ for more information).
rickstaa authored Jun 22, 2023
1 parent 2a71272 commit 9873a03
Showing 38 changed files with 274 additions and 274 deletions.
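
For readers migrating their own environments, the sketch below (not part of the commit; the environment name is only an example) summarizes the Gymnasium API conventions that the changed files below assume:

```python
import gymnasium as gym

# Gymnasium-style API, as assumed throughout the updated code below.
env = gym.make("Pendulum-v1")  # example environment, not one used by this repository

# Seeding is done through `reset`, which returns (obs, info) instead of just obs;
# the old `env.seed(seed)` call no longer exists.
obs, info = env.reset(seed=0)

# `step` returns five values: the old `done` flag is split into `terminated`
# (true terminal state) and `truncated` (e.g. time-limit cut-off).
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated

env.close()
```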
4 changes: 2 additions & 2 deletions README.md
@@ -10,8 +10,8 @@

Welcome to the Bayesian Learning Control (BLC) framework! The Bayesian Learning Control framework enables you to automatically create, train and deploy various safe (stable and robust) Reinforcement Learning (RL) and Imitation learning (IL) control algorithms directly from real-world data. This framework is made up of four main modules:

* [Modeling](./bayesian_learning_control/modeling): Module that uses state of the art System Identification and State Estimation techniques to create an Openai gym environment out of real data.
* [Control](./bayesian_learning_control/control): Module used to train several [Bayesian Learning Control](https://rickstaa.github.io/bayesian-learning-control/control/control.html) RL/IL agents on the built gym environments.
* [Modeling](./bayesian_learning_control/modeling): Module that uses state of the art System Identification and State Estimation techniques to create a [gymnasium environment](https://gymnasium.farama.org/) out of real data.
* [Control](./bayesian_learning_control/control): Module used to train several [Bayesian Learning Control](https://rickstaa.github.io/bayesian-learning-control/control/control.html) RL/IL agents on the built [gymnasium](https://gymnasium.farama.org/) environments.
* [Hardware](./bayesian_learning_control/hardware): Module that can be used to deploy the trained RL/IL agents onto the hardware of your choice.

This framework follows a code structure similar to the [Spinningup](https://spinningup.openai.com/en/latest/) educational package. By doing this, we hope to make it easier for new researchers to get started with our algorithms. If you are new to RL, you are therefore highly encouraged to first check out the SpinningUp documentation and play with it before diving into our codebase. Our implementation sometimes deviates from the [Spinningup](https://spinningup.openai.com/en/latest/) version to increase code maintainability, extensibility and readability.
2 changes: 1 addition & 1 deletion bayesian_learning_control/control/README.md
@@ -5,7 +5,7 @@ The following algorithms are implemented in the Bayesian Learning Control packag
* [Soft Actor-Critic (SAC)](https://rickstaa.github.io/bayesian-learning-control/control/algorithms/sac.html)
* [Lyapunov Actor-Critic (LAC)](https://rickstaa.github.io/bayesian-learning-control/control/algorithms/lac.html)

They are all implemented with [MLP](https://en.wikipedia.org/wiki/Multilayer_perceptron) (non-recurrent) actor-critics, making them suitable for fully-observed, non-image-based RL environments, e.g. the [Gym Mujoco](https://gym.openai.com/envs/#mujoco) environments.
They are all implemented with [MLP](https://en.wikipedia.org/wiki/Multilayer_perceptron) (non-recurrent) actor-critics, making them suitable for fully-observed, non-image-based RL environments, e.g. the [gymnasium Mujoco](https://gymnasium.farama.org/environments/mujoco/) environments.

Bayesian Learning Control has two implementations for each algorithm: one that uses [PyTorch](https://pytorch.org/) as the neural network library, and one that uses [Tensorflow v2](https://www.tensorflow.org/). The default backend is [PyTorch](https://pytorch.org). Please run the `pip install .[tf]` command if you want to use
the [Tensorflow v2](https://www.tensorflow.org/) implementations.
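
As a rough usage sketch (assumptions: the module path mirrors the repository layout shown below, and the `stable_gym` package providing `Oscillator-v1` is installed), training the PyTorch LAC implementation could look like this:

```python
import gymnasium as gym

# Assumed import path, derived from
# bayesian_learning_control/control/algos/pytorch/lac/lac.py.
from bayesian_learning_control.control.algos.pytorch.lac.lac import lac

# `lac` expects a function that constructs the environment. The "module:EnvId"
# form makes gymnasium import `stable_gym` before looking up the environment.
# All other hyperparameters are left at their defaults in this sketch.
lac(env_fn=lambda: gym.make("stable_gym:Oscillator-v1"))
```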
78 changes: 41 additions & 37 deletions bayesian_learning_control/control/algos/pytorch/lac/lac.py
@@ -28,7 +28,7 @@
from copy import deepcopy
from pathlib import Path

import gym
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
@@ -121,10 +121,10 @@ def __init__( # noqa: C901
"""Lyapunov (soft) Actor-Critic (LAC)
Args:
env (:obj:`gym.env`): The gym environment the LAC is training in. This is
env (:obj:`gym.env`): The gymnasium environment the LAC is training in. This is
used to retrieve the activation and observation space dimensions. This
is used while creating the network sizes. The environment must satisfy
the OpenAI Gym API.
the gymnasium API.
actor_critic (torch.nn.Module, optional): The constructor method for a
Torch Module with an ``act`` method, a ``pi`` module and several
``Q`` or ``L`` modules. The ``act`` method and ``pi`` module should
@@ -210,10 +210,10 @@ def __init__( # noqa: C901
k: v for k, v in locals().items() if k not in ["self", "__class__", "env"]
}

# Validate gym env
# Validate gymnasium env
# NOTE: The current implementation only works with continuous spaces.
if not is_gym_env(env):
raise ValueError("Env must be a valid Gym environment.")
raise ValueError("Env must be a valid gymnasium environment.")
if is_discrete_space(env.action_space) or is_discrete_space(
env.observation_space
):
@@ -829,7 +829,7 @@ def lac( # noqa: C901
Args:
env_fn: A function which creates a copy of the environment.
The environment must satisfy the OpenAI Gym API.
The environment must satisfy the gymnasium API.
actor_critic (torch.nn.Module, optional): The constructor method for a
Torch Module with an ``act`` method, a ``pi`` module and several
``Q`` or ``L`` modules. The ``act`` method and ``pi`` module should
@@ -956,6 +956,32 @@ def lac( # noqa: C901

validate_args(**locals())

env = env_fn()

# Validate gymnasium env
# NOTE: The current implementation only works with continuous spaces.
if not is_gym_env(env):
raise ValueError("Env must be a valid gymnasium environment.")
if is_discrete_space(env.action_space) or is_discrete_space(env.observation_space):
raise NotImplementedError(
"The LAC algorithm does not yet support discrete observation/action "
"spaces. Please open a feature/pull request on "
"https://github.com/rickstaa/bayesian-learning-control/issues if you "
"need this."
)

env = gym.wrappers.FlattenObservation(
env
) # NOTE: Done to make sure the alg works with dict observation spaces
if num_test_episodes != 0:
test_env = env_fn()
test_env = gym.wrappers.FlattenObservation(test_env)
obs_dim = env.observation_space.shape
act_dim = env.action_space.shape[0]
rew_dim = (
env.reward_range.shape[0] if isinstance(env.reward_range, gym.spaces.Box) else 1
)

logger_kwargs["verbose_vars"] = (
logger_kwargs["verbose_vars"]
if (
@@ -980,19 +1006,6 @@
} # Retrieve hyperparameters (Ignore logger object)
logger.save_config(hyper_paramet_dict) # Write hyperparameters to logger

env = env_fn()
env = gym.wrappers.FlattenObservation(
env
) # NOTE: Done to make sure the alg works with dict observation spaces
if num_test_episodes != 0:
test_env = env_fn()
test_env = gym.wrappers.FlattenObservation(test_env)
obs_dim = env.observation_space.shape
act_dim = env.action_space.shape[0]
rew_dim = (
env.reward_range.shape[0] if isinstance(env.reward_range, gym.spaces.Box) else 1
)

# Retrieve max episode length
if max_ep_len is None:
max_ep_len = env.env._max_episode_steps
@@ -1020,9 +1033,6 @@ def lac( # noqa: C901
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
env.seed(seed)
if num_test_episodes != 0:
test_env.seed(seed)

policy = LAC(
env,
@@ -1134,7 +1144,8 @@ def lac( # noqa: C901

# Main loop: collect experience in env and update/log each epoch
start_time = time.time()
o, ep_ret, ep_len = env.reset(), 0, 0
o, _ = env.reset()
ep_ret, ep_len = 0, 0
for t in range(total_steps):
# Until start_steps have elapsed, randomly sample actions
# from a uniform distribution for better exploration. Afterwards,
@@ -1145,24 +1156,20 @@
a = env.action_space.sample()

# Take step in the env
o_, r, d, _ = env.step(a)
o_, r, d, truncated, _ = env.step(a)
ep_ret += r
ep_len += 1

# Ignore the "done" signal if it comes from hitting the time
# horizon (that is, when it's an artificial terminal signal
# that isn't based on the agent's state)
d = False if ep_len == max_ep_len else d

replay_buffer.store(o, a, r, o_, d)

# Make sure to update most recent observation!
o = o_

# End of trajectory handling
if d or (ep_len == max_ep_len):
if d or truncated:
logger.store(EpRet=ep_ret, EpLen=ep_len)
o, ep_ret, ep_len = env.reset(), 0, 0
o, _ = env.reset()
ep_ret, ep_len = 0, 0

# Update handling
if (t + 1) >= update_after and ((t + 1) - update_after) % update_every == 0:
@@ -1316,18 +1323,15 @@ def lac( # noqa: C901


if __name__ == "__main__":
# NOTE: You can import your custom gym environment here.
# import stable_gym # noqa: F401

parser = argparse.ArgumentParser(
description="Trains a LAC agent in a given environment."
)
parser.add_argument(
"--env",
type=str,
default="Oscillator-v1",
help="the gym env (default: Oscillator-v1)",
)
default="stable_gym:Oscillator-v1",
help="the gymnasium env (default: stable_gym:Oscillator-v1)",
) # NOTE: Ensure the environment is installed in the current python environment.
parser.add_argument(
"--hid_a",
type=int,
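
The `gym.wrappers.FlattenObservation` call added in the hunk above keeps the algorithm usable with dictionary observation spaces. A small self-contained sketch of what the wrapper does (toy environment for illustration, not part of this commit):

```python
import gymnasium as gym
from gymnasium import spaces
from gymnasium.wrappers import FlattenObservation


class DictObsEnv(gym.Env):
    """Toy environment with a dictionary observation space."""

    observation_space = spaces.Dict(
        {
            "pos": spaces.Box(-1.0, 1.0, shape=(2,)),
            "vel": spaces.Box(-1.0, 1.0, shape=(3,)),
        }
    )
    action_space = spaces.Box(-1.0, 1.0, shape=(1,))

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        return self.observation_space.sample(), 0.0, False, False, {}


env = FlattenObservation(DictObsEnv())
print(env.observation_space.shape)  # (5,) -> usable as `obs_dim` for the MLP networks
```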
Original file line number Diff line number Diff line change
@@ -48,8 +48,8 @@ def __init__(
network object.
Args:
observation_space (:obj:`gym.space.box.Box`): A gym observation space.
action_space (:obj:`gym.space.box.Box`): A gym action space.
observation_space (:obj:`gym.space.box.Box`): A gymnasium observation space.
action_space (:obj:`gym.space.box.Box`): A gymnasium action space.
hidden_sizes (Union[dict, tuple, list], optional): Sizes of the hidden
layers for the actor. Defaults to ``(256, 256)``.
activation (Union[:obj:`dict`, :obj:`torch.nn.modules.activation`], optional):
Original file line number Diff line number Diff line change
@@ -45,8 +45,8 @@ def __init__(
object.
Args:
observation_space (:obj:`gym.space.box.Box`): A gym observation space.
action_space (:obj:`gym.space.box.Box`): A gym action space.
observation_space (:obj:`gym.space.box.Box`): A gymnasium observation space.
action_space (:obj:`gym.space.box.Box`): A gymnasium action space.
hidden_sizes (Union[dict, tuple, list], optional): Sizes of the hidden
layers for the actor. Defaults to ``(256, 256)``.
activation (Union[:obj:`dict`, :obj:`torch.nn.modules.activation`], optional):
78 changes: 41 additions & 37 deletions bayesian_learning_control/control/algos/pytorch/sac/sac.py
@@ -30,7 +30,7 @@
from copy import deepcopy
from pathlib import Path

import gym
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
@@ -118,10 +118,10 @@ def __init__( # noqa: C901
"""Soft Actor-Critic (SAC)
Args:
env (:obj:`gym.env`): The gym environment the SAC is training in. This is
env (:obj:`gym.env`): The gymnasium environment the SAC is training in. This is
used to retrieve the activation and observation space dimensions. This
is used while creating the network sizes. The environment must satisfy
the OpenAI Gym API.
the gymnasium API.
actor_critic (torch.nn.Module, optional): The constructor method for a
Torch Module with an ``act`` method, a ``pi`` module and several
``Q`` or ``L`` modules. The ``act`` method and ``pi`` module should
@@ -203,10 +203,10 @@ def __init__( # noqa: C901
k: v for k, v in locals().items() if k not in ["self", "__class__", "env"]
}

# Validate gym env
# Validate gymnasium env
# NOTE: The current implementation only works with continuous spaces.
if not is_gym_env(env):
raise ValueError("Env must be a valid Gym environment.")
raise ValueError("Env must be a valid gymnasium environment.")
if is_discrete_space(env.action_space) or is_discrete_space(
env.observation_space
):
@@ -770,7 +770,7 @@ def sac( # noqa: C901
Args:
env_fn: A function which creates a copy of the environment.
The environment must satisfy the OpenAI Gym API.
The environment must satisfy the gymnasium API.
actor_critic (torch.nn.Module, optional): The constructor method for a
Torch Module with an ``act`` method, a ``pi`` module and several
``Q`` or ``L`` modules. The ``act`` method and ``pi`` module should
@@ -893,6 +893,32 @@ def sac( # noqa: C901

validate_args(**locals())

env = env_fn()

# Validate gymnasium env
# NOTE: The current implementation only works with continuous spaces.
if not is_gym_env(env):
raise ValueError("Env must be a valid gymnasium environment.")
if is_discrete_space(env.action_space) or is_discrete_space(env.observation_space):
raise NotImplementedError(
"The SAC algorithm does not yet support discrete observation/action "
"spaces. Please open a feature/pull request on "
"https://github.com/rickstaa/bayesian-learning-control/issues if you "
"need this."
)

env = gym.wrappers.FlattenObservation(
env
) # NOTE: Done to make sure the alg works with dict observation spaces
if num_test_episodes != 0:
test_env = env_fn()
test_env = gym.wrappers.FlattenObservation(test_env)
obs_dim = env.observation_space.shape
act_dim = env.action_space.shape[0]
rew_dim = (
env.reward_range.shape[0] if isinstance(env.reward_range, gym.spaces.Box) else 1
)

logger_kwargs["verbose_vars"] = (
logger_kwargs["verbose_vars"]
if (
@@ -917,19 +943,6 @@ def sac( # noqa: C901
} # Retrieve hyperparameters (Ignore logger object)
logger.save_config(hyper_paramet_dict) # Write hyperparameters to logger

env = env_fn()
env = gym.wrappers.FlattenObservation(
env
) # NOTE: Done to make sure the alg works with dict observation spaces
if num_test_episodes != 0:
test_env = env_fn()
test_env = gym.wrappers.FlattenObservation(test_env)
obs_dim = env.observation_space.shape
act_dim = env.action_space.shape[0]
rew_dim = (
env.reward_range.shape[0] if isinstance(env.reward_range, gym.spaces.Box) else 1
)

# Retrieve max episode length
if max_ep_len is None:
max_ep_len = env.env._max_episode_steps
@@ -957,9 +970,6 @@ def sac( # noqa: C901
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
env.seed(seed)
if num_test_episodes != 0:
test_env.seed(seed)

policy = SAC(
env,
@@ -1052,7 +1062,8 @@ def sac( # noqa: C901

# Main loop: collect experience in env and update/log each epoch
start_time = time.time()
o, ep_ret, ep_len = env.reset(), 0, 0
o, _ = env.reset()
ep_ret, ep_len = 0, 0
for t in range(total_steps):
# Until start_steps have elapsed, randomly sample actions
# from a uniform distribution for better exploration. Afterwards,
@@ -1063,24 +1074,20 @@
a = env.action_space.sample()

# Take step in the env
o_, r, d, _ = env.step(a)
o_, r, d, truncated, _ = env.step(a)
ep_ret += r
ep_len += 1

# Ignore the "done" signal if it comes from hitting the time
# horizon (that is, when it's an artificial terminal signal
# that isn't based on the agent's state)
d = False if ep_len == max_ep_len else d

replay_buffer.store(o, a, r, o_, d)

# Make sure to update most recent observation!
o = o_

# End of trajectory handling
if d or (ep_len == max_ep_len):
if d or truncated:
logger.store(EpRet=ep_ret, EpLen=ep_len)
o, ep_ret, ep_len = env.reset(), 0, 0
o, _ = env.reset()
ep_ret, ep_len = 0, 0

# Update handling
if (t + 1) >= update_after and ((t + 1) - update_after) % update_every == 0:
@@ -1219,18 +1226,15 @@ def sac( # noqa: C901


if __name__ == "__main__":
# NOTE: You can import your custom gym environment here.
# import stable_gym # noqa: F401

parser = argparse.ArgumentParser(
description="Trains a SAC agent in a given environment."
)
parser.add_argument(
"--env",
type=str,
default="Oscillator-v1",
help="the gym env (default: Oscillator-v1)",
)
default="stable_gym:Oscillator-v1",
help="the gymnasium env (default: stable_gym:Oscillator-v1)",
) # NOTE: Ensure the environment is installed in the current python environment.
parser.add_argument(
"--hid_a",
type=int,
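
In both the LAC and SAC training loops above, the manual time-horizon trick (`d = False if ep_len == max_ep_len else d`) is dropped in favour of Gymnasium's explicit `truncated` signal. A minimal, runnable sketch of that pattern (illustrative only; a random policy and a plain deque stand in for the package's agent and replay buffer):

```python
from collections import deque

import gymnasium as gym

env = gym.make("Pendulum-v1")  # example environment, not one used by this repository
replay_buffer = deque(maxlen=10_000)
total_steps = 1_000

obs, _ = env.reset(seed=0)
ep_ret, ep_len = 0, 0
for t in range(total_steps):
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, _ = env.step(action)
    ep_ret += reward
    ep_len += 1

    # Only `terminated` marks a real terminal state; time-limit cut-offs are
    # reported separately via `truncated`, so transitions are stored with the
    # terminal flag only and no manual bootstrapping trick is needed.
    replay_buffer.append((obs, action, reward, next_obs, terminated))
    obs = next_obs

    if terminated or truncated:
        obs, _ = env.reset()
        ep_ret, ep_len = 0, 0
env.close()
```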
