Update howtos (#281)
* Update howtos

* Remove unwanted warning

* Add section title
belerico authored May 9, 2024
1 parent 96040b1 commit bf1483f
Showing 6 changed files with 132 additions and 24 deletions.
13 changes: 13 additions & 0 deletions howto/add_environment.md
@@ -14,6 +14,19 @@ The main properties/methods that the environment has to provide are the following
>
> All the observations returned by the `step` and `reset` functions must be Python dictionaries of NumPy arrays.
## About observation and action spaces

> [!NOTE]
>
> Please remember that every environment is considered independent of any other and is supposed to interact with a single agent, since Multi-Agent Reinforcement Learning (MARL) is not currently supported.

The currently supported observation shapes are:

* 1D vector: everything that is a 1D vector is processed by the agent with an MLP.
* 2D/3D images: everything that is not a 1D vector is processed by the agent with a CNN. A 2D image, or a 3D image of shape `[H,W,1]` or `[1,H,W]`, is treated as a grayscale image; anything else is treated as a multi-channel image.

An action of type `gymnasium.spaces.Box` must be of shape `(n,)`, where `n` is the number of (possibly continuous) actions the environment supports.
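
As a minimal sketch (not taken from the codebase: the `CounterEnv` name, its spaces, and its reward are made up for illustration), an environment satisfying these constraints could look like the following:

```python
import gymnasium as gym
import numpy as np


class CounterEnv(gym.Env):
    """Toy single-agent environment returning a dictionary of numpy arrays."""

    def __init__(self):
        self.observation_space = gym.spaces.Dict(
            {
                # 1D vector: processed by the agent with an MLP
                "state": gym.spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32),
                # [1, H, W]: treated as a grayscale image and processed with a CNN
                "frame": gym.spaces.Box(low=0, high=255, shape=(1, 64, 64), dtype=np.uint8),
            }
        )
        # Continuous actions: a Box of shape (n,), here n = 2
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self._t = 0

    def _get_obs(self):
        return {
            "state": np.zeros(4, dtype=np.float32),
            "frame": np.zeros((1, 64, 64), dtype=np.uint8),
        }

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        return self._get_obs(), {}

    def step(self, action):
        self._t += 1
        reward = -float(np.linalg.norm(action))
        terminated = False
        truncated = self._t >= 100  # episode truncated after 100 policy steps
        return self._get_obs(), reward, terminated, truncated, {}
```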

# Add a new Environment
There are two ways to add a new environment:
1. Create from scratch a custom environment by inheriting from the [`gymnasium.Env`](https://gymnasium.farama.org/api/env/#gymnasium-env) class.
121 changes: 105 additions & 16 deletions howto/configs.md
@@ -14,19 +14,26 @@ This document explains how the configuration files and folders are structured.
```tree
sheeprl/configs
├── algo
│ ├── a2c.yaml
│ ├── default.yaml
│ ├── dreamer_v1.yaml
│ ├── dreamer_v2.yaml
│ ├── dreamer_v3_L.yaml
│ ├── dreamer_v3_M.yaml
│ ├── dreamer_v3_S.yaml
│ ├── dreamer_v3_XL.yaml
│ ├── dreamer_v3_XS.yaml
│ ├── dreamer_v3.yaml
│ ├── droq.yaml
│ ├── p2e_dv1.yaml
│ ├── p2e_dv2.yaml
│   ├── p2e_dv3.yaml
│   ├── ppo_decoupled.yaml
│   ├── ppo_recurrent.yaml
│   ├── ppo.yaml
│   ├── sac_ae.yaml
│   ├── sac_decoupled.yaml
│   └── sac.yaml
├── buffer
│ └── default.yaml
├── checkpoint
@@ -44,40 +51,86 @@ sheeprl/configs
│ ├── gym.yaml
│ ├── minecraft.yaml
│ ├── minedojo.yaml
│   ├── minerl_obtain_diamond.yaml
│   ├── minerl_obtain_iron_pickaxe.yaml
│   ├── minerl.yaml
│ ├── mujoco.yaml
│ └── super_mario_bros.yaml
├── env_config.yaml
├── eval_config.yaml
├── exp
│ ├── a2c_benchmarks.yaml
│ ├── a2c.yaml
│ ├── default.yaml
│ ├── dreamer_v1_benchmarks.yaml
│ ├── dreamer_v1.yaml
│   ├── dreamer_v2_benchmarks.yaml
│   ├── dreamer_v2_crafter.yaml
│   ├── dreamer_v2_ms_pacman.yaml
│   ├── dreamer_v2.yaml
│ ├── dreamer_v3_100k_boxing.yaml
│ ├── dreamer_v3_100k_ms_pacman.yaml
│   ├── dreamer_v3_benchmarks.yaml
│   ├── dreamer_v3_dmc_cartpole_swingup_sparse.yaml
│   ├── dreamer_v3_dmc_walker_walk.yaml
│   ├── dreamer_v3_L_doapp_128px_gray_combo_discrete.yaml
│   ├── dreamer_v3_L_doapp.yaml
│   ├── dreamer_v3_L_navigate.yaml
│   ├── dreamer_v3_super_mario_bros.yaml
│   ├── dreamer_v3_XL_crafter.yaml
│   ├── dreamer_v3.yaml
│ ├── droq.yaml
│   ├── p2e_dv1_exploration.yaml
│   ├── p2e_dv1_finetuning.yaml
│   ├── p2e_dv2_exploration.yaml
│   ├── p2e_dv2_finetuning.yaml
│   ├── p2e_dv3_expl_L_doapp_128px_gray_combo_discrete_15Mexpl_20Mstps.yaml
│   ├── p2e_dv3_exploration.yaml
│   ├── p2e_dv3_finetuning.yaml
│   ├── p2e_dv3_fntn_L_doapp_64px_gray_combo_discrete_5Mstps.yaml
│   ├── ppo_benchmarks.yaml
│   ├── ppo_decoupled.yaml
│   ├── ppo_recurrent.yaml
│   ├── ppo_super_mario_bros.yaml
│   ├── ppo.yaml
│   ├── sac_ae.yaml
│   ├── sac_benchmarks.yaml
│   ├── sac_decoupled.yaml
│   └── sac.yaml
├── fabric
│ ├── ddp-cpu.yaml
│ ├── ddp-cuda.yaml
│ └── default.yaml
├── hydra
│ └── default.yaml
├── __init__.py
├── logger
│ ├── mlflow.yaml
│ └── tensorboard.yaml
├── metric
│ └── default.yaml
├── model_manager
│ ├── a2c.yaml
│ ├── default.yaml
│ ├── dreamer_v1.yaml
│ ├── dreamer_v2.yaml
│ ├── dreamer_v3.yaml
│ ├── droq.yaml
│ ├── p2e_dv1_exploration.yaml
│ ├── p2e_dv1_finetuning.yaml
│ ├── p2e_dv2_exploration.yaml
│ ├── p2e_dv2_finetuning.yaml
│ ├── p2e_dv3_exploration.yaml
│ ├── p2e_dv3_finetuning.yaml
│ ├── ppo_recurrent.yaml
│ ├── ppo.yaml
│ ├── sac_ae.yaml
│ └── sac.yaml
├── model_manager_config.yaml
└── optim
├── adam.yaml
├── rmsprop_tf.yaml
├── rmsprop.yaml
└── sgd.yaml
```

@@ -102,24 +155,56 @@ defaults:
- env: default.yaml
- fabric: default.yaml
- metric: default.yaml
- model_manager: default.yaml
- hydra: default.yaml
- exp: ???

num_threads: 1
float32_matmul_precision: "high"

# Set it to True to run a single optimization step
dry_run: False

# Reproducibility
seed: 42
torch_deterministic: False

# For more information about reproducibility in PyTorch, see https://pytorch.org/docs/stable/notes/randomness.html

# torch.use_deterministic_algorithms() lets you configure PyTorch to use deterministic algorithms
# instead of nondeterministic ones where available,
# and to throw an error if an operation is known to be nondeterministic (and without a deterministic alternative).
torch_use_deterministic_algorithms: False

# Disabling the benchmarking feature with torch.backends.cudnn.benchmark = False
# causes cuDNN to deterministically select an algorithm, possibly at the cost of reduced performance.
# However, if you do not need reproducibility across multiple executions of your application,
# then performance might improve if the benchmarking feature is enabled with torch.backends.cudnn.benchmark = True.
torch_backends_cudnn_benchmark: True

# While disabling CUDA convolution benchmarking (discussed above) ensures that CUDA selects the same algorithm each time an application is run,
# that algorithm itself may be nondeterministic, unless either torch.use_deterministic_algorithms(True)
# or torch.backends.cudnn.deterministic = True is set.
# The latter setting controls only this behavior,
# unlike torch.use_deterministic_algorithms() which will make other PyTorch operations behave deterministically, too.
torch_backends_cudnn_deterministic: False

# From: https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility
# By design, all cuBLAS API routines from a given toolkit version, generate the same bit-wise results at every run
# when executed on GPUs with the same architecture and the same number of SMs.
# However, bit-wise reproducibility is not guaranteed across toolkit versions
# because the implementation might differ due to some implementation changes.
# This guarantee holds when a single CUDA stream is active only.
# If multiple concurrent streams are active, the library may optimize total performance by picking different internal implementations.
cublas_workspace_config: null # Possible values are: ":4096:8" or ":16:8"

# Output folders
exp_name: ${algo.name}_${env.id}
run_name: ${now:%Y-%m-%d_%H-%M-%S}_${exp_name}_${seed}
root_dir: ${algo.name}/${env.id}
```
By default we want the user to specify the experiment config, represented by `- exp: ???` in the above example. The three question marks tell Hydra that an `exp` config must be specified at runtime by the user (e.g. `sheeprl.py exp=dreamer_v3`); all the available experiment configs can be found in the `sheeprl/configs/exp/` folder.
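
As a side note on the reproducibility-related keys shown above, here is a hedged sketch of the PyTorch calls they roughly correspond to (where and how SheepRL actually applies them may differ):

```python
import os

import torch

# Values mirror the default config above
torch.set_num_threads(1)
torch.set_float32_matmul_precision("high")
torch.manual_seed(42)
torch.use_deterministic_algorithms(False)
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False

# cuBLAS reproducibility workspace: must be set before the first CUDA call
cublas_workspace_config = None  # e.g. ":4096:8" or ":16:8"
if cublas_workspace_config is not None:
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = cublas_workspace_config
```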

### Algorithms

In the `algo` folder one can find all the configurations for every algorithm implemented in sheeprl. Those configs contain all the hyperparameters specific to a particular algorithm. Let us have a look at the `dreamer_v3.yaml` config for example:
@@ -427,9 +512,13 @@ Given this config, one can easily run an experiment to test the Dreamer-V3 algorithm
python sheeprl.py exp=dreamer_v3_100k_ms_pacman
```

> [!WARNING]
>
> The default hyperparameters defined in the configs gathered by the experiment config (in this example, those defined by `sheeprl/configs/exp/dreamer_v3.yaml`, `sheeprl/configs/env/atari.yaml` and all the configs they pull in) are overwritten by the values in the current config whenever a naming collision happens, i.e. when the same field is defined in both configurations. Such collisions are resolved by keeping the value defined in the current config. This behaviour is obtained by letting the `_self_` keyword be the last one in the `defaults` list.
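
As a rough illustration of this merge behaviour (Hydra composes configs on top of OmegaConf, where later values win; the field and values below are made up):

```python
from omegaconf import OmegaConf

# Hypothetical values: the defaults gathered by the experiment config...
gathered_defaults = OmegaConf.create({"algo": {"learning_starts": 65536}})
# ...and the current config, merged last (as `_self_` placed last implies)
current_config = OmegaConf.create({"algo": {"learning_starts": 1024}})

merged = OmegaConf.merge(gathered_defaults, current_config)
print(merged.algo.learning_starts)  # 1024: the current config wins the collision
```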

### Fabric

These configurations control the parameters to be passed to the [Fabric object](https://lightning.ai/docs/fabric/stable/api/generated/lightning.fabric.fabric.Fabric.html#lightning.fabric.fabric.Fabric). With those one can control whether to run the experiments on multiple devices, on which accelerator and with which precision. For more information please have a look at the [Lightning documentation page](https://lightning.ai/docs/fabric/stable/api/fabric_args.html#).
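
For orientation only (a sketch with made-up values, not the exact code path used by SheepRL), these fields end up as keyword arguments of the `Fabric` constructor:

```python
from lightning.fabric import Fabric

# Hypothetical values mirroring a config such as fabric/ddp-cuda.yaml
fabric = Fabric(
    accelerator="cuda",      # which accelerator to run on
    devices=2,               # how many devices to use
    strategy="ddp",          # distributed strategy
    precision="bf16-mixed",  # numerical precision
)
fabric.launch()
```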

### Hydra

4 changes: 4 additions & 0 deletions howto/learn_in_minedojo.md
@@ -25,6 +25,10 @@ Now, you can install the MineDojo environment:
pip install -e .[minedojo]
```

> [!WARNING]
>
> If you run into any problems during the installation due to missing files that fail to download, please have a look at [this issue](https://github.com/MineDojo/MineDojo/issues/113).
## MineDojo environments
> [!NOTE]
>
4 changes: 4 additions & 0 deletions howto/learn_in_minerl.md
@@ -17,6 +17,10 @@ Now, you can install the MineRL environment:
pip install -e .[minerl]
```

> [!WARNING]
>
> If you run into any problems during the installation due to missing files that fail to download, please have a look at [this issue](https://github.com/MineDojo/MineDojo/issues/113).
## MineRL environments
We have modified the MineRL environments to have a custom action and observation space. We provide three different tasks:
1. Navigate: you need to set the `env.id` argument to `custom_navigate`.
1 change: 1 addition & 0 deletions howto/register_external_algorithm.md
@@ -387,6 +387,7 @@ def build_agent(
for agent_p, player_p in zip(agent.critic.parameters(), player.critic.parameters()):
player_p.data = agent_p.data
return agent, player
```

## Loss functions

13 changes: 5 additions & 8 deletions howto/work_with_steps.md
@@ -7,7 +7,7 @@ We start from the concept of *policy step*: a policy step is the particular step

> [!NOTE]
>
> The environment step is the step performed by the environment: the environment takes an action as input and computes the next observation and the next reward. This means that environment steps also take into account the **action repeat**, i.e. a value greater than or equal to 1 that specifies how many times an action has to be played (repeated by the environment), independently of the observations received.

Now that we have introduced the concept of *policy step*, it is necessary to clarify some aspects:

@@ -18,17 +18,14 @@ In general, if we have $n$ parallel processes, each one with $m$ independent environments

The hyper-parameters that refer to the *policy steps* are:

* `total_steps`: the total number of policy steps to perform in an experiment. Effectively, this number is divided by $n \cdot m$ in each process to obtain the number of iteration steps each of them has to perform (see the sketch after this list).
* `exploration_steps`: the number of policy steps in which the agent explores the environment in the P2E algorithms.
* `max_episode_steps`: the maximum number of policy steps an episode can last (`max_steps`); when this number is reached a `truncated=True` is returned by the environment. This means that if you decide to have an action repeat greater than one (`action_repeat > 1`), then the environment performs at most `env_steps = max_steps * action_repeat` environment steps per episode.
* `learning_starts`: how many policy steps the agent has to perform before the training starts. During the first `learning_starts` steps the buffer is pre-filled with random actions sampled from the environment's action space.
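
A small numeric sketch of how these quantities relate (all numbers are made up for illustration):

```python
# Hypothetical experiment: 2 processes, 4 environments each, action repeat of 4
n, m = 2, 4
action_repeat = 4

total_steps = 1_000_000                          # policy steps for the whole experiment
iterations_per_process = total_steps // (n * m)  # 125_000 iteration steps per process

max_episode_steps = 500                                    # policy steps before truncation
env_steps_per_episode = max_episode_steps * action_repeat  # up to 2_000 environment steps
```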

## Gradient steps
A *gradient step* consists of an update of the parameters of the agent, i.e., a call of the *train* function. The number of gradient steps is proportional to the number of parallel processes: if there are $n$ parallel processes, `n * per_rank_gradient_steps` calls to the *train* method are executed.

The hyper-parameters which refer to the *gradient steps* are:
* `algo.per_rank_pretrain_steps`: the number of gradient steps per rank to perform in the first iteration.

* `algo.replay_ratio`: the ratio between the gradient steps and the policy steps played by the agent. The higher the replay ratio, the more sample-efficient the agent should be. The replay ratio is a global hyper-parameter that affects only the off-policy algorithms, such as SAC or Dreamer, and must be a float greater than zero. For example, a replay ratio of 0.5 means that the agent performs 1 gradient step every 2 policy steps. **The replay ratio accounts for neither the environment's action repeat nor `algo.learning_starts`** (see the sketch right after this list).
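
For instance (illustrative numbers only):

```python
# A replay ratio of 0.5 means one gradient step every two policy steps
replay_ratio = 0.5
policy_steps_played = 10_000  # policy steps counted for the ratio
gradient_steps = int(replay_ratio * policy_steps_played)  # 5_000 gradient steps
```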
