diff --git a/howto/add_environment.md b/howto/add_environment.md
index ecc1790b..3eb26287 100644
--- a/howto/add_environment.md
+++ b/howto/add_environment.md
@@ -14,6 +14,19 @@ The main properties/methods that the environment has to provide are the followin
 >
 > All the observations returned by the `step` and `reset` functions must be python dictionary of numpy arrays.
 
+## About observation and action spaces
+
+> [!NOTE]
+>
+> Please remember that every environment is considered independent of any other and is supposed to interact with a single agent, since Multi-Agent Reinforcement Learning (MARL) is not currently supported.
+
+The observation shapes currently supported are:
+
+* 1D vectors: any 1D-vector observation will be processed by the agent with an MLP.
+* 2D/3D images: any observation that is not a 1D vector will be processed by the agent with a CNN. A 2D image, or a 3D image of shape `[H,W,1]` or `[1,H,W]`, is considered a grayscale image; any other 3D image is considered a multi-channel image.
+
+An action of type `gymnasium.spaces.Box` must be of shape `(n,)`, where `n` is the number of (possibly continuous) actions the environment supports.
+
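+To make these constraints concrete, here is a minimal sketch of an observation space and an action space that satisfy them (the keys, shapes, and bounds below are purely illustrative):
+
+```python
+import numpy as np
+from gymnasium import spaces
+
+# A dictionary observation with a 1D vector (processed by an MLP by the agent)
+# and a 3-channel 64x64 image (processed by a CNN by the agent).
+observation_space = spaces.Dict(
+    {
+        "state": spaces.Box(low=-np.inf, high=np.inf, shape=(8,), dtype=np.float32),
+        "rgb": spaces.Box(low=0, high=255, shape=(3, 64, 64), dtype=np.uint8),
+    }
+)
+
+# A continuous action space of shape (n,), here with n=4 actions.
+action_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
+```
+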
 # Add a new Environment
 There are two ways to add a new environment:
 1. Create from scratch a custom environment by inheriting from the [`gymnasium.Env`](https://gymnasium.farama.org/api/env/#gymnasium-env) class.
diff --git a/howto/configs.md b/howto/configs.md
index 6a6d096c..3c91fc5d 100644
--- a/howto/configs.md
+++ b/howto/configs.md
@@ -14,19 +14,26 @@ This document explains how the configuration files and folders are structured. I
 ```tree
 sheeprl/configs
 ├── algo
+│ ├── a2c.yaml
 │ ├── default.yaml
 │ ├── dreamer_v1.yaml
 │ ├── dreamer_v2.yaml
+│ ├── dreamer_v3_L.yaml
+│ ├── dreamer_v3_M.yaml
+│ ├── dreamer_v3_S.yaml
+│ ├── dreamer_v3_XL.yaml
+│ ├── dreamer_v3_XS.yaml
 │ ├── dreamer_v3.yaml
 │ ├── droq.yaml
 │ ├── p2e_dv1.yaml
 │ ├── p2e_dv2.yaml
-│ ├── ppo.yaml
+│ ├── p2e_dv3.yaml
 │ ├── ppo_decoupled.yaml
 │ ├── ppo_recurrent.yaml
-│ ├── sac.yaml
+│ ├── ppo.yaml
 │ ├── sac_ae.yaml
-│ └── sac_decoupled.yaml
+│ ├── sac_decoupled.yaml
+│ └── sac.yaml
 ├── buffer
 │ └── default.yaml
 ├── checkpoint
@@ -44,40 +51,86 @@ sheeprl/configs
 │ ├── gym.yaml
 │ ├── minecraft.yaml
 │ ├── minedojo.yaml
-│ └── minerl.yaml
+│ ├── minerl_obtain_diamond.yaml
+│ ├── minerl_obtain_iron_pickaxe.yaml
+│ ├── minerl.yaml
+│ ├── mujoco.yaml
+│ └── super_mario_bros.yaml
 ├── env_config.yaml
+├── eval_config.yaml
 ├── exp
+│ ├── a2c_benchmarks.yaml
+│ ├── a2c.yaml
 │ ├── default.yaml
+│ ├── dreamer_v1_benchmarks.yaml
 │ ├── dreamer_v1.yaml
-│ ├── dreamer_v2.yaml
+│ ├── dreamer_v2_benchmarks.yaml
+│ ├── dreamer_v2_crafter.yaml
 │ ├── dreamer_v2_ms_pacman.yaml
-│ ├── dreamer_v3.yaml
+│ ├── dreamer_v2.yaml
 │ ├── dreamer_v3_100k_boxing.yaml
 │ ├── dreamer_v3_100k_ms_pacman.yaml
-│ ├── dreamer_v3_L_doapp.yaml
+│ ├── dreamer_v3_benchmarks.yaml
+│ ├── dreamer_v3_dmc_cartpole_swingup_sparse.yaml
+│ ├── dreamer_v3_dmc_walker_walk.yaml
 │ ├── dreamer_v3_L_doapp_128px_gray_combo_discrete.yaml
+│ ├── dreamer_v3_L_doapp.yaml
 │ ├── dreamer_v3_L_navigate.yaml
+│ ├── dreamer_v3_super_mario_bros.yaml
 │ ├── dreamer_v3_XL_crafter.yaml
-│ ├── dreamer_v3_dmc_walker_walk.yaml
+│ ├── dreamer_v3.yaml
 │ ├── droq.yaml
-│ ├── p2e_dv1.yaml
-│ ├── p2e_dv2.yaml
-│ ├── ppo.yaml
+│ ├── p2e_dv1_exploration.yaml
+│ ├── p2e_dv1_finetuning.yaml
+│ ├── p2e_dv2_exploration.yaml
+│ ├── p2e_dv2_finetuning.yaml
+│ ├── p2e_dv3_expl_L_doapp_128px_gray_combo_discrete_15Mexpl_20Mstps.yaml
+│ ├── p2e_dv3_exploration.yaml
+│ ├── p2e_dv3_finetuning.yaml
+│ ├── p2e_dv3_fntn_L_doapp_64px_gray_combo_discrete_5Mstps.yaml
+│ ├── ppo_benchmarks.yaml
 │ ├── ppo_decoupled.yaml
 │ ├── ppo_recurrent.yaml
-│ ├── sac.yaml
+│ ├── ppo_super_mario_bros.yaml
+│ ├── ppo.yaml
 │ ├── sac_ae.yaml
-│ └── sac_decoupled.yaml
+│ ├── sac_benchmarks.yaml
+│ ├── sac_decoupled.yaml
+│ └── sac.yaml
 ├── fabric
 │ ├── ddp-cpu.yaml
 │ ├── ddp-cuda.yaml
 │ └── default.yaml
 ├── hydra
 │ └── default.yaml
+├── __init__.py
+├── logger
+│ ├── mlflow.yaml
+│ └── tensorboard.yaml
 ├── metric
 │ └── default.yaml
+├── model_manager
+│ ├── a2c.yaml
+│ ├── default.yaml
+│ ├── dreamer_v1.yaml
+│ ├── dreamer_v2.yaml
+│ ├── dreamer_v3.yaml
+│ ├── droq.yaml
+│ ├── p2e_dv1_exploration.yaml
+│ ├── p2e_dv1_finetuning.yaml
+│ ├── p2e_dv2_exploration.yaml
+│ ├── p2e_dv2_finetuning.yaml
+│ ├── p2e_dv3_exploration.yaml
+│ ├── p2e_dv3_finetuning.yaml
+│ ├── ppo_recurrent.yaml
+│ ├── ppo.yaml
+│ ├── sac_ae.yaml
+│ └── sac.yaml
+├── model_manager_config.yaml
 └── optim
  ├── adam.yaml
+ ├── rmsprop_tf.yaml
+ ├── rmsprop.yaml
  └── sgd.yaml
 ```
 
@@ -102,24 +155,56 @@ defaults:
   - env: default.yaml
   - fabric: default.yaml
   - metric: default.yaml
+  - model_manager: default.yaml
   - hydra: default.yaml
   - exp: ???
 
 num_threads: 1
+float32_matmul_precision: "high"
 
 # Set it to True to run a single optimization step
 dry_run: False
 
 # Reproducibility
 seed: 42
-torch_deterministic: False
+
+# For more information about reproducibility in PyTorch, see https://pytorch.org/docs/stable/notes/randomness.html
+
+# torch.use_deterministic_algorithms() lets you configure PyTorch to use deterministic algorithms
+# instead of nondeterministic ones where available,
+# and to throw an error if an operation is known to be nondeterministic (and without a deterministic alternative).
+torch_use_deterministic_algorithms: False
+
+# Disabling the benchmarking feature with torch.backends.cudnn.benchmark = False
+# causes cuDNN to deterministically select an algorithm, possibly at the cost of reduced performance.
+# However, if you do not need reproducibility across multiple executions of your application,
+# then performance might improve if the benchmarking feature is enabled with torch.backends.cudnn.benchmark = True.
+torch_backends_cudnn_benchmark: True
+
+# While disabling CUDA convolution benchmarking (discussed above) ensures that CUDA selects the same algorithm each time an application is run,
+# that algorithm itself may be nondeterministic, unless either torch.use_deterministic_algorithms(True)
+# or torch.backends.cudnn.deterministic = True is set.
+# The latter setting controls only this behavior,
+# unlike torch.use_deterministic_algorithms() which will make other PyTorch operations behave deterministically, too.
+torch_backends_cudnn_deterministic: False
+
+# From: https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility
+# By design, all cuBLAS API routines from a given toolkit version, generate the same bit-wise results at every run
+# when executed on GPUs with the same architecture and the same number of SMs.
+# However, bit-wise reproducibility is not guaranteed across toolkit versions
+# because the implementation might differ due to some implementation changes.
+# This guarantee holds when a single CUDA stream is active only.
+# If multiple concurrent streams are active, the library may optimize total performance by picking different internal implementations.
+cublas_workspace_config: null # Possible values are: ":4096:8" or ":16:8"
 
 # Output folders
-exp_name: "default"
+exp_name: ${algo.name}_${env.id}
 run_name: ${now:%Y-%m-%d_%H-%M-%S}_${exp_name}_${seed}
 root_dir: ${algo.name}/${env.id}
 ```
 
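+As a rough guide, the reproducibility options above correspond to the following standard PyTorch switches (an illustrative sketch, not necessarily the exact code executed by sheeprl):
+
+```python
+import torch
+
+# Values mirror the defaults shown in the config above.
+torch.set_float32_matmul_precision("high")   # float32_matmul_precision
+torch.use_deterministic_algorithms(False)    # torch_use_deterministic_algorithms
+torch.backends.cudnn.benchmark = True        # torch_backends_cudnn_benchmark
+torch.backends.cudnn.deterministic = False   # torch_backends_cudnn_deterministic
+# cublas_workspace_config is only relevant when deterministic algorithms are requested, e.g.:
+# os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
+```
+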
+By default we want the user to specify the experiment config, represented by `- exp: ???` in the above example. The three question marks tell Hydra to expect that an `exp` config is specified at runtime by the user (e.g. `python sheeprl.py exp=dreamer_v3`; all the available experiment configs can be found in the `sheeprl/configs/exp/` folder).
+
 ### Algorithms
 
 In the `algo` folder one can find all the configurations for every algorithm implemented in sheeprl. Those configs contain all the hyperparameters specific to a particular algorithm. Let us have a look at the `dreamer_v3.yaml` config for example:
@@ -427,9 +512,13 @@ Given this config, one can easily run an experiment to test the Dreamer-V3 algor
 python sheeprl.py exp=dreamer_v3_100k_ms_pacman
 ```
 
+> [!WARNING]
+>
+> The default hyperparameters specified in the configs gathered by the experiment config (in this example the hyperparameters specified by `sheeprl/configs/exp/dreamer_v3.yaml`, `sheeprl/configs/env/atari.yaml`, and all the configs they pull in) will be overwritten by the values in the current config whenever a naming collision happens, i.e., when the same field is defined in both configurations. Those naming collisions are resolved by keeping the value defined in the current config. This behaviour is obtained by letting the `_self_` keyword be the last one in the `defaults` list.
+
 ### Fabric
 
-These configurations control the parameters to be passed to the [Fabric object](https://lightning.ai/docs/fabric/stable/api/generated/lightning.fabric.fabric.Fabric.html#lightning.fabric.fabric.Fabric). With those one can control whether to run the experiments on multiple devices, on which accelerator and with thich precision. For more information please have a look at the [Lightning documentation page](https://lightning.ai/docs/fabric/stable/api/fabric_args.html#).
+These configurations control the parameters to be passed to the [Fabric object](https://lightning.ai/docs/fabric/stable/api/generated/lightning.fabric.fabric.Fabric.html#lightning.fabric.fabric.Fabric). With those one can control whether to run the experiments on multiple devices, on which accelerator and with which precision. For more information please have a look at the [Lightning documentation page](https://lightning.ai/docs/fabric/stable/api/fabric_args.html#).
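+
+As an illustration, the values collected in the `fabric` config are the kind of arguments one would otherwise pass to Fabric by hand (the values below are made up):
+
+```python
+from lightning.fabric import Fabric
+
+# Roughly what a config selecting 2 CUDA devices with bf16 mixed precision amounts to.
+fabric = Fabric(accelerator="cuda", devices=2, strategy="ddp", precision="bf16-mixed")
+fabric.launch()
+```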
 
 ### Hydra
 
diff --git a/howto/learn_in_minedojo.md b/howto/learn_in_minedojo.md
index 1eea5a3f..db5916e7 100644
--- a/howto/learn_in_minedojo.md
+++ b/howto/learn_in_minedojo.md
@@ -25,6 +25,10 @@ Now, you can install the MineDojo environment:
 pip install -e .[minedojo]
 ```
 
+> [!WARNING]
+>
+> If you run into any problems during the installation due to some missing files that are not downloaded, please have a look at [this issue](https://github.com/MineDojo/MineDojo/issues/113).
+
 ## MineDojo environments
 > [!NOTE]
 >
diff --git a/howto/learn_in_minerl.md b/howto/learn_in_minerl.md
index 50d61a33..a259a84b 100644
--- a/howto/learn_in_minerl.md
+++ b/howto/learn_in_minerl.md
@@ -17,6 +17,10 @@ Now, you can install the MineRL environment:
 pip install -e .[minerl]
 ```
 
+> [!WARNING]
+>
+> If you run into any problems during the installation due to some missing files that are not downloaded, please have a look at [this issue](https://github.com/MineDojo/MineDojo/issues/113).
+
 ## MineRL environments
 We have modified the MineRL environments to have a custom action and observation space. We provide three different tasks:
 1. Navigate: you need to set the `env.id` argument to `custom_navigate`.
diff --git a/howto/register_external_algorithm.md b/howto/register_external_algorithm.md
index b01c7163..79db3d8a 100644
--- a/howto/register_external_algorithm.md
+++ b/howto/register_external_algorithm.md
@@ -387,6 +387,7 @@ def build_agent(
     for agent_p, player_p in zip(agent.critic.parameters(), player.critic.parameters()):
         player_p.data = agent_p.data
     return agent, player
+```
 
 ## Loss functions
 
diff --git a/howto/work_with_steps.md b/howto/work_with_steps.md
index 5bc62a8c..cdf87522 100644
--- a/howto/work_with_steps.md
+++ b/howto/work_with_steps.md
@@ -7,7 +7,7 @@ We start from the concept of *policy step*: a policy step is the particular step
 
 > [!NOTE]
 >
-> The environment step is the step performed by the environment: the environment takes in input an action and computes the next observation and the next reward.
+> The environment step is the step performed by the environment: the environment takes an action as input and computes the next observation and the next reward. This means that environment steps also take into account the **action repeat**, i.e., a value greater than or equal to 1 that specifies how many times an action has to be played (repeated by the environment), independently of the observations received.
 
 Now that we have introduced the concept of *policy step*, it is necessary to clarify some aspects:
 
@@ -18,17 +18,14 @@ In general, if we have $n$ parallel processes, each one with $m$ independent env
 
 The hyper-parameters that refer to the *policy steps* are:
 
-* `total_steps`: the total number of policy steps to perform in an experiment. Effectively, this number will be divided in each process by $n \cdot m$ to obtain the number of training steps to be performed by each of them.
+* `total_steps`: the total number of policy steps to perform in an experiment. Effectively, this number will be divided by $n \cdot m$ in each process to obtain the number of iterations to be performed by each of them.
 * `exploration_steps`: the number of policy steps in which the agent explores the environment in the P2E algorithms.
-* `max_episode_steps`: the maximum number of policy steps an episode can last (`max_steps`); when this number is reached a `terminated=True` is returned by the environment. This means that if you decide to have an action repeat greater than one (`action_repeat > 1`), then the environment performs a maximum number of steps equal to: `env_steps = max_steps * action_repeat`$.
-* `learning_starts`: how many policy steps the agent has to perform before starting the training.
+* `max_episode_steps`: the maximum number of policy steps an episode can last (`max_steps`); when this number is reached a `truncated=True` is returned by the environment. This means that if you decide to have an action repeat greater than one (`action_repeat > 1`), then the environment performs a maximum number of steps equal to `env_steps = max_steps * action_repeat` (see the sketch after this list).
+* `learning_starts`: how many policy steps the agent has to perform before starting the training. During the first `learning_starts` steps the buffer is pre-filled with random actions sampled from the environment's action space.
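+
+As a small worked example of how these quantities relate to each other (all numbers are illustrative):
+
+```python
+# n parallel processes, each with m independent environments.
+n, m = 2, 4
+action_repeat = 2
+total_steps = 1_000_000  # policy steps for the whole experiment
+
+policy_steps_per_iteration = n * m  # one action per environment per iteration
+iterations = total_steps // policy_steps_per_iteration  # iterations performed by each process
+env_steps = total_steps * action_repeat  # steps actually simulated by the environments
+print(iterations, env_steps)  # 125000 2000000
+```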
 
 ## Gradient steps
 A *gradient step* consists of an update of the parameters of the agent, i.e., a call of the *train* function. The gradient step is proportional to the number of parallel processes, indeed, if there are $n$ parallel processes, `n * per_rank_gradient_steps` calls to the *train* method will be executed.
 
 The hyper-parameters which refer to the *gradient steps* are:
 
 * `algo.per_rank_pretrain_steps`: the number of gradient steps per rank to perform in the first iteration.
-
-> [!NOTE]
->
-> The `replay_ratio` is the ratio between the gradient steps and the policy steps played by the agente.
\ No newline at end of file
+* `algo.replay_ratio`: the replay ratio is the ratio between the gradient steps and the policy steps played by the agent. The higher the replay ratio, the more sample-efficient the agent should be. The replay ratio is a global hyper-parameter that affects only off-policy algorithms like SAC or Dreamer and must be a float greater than zero. For example, a replay ratio of 0.5 means that the agent performs 1 gradient step every 2 policy steps. Note that the **replay ratio accounts for neither the environment's action repeat nor `algo.learning_starts`** (see the sketch below).
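+
+For instance, with purely illustrative numbers:
+
+```python
+replay_ratio = 0.5
+policy_steps_played = 10_000  # policy steps after learning_starts, ignoring action repeat
+
+# Expected number of gradient steps performed so far.
+gradient_steps = int(replay_ratio * policy_steps_played)
+print(gradient_steps)  # 5000
+```
\ No newline at end of file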