Update howtos #281

Merged
merged 3 commits into from
May 9, 2024
13 changes: 13 additions & 0 deletions howto/add_environment.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,19 @@ The main properties/methods that the environment has to provide are the followin
>
> All the observations returned by the `step` and `reset` functions must be Python dictionaries of NumPy arrays.

## About observation and action spaces

> [!NOTE]
>
> Please remember that every environment is considered independent of any other and is supposed to interact with a single agent, as Multi-Agent Reinforcement Learning (MARL) is not currently supported.

The currently supported observation shapes are:

* 1D vector: everything that is a 1D vector is processed by the agent with an MLP.
* 2D/3D images: everything that is not a 1D vector is processed by the agent with a CNN. A 2D image, or a 3D image of shape `[H,W,1]` or `[1,H,W]`, is treated as a grayscale image; any other image is treated as multi-channel.

An action of type `gymnasium.spaces.Box` must be of shape `(n,)`, where `n` is the number of (possibly continuous) actions the environment supports.
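
The shape rules above can be sketched as a small check. This is only an illustration under stated assumptions: `classify_observation` and `is_valid_box_action_shape` are hypothetical helpers, not part of sheeprl, and rejecting shapes with more than three dimensions is an assumption, since only 1D vectors and 2D/3D images are listed as supported.

```python
from typing import Sequence


def classify_observation(shape: Sequence[int]) -> str:
    """Describe how an observation of the given shape would be treated.

    Hypothetical helper: it mirrors the rules stated in the text,
    it is not taken from the sheeprl code base.
    """
    ndim = len(shape)
    if ndim == 1:
        return "vector -> MLP"
    if ndim == 2:
        # A plain [H, W] image has no channel axis: grayscale.
        return "grayscale image -> CNN"
    if ndim == 3:
        # [H, W, 1] and [1, H, W] carry a single channel: grayscale.
        if shape[0] == 1 or shape[-1] == 1:
            return "grayscale image -> CNN"
        return "multi-channel image -> CNN"
    # Assumption: anything else is outside the supported shapes.
    raise ValueError(f"unsupported observation shape: {tuple(shape)}")


def is_valid_box_action_shape(shape: Sequence[int]) -> bool:
    """A Box action space is expected to have shape (n,), with n >= 1."""
    return len(shape) == 1 and shape[0] >= 1


if __name__ == "__main__":
    for shape in [(4,), (64, 64), (64, 64, 1), (1, 64, 64), (3, 64, 64)]:
        print(shape, "->", classify_observation(shape))
```

A check of this kind can be run on every entry of the observation dictionary when a custom environment is first wired in, so that a wrongly shaped key is caught before training starts.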

# Add a new Environment
There are two ways to add a new environment:
1. Create from scratch a custom environment by inheriting from the [`gymnasium.Env`](https://gymnasium.farama.org/api/env/#gymnasium-env) class.
121 changes: 105 additions & 16 deletions howto/configs.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,19 +14,26 @@ This document explains how the configuration files and folders are structured. I
```tree
sheeprl/configs
├── algo
│ ├── a2c.yaml
│ ├── default.yaml
│ ├── dreamer_v1.yaml
│ ├── dreamer_v2.yaml
│ ├── dreamer_v3_L.yaml
│ ├── dreamer_v3_M.yaml
│ ├── dreamer_v3_S.yaml
│ ├── dreamer_v3_XL.yaml
│ ├── dreamer_v3_XS.yaml
│ ├── dreamer_v3.yaml
│ ├── droq.yaml
│ ├── p2e_dv1.yaml
│ ├── p2e_dv2.yaml
│ ├── ppo.yaml
│ ├── p2e_dv3.yaml
│ ├── ppo_decoupled.yaml
│ ├── ppo_recurrent.yaml
│ ├── sac.yaml
│ ├── ppo.yaml
│ ├── sac_ae.yaml
│ └── sac_decoupled.yaml
│ ├── sac_decoupled.yaml
│ └── sac.yaml
├── buffer
│ └── default.yaml
├── checkpoint
Expand All @@ -44,40 +51,86 @@ sheeprl/configs
│ ├── gym.yaml
│ ├── minecraft.yaml
│ ├── minedojo.yaml
│ └── minerl.yaml
│ ├── minerl_obtain_diamond.yaml
│ ├── minerl_obtain_iron_pickaxe.yaml
│ ├── minerl.yaml
│ ├── mujoco.yaml
│ └── super_mario_bros.yaml
├── env_config.yaml
├── eval_config.yaml
├── exp
│ ├── a2c_benchmarks.yaml
│ ├── a2c.yaml
│ ├── default.yaml
│ ├── dreamer_v1_benchmarks.yaml
│ ├── dreamer_v1.yaml
│ ├── dreamer_v2.yaml
│ ├── dreamer_v2_benchmarks.yaml
│ ├── dreamer_v2_crafter.yaml
│ ├── dreamer_v2_ms_pacman.yaml
│ ├── dreamer_v3.yaml
│ ├── dreamer_v2.yaml
│ ├── dreamer_v3_100k_boxing.yaml
│ ├── dreamer_v3_100k_ms_pacman.yaml
│ ├── dreamer_v3_L_doapp.yaml
│ ├── dreamer_v3_benchmarks.yaml
│ ├── dreamer_v3_dmc_cartpole_swingup_sparse.yaml
│ ├── dreamer_v3_dmc_walker_walk.yaml
│ ├── dreamer_v3_L_doapp_128px_gray_combo_discrete.yaml
│ ├── dreamer_v3_L_doapp.yaml
│ ├── dreamer_v3_L_navigate.yaml
│ ├── dreamer_v3_super_mario_bros.yaml
│ ├── dreamer_v3_XL_crafter.yaml
│ ├── dreamer_v3_dmc_walker_walk.yaml
│ ├── dreamer_v3.yaml
│ ├── droq.yaml
│ ├── p2e_dv1.yaml
│ ├── p2e_dv2.yaml
│ ├── ppo.yaml
│ ├── p2e_dv1_exploration.yaml
│ ├── p2e_dv1_finetuning.yaml
│ ├── p2e_dv2_exploration.yaml
│ ├── p2e_dv2_finetuning.yaml
│ ├── p2e_dv3_expl_L_doapp_128px_gray_combo_discrete_15Mexpl_20Mstps.yaml
│ ├── p2e_dv3_exploration.yaml
│ ├── p2e_dv3_finetuning.yaml
│ ├── p2e_dv3_fntn_L_doapp_64px_gray_combo_discrete_5Mstps.yaml
│ ├── ppo_benchmarks.yaml
│ ├── ppo_decoupled.yaml
│ ├── ppo_recurrent.yaml
│ ├── sac.yaml
│ ├── ppo_super_mario_bros.yaml
│ ├── ppo.yaml
│ ├── sac_ae.yaml
│ └── sac_decoupled.yaml
│ ├── sac_benchmarks.yaml
│ ├── sac_decoupled.yaml
│ └── sac.yaml
├── fabric
│ ├── ddp-cpu.yaml
│ ├── ddp-cuda.yaml
│ └── default.yaml
├── hydra
│ └── default.yaml
├── __init__.py
├── logger
│ ├── mlflow.yaml
│ └── tensorboard.yaml
├── metric
│ └── default.yaml
├── model_manager
│ ├── a2c.yaml
│ ├── default.yaml
│ ├── dreamer_v1.yaml
│ ├── dreamer_v2.yaml
│ ├── dreamer_v3.yaml
│ ├── droq.yaml
│ ├── p2e_dv1_exploration.yaml
│ ├── p2e_dv1_finetuning.yaml
│ ├── p2e_dv2_exploration.yaml
│ ├── p2e_dv2_finetuning.yaml
│ ├── p2e_dv3_exploration.yaml
│ ├── p2e_dv3_finetuning.yaml
│ ├── ppo_recurrent.yaml
│ ├── ppo.yaml
│ ├── sac_ae.yaml
│ └── sac.yaml
├── model_manager_config.yaml
└── optim
├── adam.yaml
├── rmsprop_tf.yaml
├── rmsprop.yaml
└── sgd.yaml
```

Expand All @@ -102,24 +155,56 @@ defaults:
- env: default.yaml
- fabric: default.yaml
- metric: default.yaml
- model_manager: default.yaml
- hydra: default.yaml
- exp: ???

num_threads: 1
float32_matmul_precision: "high"

# Set it to True to run a single optimization step
dry_run: False

# Reproducibility
seed: 42
torch_deterministic: False

# For more information about reproducibility in PyTorch, see https://pytorch.org/docs/stable/notes/randomness.html

# torch.use_deterministic_algorithms() lets you configure PyTorch to use deterministic algorithms
# instead of nondeterministic ones where available,
# and to throw an error if an operation is known to be nondeterministic (and without a deterministic alternative).
torch_use_deterministic_algorithms: False

# Disabling the benchmarking feature with torch.backends.cudnn.benchmark = False
# causes cuDNN to deterministically select an algorithm, possibly at the cost of reduced performance.
# However, if you do not need reproducibility across multiple executions of your application,
# then performance might improve if the benchmarking feature is enabled with torch.backends.cudnn.benchmark = True.
torch_backends_cudnn_benchmark: True

# While disabling CUDA convolution benchmarking (discussed above) ensures that CUDA selects the same algorithm each time an application is run,
# that algorithm itself may be nondeterministic, unless either torch.use_deterministic_algorithms(True)
# or torch.backends.cudnn.deterministic = True is set.
# The latter setting controls only this behavior,
# unlike torch.use_deterministic_algorithms() which will make other PyTorch operations behave deterministically, too.
torch_backends_cudnn_deterministic: False

# From: https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility
# By design, all cuBLAS API routines from a given toolkit version, generate the same bit-wise results at every run
# when executed on GPUs with the same architecture and the same number of SMs.
# However, bit-wise reproducibility is not guaranteed across toolkit versions
# because the implementation might differ due to some implementation changes.
# This guarantee holds when a single CUDA stream is active only.
# If multiple concurrent streams are active, the library may optimize total performance by picking different internal implementations.
cublas_workspace_config: null # Possible values are: ":4096:8" or ":16:8"

# Output folders
exp_name: "default"
exp_name: ${algo.name}_${env.id}
run_name: ${now:%Y-%m-%d_%H-%M-%S}_${exp_name}_${seed}
root_dir: ${algo.name}/${env.id}
```

By default we want the user to specify the experiment config, represented by `- exp: ???` in the above example. The three-question-marks symbol tells Hydra to expect an `exp` config to be specified at runtime by the user (e.g. `sheeprl.py exp=dreamer_v3`; all the available exp configs can be found in the `sheeprl/configs/exp/` folder).
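
The output-folder entries in the config above (`exp_name`, `run_name`, `root_dir`) are built from other fields through `${...}` interpolation. The sketch below illustrates the idea with plain dictionaries. It is not how Hydra or OmegaConf implement it: `lookup` and `resolve` are hypothetical helpers, the values in `cfg` are made up for the example, and resolver expressions such as `${now:...}` are deliberately left untouched.

```python
import re
from typing import Any, Dict

# Matches ${dotted.key}; resolver calls such as ${now:%Y} contain ':' and do not match.
_PATTERN = re.compile(r"\$\{([A-Za-z_][\w.]*)\}")


def lookup(cfg: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a dotted key such as 'algo.name' through nested dictionaries."""
    node: Any = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(cfg: Dict[str, Any], value: str) -> str:
    """Replace every ${dotted.key} in `value`, following nested references."""

    def _substitute(match: "re.Match[str]") -> str:
        return resolve(cfg, str(lookup(cfg, match.group(1))))

    return _PATTERN.sub(_substitute, value)


# Illustrative values only.
cfg = {
    "algo": {"name": "ppo"},
    "env": {"id": "CartPole-v1"},
    "seed": 42,
    "exp_name": "${algo.name}_${env.id}",
    "root_dir": "${algo.name}/${env.id}",
}

if __name__ == "__main__":
    print(resolve(cfg, cfg["exp_name"]))
    print(resolve(cfg, cfg["root_dir"]))
    print(resolve(cfg, "${exp_name}_${seed}"))
```

Because references are resolved recursively, changing `algo.name` or `env.id` is enough to change every derived folder name consistently.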

### Algorithms

In the `algo` folder one can find all the configurations for every algorithm implemented in sheeprl. Those configs contain all the hyperparameters specific to a particular algorithm. Let us have a look at the `dreamer_v3.yaml` config for example:
Expand Down Expand Up @@ -427,9 +512,13 @@ Given this config, one can easily run an experiment to test the Dreamer-V3 algor
python sheeprl.py exp=dreamer_v3_100k_ms_pacman
```

> [!WARNING]
>
> The current config takes precedence over the configs it gathers (in this example `sheeprl/configs/exp/dreamer_v3.yaml`, `sheeprl/configs/env/atari.yaml` and all the configs they bring in): whenever the same field is defined in both, the value in the current config overwrites the default one. This behaviour is obtained by placing the `_self_` keyword last in the `defaults` list.
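
The precedence described in the warning can be pictured as a recursive dictionary merge in which the later config wins on every collision. The sketch below is a simplified illustration, not Hydra's actual composition logic: `merge` is a hypothetical helper and the values are invented for the example.

```python
from copy import deepcopy
from typing import Any, Dict


def merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` updated with `override`; on a collision `override` wins."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are mappings: merge them key by key.
            result[key] = merge(result[key], value)
        else:
            # Leaf value or type mismatch: the overriding config wins.
            result[key] = deepcopy(value)
    return result


# Hypothetical values, only to show the precedence.
inherited = {
    "algo": {"name": "dreamer_v3", "total_steps": 5_000_000},
    "env": {"id": "default"},
}
current = {
    "algo": {"total_steps": 100_000},
    "env": {"id": "MsPacmanNoFrameskip-v4"},
}

final = merge(inherited, current)

if __name__ == "__main__":
    print(final)
```

Fields that appear only in the inherited configs (here `algo.name`) survive unchanged, while colliding fields take the value from the current config, which is the effect of listing `_self_` last.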

### Fabric

These configurations control the parameters to be passed to the [Fabric object](https://lightning.ai/docs/fabric/stable/api/generated/lightning.fabric.fabric.Fabric.html#lightning.fabric.fabric.Fabric). With those one can control whether to run the experiments on multiple devices, on which accelerator and with thich precision. For more information please have a look at the [Lightning documentation page](https://lightning.ai/docs/fabric/stable/api/fabric_args.html#).
These configurations control the parameters to be passed to the [Fabric object](https://lightning.ai/docs/fabric/stable/api/generated/lightning.fabric.fabric.Fabric.html#lightning.fabric.fabric.Fabric). With those one can control whether to run the experiments on multiple devices, on which accelerator and with which precision. For more information please have a look at the [Lightning documentation page](https://lightning.ai/docs/fabric/stable/api/fabric_args.html#).

### Hydra

4 changes: 4 additions & 0 deletions howto/learn_in_minedojo.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,10 @@ Now, you can install the MineDojo environment:
pip install -e .[minedojo]
```

> [!WARNING]
>
> If you run into any problems during the installation due to some missing files that are not downloaded, please have a look at [this issue](https://github.com/MineDojo/MineDojo/issues/113).

## MineDojo environments
> [!NOTE]
>
4 changes: 4 additions & 0 deletions howto/learn_in_minerl.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,10 @@ Now, you can install the MineRL environment:
pip install -e .[minerl]
```

> [!WARNING]
>
> If you run into any problems during the installation due to some missing files that are not downloaded, please have a look at [this issue](https://github.com/MineDojo/MineDojo/issues/113).

## MineRL environments
We have modified the MineRL environments to have a custom action and observation space. We provide three different tasks:
1. Navigate: you need to set the `env.id` argument to `custom_navigate`.
1 change: 1 addition & 0 deletions howto/register_external_algorithm.md
Original file line number Diff line number Diff line change
Expand Up @@ -387,6 +387,7 @@ def build_agent(
for agent_p, player_p in zip(agent.critic.parameters(), player.critic.parameters()):
player_p.data = agent_p.data
return agent, player
```

## Loss functions

13 changes: 5 additions & 8 deletions howto/work_with_steps.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ We start from the concept of *policy step*: a policy step is the particular step

> [!NOTE]
>
> The environment step is the step performed by the environment: the environment takes in input an action and computes the next observation and the next reward.
> The environment step is the step performed by the environment: the environment takes an action as input and computes the next observation and the next reward. This means that the environment steps also take into account the **action repeat**, a value greater than or equal to 0 that specifies how many times an action is played (repeated by the environment), regardless of the observations received.

Now that we have introduced the concept of *policy step*, it is necessary to clarify some aspects:

Expand All @@ -18,17 +18,14 @@ In general, if we have $n$ parallel processes, each one with $m$ independent env

The hyper-parameters that refer to the *policy steps* are:

* `total_steps`: the total number of policy steps to perform in an experiment. Effectively, this number will be divided in each process by $n \cdot m$ to obtain the number of training steps to be performed by each of them.
* `total_steps`: the total number of policy steps to perform in an experiment. Effectively, this number will be divided in each process by $n \cdot m$ to obtain the number of iteration steps to be performed by each of them.
* `exploration_steps`: the number of policy steps in which the agent explores the environment in the P2E algorithms.
* `max_episode_steps`: the maximum number of policy steps an episode can last (`max_steps`); when this number is reached a `terminated=True` is returned by the environment. This means that if you decide to have an action repeat greater than one (`action_repeat > 1`), then the environment performs a maximum number of steps equal to: `env_steps = max_steps * action_repeat`$.
* `learning_starts`: how many policy steps the agent has to perform before starting the training.
* `max_episode_steps`: the maximum number of policy steps an episode can last (`max_steps`); when this number is reached, the environment returns `truncated=True`. This means that if you choose an action repeat greater than one (`action_repeat > 1`), the environment performs at most `env_steps = max_steps * action_repeat` steps.
* `learning_starts`: how many policy steps the agent has to perform before starting the training. During the first `learning_starts` steps the buffer is pre-filled with random actions sampled by the environment.
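
The arithmetic behind `total_steps` and `max_episode_steps` can be written down in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and the use of integer division for the per-process iteration count is an assumption rather than something taken from the sheeprl code.

```python
def iterations_per_process(total_steps: int, n_processes: int, envs_per_process: int) -> int:
    """Iterations each process runs: total_steps divided by n * m.

    With n processes and m environments each, one iteration yields
    n * m policy steps overall (integer division is an assumption).
    """
    policy_steps_per_iteration = n_processes * envs_per_process
    return total_steps // policy_steps_per_iteration


def max_env_steps(max_episode_steps: int, action_repeat: int) -> int:
    """Upper bound on environment steps per episode: max_steps * action_repeat."""
    return max_episode_steps * action_repeat


if __name__ == "__main__":
    # 1024 policy steps, 2 processes with 4 environments each.
    print(iterations_per_process(1024, n_processes=2, envs_per_process=4))
    # Episodes capped at 1000 policy steps, each action repeated 4 times.
    print(max_env_steps(1000, action_repeat=4))
```

With these example numbers each process runs 128 iterations, and an episode can last up to 4000 environment steps.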

## Gradient steps
A *gradient step* consists of an update of the parameters of the agent, i.e., a call of the *train* function. The number of gradient steps is proportional to the number of parallel processes: if there are $n$ parallel processes, `n * per_rank_gradient_steps` calls to the *train* method will be executed.

The hyper-parameters which refer to the *gradient steps* are:
* `algo.per_rank_pretrain_steps`: the number of gradient steps per rank to perform in the first iteration.

> [!NOTE]
>
> The `replay_ratio` is the ratio between the gradient steps and the policy steps played by the agente.
* `algo.replay_ratio`: the replay ratio is the ratio between the gradient steps and the policy steps played by the agent. The higher the replay ratio, the more sample-efficient the agent should be. The replay ratio is a global hyper-parameter that affects only the off-policy algorithms, such as SAC or Dreamer, and must be a float greater than zero. For example, a replay ratio of 0.5 means that the agent performs 1 gradient step every 2 policy steps. The **replay ratio accounts for neither the environment's action repeat nor `algo.learning_starts`**.
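
One way to picture the replay ratio is a counter that accumulates fractional gradient steps as policy steps are played. The class below is a hypothetical sketch, not sheeprl's implementation: it does not model `algo.learning_starts`, the action repeat, or how the steps are divided among ranks.

```python
class ReplayRatioCounter:
    """Hypothetical helper that converts policy steps into gradient steps."""

    def __init__(self, replay_ratio: float) -> None:
        if replay_ratio <= 0:
            raise ValueError("replay_ratio must be a float greater than zero")
        self.replay_ratio = replay_ratio
        self._pending = 0.0  # fractional gradient steps carried to the next call

    def gradient_steps(self, policy_steps: int) -> int:
        """Return how many gradient steps are due after `policy_steps` new policy steps."""
        self._pending += self.replay_ratio * policy_steps
        due = int(self._pending)
        self._pending -= due
        return due


if __name__ == "__main__":
    counter = ReplayRatioCounter(0.5)
    # One policy step at a time: a gradient step is due every second call.
    print([counter.gradient_steps(1) for _ in range(4)])
```

With a replay ratio of 0.5 the counter yields one gradient step for every two policy steps, matching the example in the text; a ratio above 1 would yield several gradient steps per policy step.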