
Feature/mlflow #159

Merged
merged 39 commits into from
Nov 28, 2023
a222557
feat: added mlflow logger
michele-milesi Nov 13, 2023
febd2e8
feat: unified get_logger methods
michele-milesi Nov 13, 2023
dfeeb3f
feat: generalized model register
michele-milesi Nov 15, 2023
c8f9693
feat: removed signature
michele-milesi Nov 15, 2023
8806111
feat: added mlflow register model to sac, sac_decoupled and droq
michele-milesi Nov 15, 2023
e5dfc68
feat: added model manager to dreamers and sac_ae
michele-milesi Nov 15, 2023
7db942a
feat: added model manager to p2e algorithms
michele-milesi Nov 15, 2023
e778a9f
fix: removed order dependencies between configs and code when registe…
michele-milesi Nov 17, 2023
7b21ff4
fix: avoid p2e exploration models registered during finetuning
michele-milesi Nov 17, 2023
963e14d
Feature/add build agents (#153)
michele-milesi Nov 20, 2023
ef6644c
Merge branch 'main' of github.com:Eclectic-Sheep/sheeprl into feature…
michele-milesi Nov 20, 2023
32ff1e5
feat: split model manager configs
michele-milesi Nov 20, 2023
d0c69a4
feat: added script to register models from checkpoints
michele-milesi Nov 20, 2023
4c2ed7e
fix: bugs
michele-milesi Nov 20, 2023
25139e9
fix: configs
michele-milesi Nov 20, 2023
49612a7
fix: configs + registration model script
michele-milesi Nov 21, 2023
fe413f4
feat: added ensembles creation to build agent function (#154)
michele-milesi Nov 21, 2023
6f390ff
feat: added possibility to select experiment and run where to upload …
michele-milesi Nov 21, 2023
13f4c86
fix: bugs
michele-milesi Nov 21, 2023
598902b
feat: added configs to artifact when model is registered from checkpoint
michele-milesi Nov 22, 2023
5200466
docs: update logs_and_checkpoints how to
michele-milesi Nov 22, 2023
d002e0a
Merge branch 'main' of github.com:Eclectic-Sheep/sheeprl into feature…
michele-milesi Nov 22, 2023
61b14d9
Merge branch 'main' of github.com:Eclectic-Sheep/sheeprl into feature…
michele-milesi Nov 22, 2023
dff511a
feat: added model_manager howto
michele-milesi Nov 22, 2023
d8a0c94
docs: update
michele-milesi Nov 22, 2023
4bcc43a
docs: update
michele-milesi Nov 22, 2023
80a6443
fix: added 'from __future__ import annotations'
michele-milesi Nov 23, 2023
095c31c
Merge branch 'main' of github.com:Eclectic-Sheep/sheeprl into feature…
michele-milesi Nov 23, 2023
db9dee1
feat: added mlflow model manager tutorial in examples
michele-milesi Nov 24, 2023
3da0595
fix: bugs
michele-milesi Nov 24, 2023
c22dab0
merge: main into feature/mlflow
michele-milesi Nov 24, 2023
bb04242
fix: access to cnn and mlp keys
michele-milesi Nov 24, 2023
c88f345
fix: experiment and run names
michele-milesi Nov 27, 2023
4b857e4
fix: bugs
michele-milesi Nov 27, 2023
2fca27d
feat: MlflowModelManager.register_best_models() function
michele-milesi Nov 27, 2023
4de3801
fix: p2e build_agent
michele-milesi Nov 27, 2023
75724e7
docs: update
michele-milesi Nov 28, 2023
2f9e6af
fix: mlflow model manager
michele-milesi Nov 28, 2023
8816232
fix: mlflow model manager register best models
michele-milesi Nov 28, 2023
5 changes: 4 additions & 1 deletion .gitignore
@@ -167,4 +167,7 @@ pytest_*
!sheeprl/configs/env
.diambra*
.hydra
.pypirc
.pypirc
mlruns
mlartifacts
examples/models
978 changes: 978 additions & 0 deletions examples/model_manager.ipynb

Large diffs are not rendered by default.

63 changes: 63 additions & 0 deletions howto/logs_and_checkpoints.md
@@ -7,6 +7,10 @@ By default the logging of metrics is enabled with the following settings:
```yaml
# ./sheeprl/configs/metric/default.yaml

defaults:
- _self_
- /logger@logger: tensorboard

log_every: 5000
disable_timer: False

@@ -33,6 +37,7 @@ aggregator:
```
where
* `logger` is the configuration of the logger you want to use for logging. There are two possible values: `tensorboard` (default) and `mlflow`, but one can also define and use a custom logger.
* `log_every` is the number of policy steps (number of steps played in the environment, e.g. if one has 2 processes with 4 environments per process then the policy steps are 2*4=8) between two consecutive logging operations. For more info about the policy steps, check the [Work with Steps Tutorial](./work_with_steps.md).
* `disable_timer` is a boolean flag that enables/disables the timer to measure both the time spent in the environment and the time spent during the agent training. The timer class used can be found [here](../sheeprl/utils/timer.py).
* `log_level` is the level of logging: $0$ means no logging (it also disables the timer), whereas $1$ means logging everything.
@@ -41,6 +46,64 @@ where

So, to disable everything related to logging, set `log_level` to $0$; to disable only the timer, set `disable_timer` to `True`.
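The policy-step bookkeeping behind `log_every` can be sketched in plain Python. The numbers below are the hypothetical 2-process, 4-environment setup from the example above; this is an illustration, not sheeprl's actual code:

```python
# Policy steps advance by (num_processes * num_envs_per_process) at every
# environment interaction, so log_every is measured in environment steps.
num_processes = 2
num_envs_per_process = 4
policy_steps_per_iteration = num_processes * num_envs_per_process  # 2 * 4 = 8

# With the default log_every=5000, two consecutive logging operations are
# separated by roughly this many training iterations:
log_every = 5000
iterations_between_logs = -(-log_every // policy_steps_per_iteration)  # ceil division
print(policy_steps_per_iteration, iterations_between_logs)  # 8 625
```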

### Loggers
Two loggers are made available: the Tensorboard logger and the MLFlow one; in any case, it is possible to define and use your own logger.
The configurations of the loggers are under the `./sheeprl/configs/logger/` folder.

#### Tensorboard
Let us start with the Tensorboard logger, which is the default logger used in SheepRL.

```yaml
# ./sheeprl/configs/logger/tensorboard.yaml
# For more information, check https://lightning.ai/docs/fabric/stable/api/generated/lightning.fabric.loggers.TensorBoardLogger.html
_target_: lightning.fabric.loggers.TensorBoardLogger
name: ${run_name}
root_dir: logs/runs/${root_dir}
version: null
default_hp_metric: True
prefix: ""
sub_dir: null
```
As shown in the configuration, it is necessary to specify the `_target_` class to instantiate. For the Tensorboard logger, the `name` and `root_dir` arguments must be set to the `run_name` and `logs/runs/<root_dir>` parameters, respectively, because we want all the logs and files (configs, checkpoints, videos, ...) of a specific experiment to be under the same folder.

> **Note**
>
> In general, we want the log files to be in the same folder created by Hydra when the experiment is launched, so make sure to properly define the `root_dir` and `name` parameters of the logger so that the path is within the folder created by Hydra (defined by the `hydra.run.dir` parameter). The Tensorboard logger will save the logs in the `<root_dir>/<name>/<version>/<sub_dir>/` folder (if `sub_dir` is defined, otherwise in the `<root_dir>/<name>/<version>/` folder).

The documentation of the TensorboardLogger class can be found [here](https://lightning.ai/docs/fabric/stable/api/generated/lightning.fabric.loggers.TensorBoardLogger.html).
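The folder layout described in the note can be sketched as follows. Only the `<root_dir>/<name>/<version>/<sub_dir>` composition rule comes from the note above; the helper function and the example values are hypothetical:

```python
def tensorboard_log_dir(root_dir, name, version, sub_dir=None):
    # <root_dir>/<name>/<version>/<sub_dir> if sub_dir is set,
    # else <root_dir>/<name>/<version>.
    parts = [root_dir, name, version]
    if sub_dir is not None:
        parts.append(sub_dir)
    return "/".join(parts)

# Hypothetical experiment folder created by Hydra:
print(tensorboard_log_dir("logs/runs/dreamer_v3", "2023-11-28_10-00-00_my_exp", "version_0"))
# logs/runs/dreamer_v3/2023-11-28_10-00-00_my_exp/version_0
```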

#### MLFlow
Another possibility provided by SheepRL is [MLFlow](https://mlflow.org/docs/2.8.0/index.html).

```yaml
# ./sheeprl/configs/logger/mlflow.yaml
# For more information, check https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.mlflow.html#lightning.pytorch.loggers.mlflow.MLFlowLogger
_target_: lightning.pytorch.loggers.MLFlowLogger
experiment_name: ${exp_name}
tracking_uri: ${oc.env:MLFLOW_TRACKING_URI}
run_name: ${algo.name}_${env.id}_${now:%Y-%m-%d_%H-%M-%S}
tags: null
save_dir: null
prefix: ""
artifact_location: null
run_id: null
log_model: false
```

The parameters that can be specified for creating the MLFlow logger are explained [here](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.mlflow.html#lightning.pytorch.loggers.mlflow.MLFlowLogger).

You can specify the MLFlow logger instead of the Tensorboard one from the CLI by adding the `[email protected]=mlflow` argument. In this way, Hydra will take the configurations defined in the `./sheeprl/configs/logger/mlflow.yaml` file.

```bash
python sheeprl.py exp=ppo exp_name=ppo-cartpole [email protected]=mlflow
```

> **Note**
>
> If you are using an MLFlow server, you can specify the `tracking_uri` in the config file or with the `MLFLOW_TRACKING_URI` environment variable (that is the default value in the configs).
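The `${oc.env:MLFLOW_TRACKING_URI}` interpolation simply reads the environment variable when the config is composed. A plain-Python equivalent of that lookup is sketched below; the URI value is purely illustrative:

```python
import os

# OmegaConf's ${oc.env:MLFLOW_TRACKING_URI} resolves to this environment variable;
# if it is unset, config composition fails with an interpolation error.
os.environ.setdefault("MLFLOW_TRACKING_URI", "http://localhost:5000")  # illustrative value
tracking_uri = os.environ["MLFLOW_TRACKING_URI"]
print(tracking_uri)
```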

### Logged metrics

Every algorithm should specify a set of default metrics to log, called `AGGREGATOR_KEYS`, under its own `utils.py` file. For instance, the default metrics logged by DreamerV2 are the following:
103 changes: 103 additions & 0 deletions howto/model_manager.md
@@ -0,0 +1,103 @@
# Model Manager

SheepRL makes it possible to register trained models on MLFlow, so as to keep track of model versions and stages.

## Register models with training
The configurations of the model manager are placed in the `./sheeprl/configs/model_manager/` folder, and the default configuration is defined as follows:
```yaml
# ./sheeprl/configs/model_manager/default.yaml

disabled: True
models: {}
```
Since the algorithms have different models, the `models` parameter is set to an empty Python dictionary, and each agent defines its own configuration. The `disabled` parameter indicates whether the user wants to register the agent when the training is finished (`False` means that the agent will be registered, otherwise it will not).
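A minimal sketch of how these two parameters gate registration (illustrative only, not sheeprl's actual implementation):

```python
# Default model manager configuration from above.
model_manager_cfg = {"disabled": True, "models": {}}

def should_register(cfg):
    # Registration happens only when it is enabled and at least one model is configured.
    return (not cfg["disabled"]) and bool(cfg["models"])

print(should_register(model_manager_cfg))  # False with the defaults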

> **Note**
>
> The model manager can be used even if the chosen logger is Tensorboard; the only requirement is that an instance of the MLFlow server is running and accessible. Moreover, it is necessary to specify its URI in the `MLFLOW_TRACKING_URI` environment variable.

To better understand how to define the configurations of the models you want to register, take a look at the DreamerV3 model manager configuration:
```yaml
# ./sheeprl/configs/model_manager/dreamer_v3.yaml
defaults:
- default
- _self_
models:
world_model:
model_name: "${exp_name}_world_model"
description: "DreamerV3 World Model used in ${env.id} Environment"
tags: {}
actor:
model_name: "${exp_name}_actor"
description: "DreamerV3 Actor used in ${env.id} Environment"
tags: {}
critic:
model_name: "${exp_name}_critic"
description: "DreamerV3 Critic used in ${env.id} Environment"
tags: {}
target_critic:
model_name: "${exp_name}_target_critic"
description: "DreamerV3 Target Critic used in ${env.id} Environment"
tags: {}
moments:
model_name: "${exp_name}_moments"
description: "DreamerV3 Moments used in ${env.id} Environment"
tags: {}
```
For each model, it is necessary to define the `model_name`, the `description`, and the `tags` (i.e., a python dictionary with strings as keys and values). The keys that can be specified are defined by the `MODELS_TO_REGISTER` variable in the `./sheeprl/algos/<algo_name>/utils.py`. For DreamerV3, it is defined as follows: `MODELS_TO_REGISTER = {"world_model", "actor", "critic", "target_critic", "moments"}`.
If you do not want to register some models, just remove them from the configuration file.

> **Note**
>
> The names of the models in the `MODELS_TO_REGISTER` variable match the names of the model variables in the `./sheeprl/algos/<algo_name>/<algo_name>.py` file.
>
> Make sure that the models specified in the configuration file are a subset of the models defined by the `MODELS_TO_REGISTER` variable.
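The subset requirement in the note can be sketched as a simple validation step. The set comes from the DreamerV3 example above; the check itself is illustrative, not sheeprl's actual code:

```python
# Keys accepted for DreamerV3, as defined in sheeprl/algos/dreamer_v3/utils.py.
MODELS_TO_REGISTER = {"world_model", "actor", "critic", "target_critic", "moments"}

# Models listed in a hypothetical model_manager configuration (details omitted).
configured_models = {"world_model", "actor", "critic"}

unknown = configured_models - MODELS_TO_REGISTER
if unknown:
    raise ValueError(f"Unknown models in configuration: {sorted(unknown)}")
print(sorted(configured_models))
```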

## Register models from checkpoints
Another possibility is to register the models after the training, by manually selecting the checkpoint where to retrieve the agent. To do this, it is possible to run the `sheeprl_model_manager.py` script by properly specifying the `checkpoint_path`, the `model_manager`, and the MLFlow-related configurations.
The default configurations are defined in the `./sheeprl/configs/model_manager_config.yaml` file, reported below:
```yaml
# ./sheeprl/configs/model_manager_config.yaml
# @package _global_
defaults:
- _self_
- model_manager: ???
- override hydra/hydra_logging: disabled
- override hydra/job_logging: disabled
hydra:
output_subdir: null
run:
dir: .
checkpoint_path: ???
run:
id: null
name: ${now:%Y-%m-%d_%H-%M-%S}_${exp_name}
experiment:
id: null
name: ${exp_name}_${now:%Y-%m-%d_%H-%M-%S}
tracking_uri: ${oc.env:MLFLOW_TRACKING_URI}
```

As before, it is necessary to specify the `model_manager` configurations (the models we want to register with names, descriptions, and tags). Moreover, it is mandatory to set the `checkpoint_path`, which must be the path to the `ckpt` file created during the training. Finally, the `run` and `experiment` parameters contain the MLFlow configurations:
* If you set `run.id` to a value different from `null`, all the other parameters are ignored: the models will be logged and registered under the run with the specified ID.
* If you want to create a new run (with a name equal to `run.name`) and put it into an existing experiment, then you have to set `run.id=null` and `experiment.id=<experiment_id>`.
* If you set `experiment.id=null` and `run.id=null`, then a new experiment and a new run are created with the specified names.
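The three cases above follow a simple precedence order, sketched here. The function and its return values are hypothetical illustrations, not sheeprl's actual API:

```python
def resolve_mlflow_target(run_id, experiment_id, run_name, experiment_name):
    # run.id wins over everything: log into the existing run.
    if run_id is not None:
        return ("existing_run", run_id)
    # Otherwise a new run is created; experiment.id decides where it lives.
    if experiment_id is not None:
        return ("new_run_in_experiment", experiment_id, run_name)
    # With both ids null, a new experiment and a new run are created.
    return ("new_experiment_and_run", experiment_name, run_name)

print(resolve_mlflow_target(None, None, "my_run", "my_exp"))
# ('new_experiment_and_run', 'my_exp', 'my_run')
```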

> **Note**
>
> Also, in this case, the models specified in the `model_manager` configuration must be a subset of the `MODELS_TO_REGISTER` variable.

For instance, you can register the DreamerV3 models from a checkpoint with the following command:

```bash
python sheeprl_model_manager.py model_manager=dreamer_v3 checkpoint_path=/path/to/checkpoint.ckpt
```

## Delete, Transition and Download Models
The MLFlow model manager enables the deletion of the registered models, moving them from one stage to another or downloading them.
[This notebook](../examples/model_manager.ipynb) contains a tutorial on how to use the MLFlow model manager. We recommend taking a look to see what APIs the model manager makes available.