Fix checkpoint doc #1445

Merged
merged 4 commits into from
Sep 4, 2023
33 changes: 20 additions & 13 deletions documentation/source/Checkpoints.md
@@ -90,22 +90,28 @@ trainer.train(model=model, training_params=train_params, train_loader=train_data
Then at the end of the training, our `ckpt_root_dir` contents will look similar to the following:

```
my_checkpoints_folder
├─── my_resnet18_training_experiment
│ ├── RUN_20230802_131052_651906
│ │ ├─ ckpt_best.pth # Model checkpoint on best epoch
│ │ ├─ ckpt_latest.pth # Model checkpoint on last epoch
│ │ ├─ average_model.pth # Model checkpoint averaged over epochs
│ │ ├─ ckpt_epoch_10.pth # Model checkpoint of epoch 10
│ │ ├─ ckpt_epoch_15.pth # Model checkpoint of epoch 15
│ │ ├─ events.out.tfevents.1659878383... # Tensorflow artifacts of a specific run
│ │ └─ log_Aug02_13_10_52.txt # Trainer logs of a specific run
<ckpt_root_dir>
├── <experiment_name>
│ │
│ ├─── <run_dir>
│ │ ├─ ckpt_best.pth # Best performance during validation
│ │ ├─ ckpt_latest.pth # End of the most recent epoch
│ │ ├─ average_model.pth # Averaged over specified epochs
│ │ ├─ ckpt_epoch_*.pth # Checkpoints from specific epochs (like epoch 10, 15, etc.)
│ │ ├─ events.out.tfevents.* # Tensorboard run artifacts
│ │ └─ log_<timestamp>.txt # Trainer logs of the specific run
│ │
│ └─ RUN_20230803_121652_243212
│ └─── <other_run_dir>
│ └─ ...
└─── some_other_training_experiment_name
...
└─── <other_experiment_name>
├─── <run_dir>
│ └─ ...
└─── <another_run_dir>
└─ ...
```
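
Because every run now gets its own directory, the path to a checkpoint changes from run to run. The following is a minimal sketch of one way to locate a checkpoint in the most recent run. The helper name `find_latest_checkpoint` is hypothetical and not part of SuperGradients, and the sketch assumes run directories are named `RUN_<date>_<time>_<suffix>` as in the example above, so that sorting by name also sorts by start time.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(
    ckpt_root_dir: str,
    experiment_name: str,
    ckpt_name: str = "ckpt_best.pth",
) -> Optional[Path]:
    """Return `ckpt_name` from the most recent run that contains it, or None.

    Hypothetical helper: assumes run directories start with "RUN_" followed by
    a timestamp, so lexicographic order matches chronological order.
    """
    experiment_dir = Path(ckpt_root_dir) / experiment_name
    if not experiment_dir.is_dir():
        return None

    run_dirs = sorted(d for d in experiment_dir.glob("RUN_*") if d.is_dir())

    # Walk from the newest run to the oldest and stop at the first match.
    for run_dir in reversed(run_dirs):
        candidate = run_dir / ckpt_name
        if candidate.is_file():
            return candidate
    return None
```

The returned path can then be used wherever a checkpoint path is expected, for example as the `checkpoint_path` argument of `models.get(...)`.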

Suppose we wish to load the weights from `ckpt_best.pth`. We can simply pass its path to the `checkpoint_path` argument in `models.get(...)`:
@@ -129,6 +135,7 @@ from super_gradients.training.utils.checkpoint_utils import load_checkpoint_to_m
model = models.get(model_name=Models.RESNET18, num_classes=10)
load_checkpoint_to_model(net=model, ckpt_local_path="/path/to/my_checkpoints_folder/my_resnet18_training_experiment/RUN_20230802_131052_651906/ckpt_best.pth")
```

### Extending the Functionality of PyTorch's `strict` Parameter in `load_state_dict()`

If you are not familiar with PyTorch's `strict` parameter in `load_state_dict()`, please read [PyTorch's docs on this matter](https://pytorch.org/tutorials/beginner/saving_loading_models.html#id4) first.
72 changes: 47 additions & 25 deletions documentation/source/Example_Classification.md
@@ -1,4 +1,4 @@
# Training a classification model and transfer learning
# Training a Classification Model and Transfer Learning

In this example we will use SuperGradients to train a ResNet18 model from scratch on the CIFAR10 image classification
dataset. We will also fine-tune the same model via transfer learning with weights pre-trained on the ImageNet dataset.
@@ -14,41 +14,63 @@ pip install super-gradients

## 1. Experiment setup

First, we will initialize our trainer, which is a SuperGradients Trainer object.
First, we will initialize the `Trainer`. It handles:
- Model training
- Evaluating test data
- Making predictions
- Saving and managing checkpoints

```

To initialize it, you need:

- **Experiment Name:** A unique identifier for your training experiment.
- **Checkpoint Root Directory (`ckpt_root_dir`):** The directory where checkpoints, logs, and tensorboards are saved. This parameter is optional; if it is not specified, SuperGradients assumes that a 'checkpoints' directory exists in your project's root.

```python
from super_gradients import Trainer
```

The trainer is in charge of training the model, evaluating test data, making predictions, and saving checkpoints.
experiment_name = "resnet18_cifar10_example"
CHECKPOINT_DIR = '/path/to/checkpoints/root/dir'

trainer = Trainer(experiment_name=experiment_name, ckpt_root_dir=CHECKPOINT_DIR)
```

To initialize the trainer, an experiment name must be provided. We will also provide a checkpoints root directory via
the `ckpt_root_dir` parameter. In this directory, all the experiment's logs, tensorboards, and checkpoints directories
will reside. This parameter is optional, and if not provided, it is assumed that a 'checkpoints' directory exists in
the project's path.
### 2. Understanding the Checkpoint Structure

A directory with the experiment's name will be created as a subdirectory of `ckpt_root_dir` as follows:
Checkpoints are crucial for progressive training, debugging, and model deployment. SuperGradients organizes them in a structured manner. Here's what the directory hierarchy looks like under your specified `ckpt_root_dir`:

```
ckpt_root_dir
|─── experiment_name_1
│ ckpt_best.pth # Model checkpoint on best epoch
│ ckpt_latest.pth # Model checkpoint on last epoch
│ average_model.pth # Model checkpoint averaged over epochs
│ events.out.tfevents.1659878383... # Tensorflow artifacts of a specific run
│ log_Aug07_11_52_48.txt # Trainer logs of a specific run
└─── experiment_name_2
...
<ckpt_root_dir>
├── <experiment_name>
│ │
│ ├─── <run_dir>
│ │ ├─ ckpt_best.pth # Best performance during validation
│ │ ├─ ckpt_latest.pth # End of the most recent epoch
│ │ ├─ average_model.pth # Averaged over specified epochs
│ │ ├─ ckpt_epoch_*.pth # Checkpoints from specific epochs (like epoch 10, 15, etc.)
│ │ ├─ events.out.tfevents.* # Tensorboard run artifacts
│ │ └─ log_<timestamp>.txt # Trainer logs of the specific run
│ │
│ └─── <other_run_dir>
│ └─ ...
└─── <other_experiment_name>
├─── <run_dir>
│ └─ ...
└─── <another_run_dir>
└─ ...
```

We initialize the trainer as follows:
In this structure:

```
experiment_name = "resnet18_cifar10_example"
CHECKPOINT_DIR = '/path/to/checkpoints/root/dir'
- `ckpt_best.pth`: Saved whenever there's an improvement in the specified validation metric.
- `ckpt_latest.pth`: Updated at the end of every epoch.
- `average_model.pth`: Averaged checkpoint, created if the `average_best_models` parameter is set to `True`.
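
As a rough illustration of how these files relate to the training configuration, here is a sketch of the checkpoint-related entries of a training parameters dictionary. Only `average_best_models` is named in this document; the other keys (`metric_to_watch`, `greater_metric_to_watch_is_better`, `save_ckpt_epoch_list`) are written from memory of the SuperGradients training parameters and should be checked against the version you have installed. The values are placeholders.

```python
# Sketch only: checkpoint-related entries of a training_params dictionary.
# A real dictionary also needs the loss, optimizer, learning rate, and so on.
checkpoint_related_params = {
    "max_epochs": 20,
    # Assumed key: the validation metric that decides when ckpt_best.pth is updated.
    "metric_to_watch": "Accuracy",
    # Assumed key: whether a higher value of that metric counts as an improvement.
    "greater_metric_to_watch_is_better": True,
    # Named in this document: enables average_model.pth.
    "average_best_models": True,
    # Assumed key: epochs for which a ckpt_epoch_<N>.pth file is requested.
    "save_ckpt_epoch_list": [10, 15],
}
```

These entries would be merged into the full hyperparameters dictionary that is passed to `trainer.train(...)` as `training_params`.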

trainer = Trainer(experiment_name=experiment_name, ckpt_root_dir=CHECKPOINT_DIR)
```
> For more information, check out the [dedicated page](Checkpoints.md).

## 2. Dataset and dataloaders

83 changes: 58 additions & 25 deletions documentation/source/Example_Training-an-external-model.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,9 @@
# Training an external model

In this example we will use SuperGradients to train a deep learning segmentation model to extract human portraits from
images, i.e., to remove the background from the image. We will show how SuperGradients allows seamless integration of
In this example we will use SuperGradients to train a deep learning segmentation model to extract human portraits from
images, i.e., to remove the background from the image.

We will show how SuperGradients allows seamless integration of
an external model, dataset, loss function, and metric into the training pipeline.

## Quick installation
@@ -561,28 +563,18 @@ Our custom metric is now ready to use with our training pipeline.

## 5. Experiment configuration

We now have the implementation of all external components we wish to incorporate into our training
pipeline. Let's put it all together.
### Trainer
First, we will initialize the `Trainer`. It handles:
- Model training
- Evaluating test data
- Making predictions
- Saving and managing checkpoints

First, we will initialize our trainer, which is in charge of training the model, evaluating test data, making
predictions, and saving checkpoints. To initialize the trainer, we provide an experiment name, and a checkpoints root
directory via the `ckpt_root_dir` parameter. In this directory, all of the experiment's logs, tensorboards, and
checkpoint directories will reside. A directory with the experiment's name will be created as a subdirectory of
`ckpt_root_dir` as follows:

```
ckpt_root_dir
|─── experiment_name_1
│ ckpt_best.pth # Model checkpoint on best epoch
│ ckpt_latest.pth # Model checkpoint on last epoch
│ average_model.pth # Model checkpoint averaged over epochs
│ events.out.tfevents.1659878383... # Tensorflow artifacts of a specific run
│ log_Aug07_11_52_48.txt # Trainer logs of a specific run
└─── experiment_name_2
...
```
To initialize it, you need:

We initialize the trainer as follows:
- **Experiment Name:** A unique identifier for your training experiment.
- **Checkpoint Root Directory (`ckpt_root_dir`):** The directory where checkpoints, logs, and tensorboards are saved. This parameter is optional; if it is not specified, SuperGradients assumes that a 'checkpoints' directory exists in your project's root.

```python
from super_gradients import Trainer
@@ -593,6 +585,45 @@ CHECKPOINT_DIR = '/path/to/checkpoints/root/dir'
trainer = Trainer(experiment_name=experiment_name, ckpt_root_dir=CHECKPOINT_DIR)
```

### Understanding the Checkpoint Structure

Checkpoints are crucial for progressive training, debugging, and model deployment. SuperGradients organizes them in a structured manner. Here's what the directory hierarchy looks like under your specified `ckpt_root_dir`:

```
<ckpt_root_dir>
├── <experiment_name>
│ │
│ ├─── <run_dir>
│ │ ├─ ckpt_best.pth # Best performance during validation
│ │ ├─ ckpt_latest.pth # End of the most recent epoch
│ │ ├─ average_model.pth # Averaged over specified epochs
│ │ ├─ ckpt_epoch_*.pth # Checkpoints from specific epochs (like epoch 10, 15, etc.)
│ │ ├─ events.out.tfevents.* # Tensorboard run artifacts
│ │ └─ log_<timestamp>.txt # Trainer logs of the specific run
│ │
│ └─── <other_run_dir>
│ └─ ...
└─── <other_experiment_name>
├─── <run_dir>
│ └─ ...
└─── <another_run_dir>
└─ ...
```

In this structure:

- `ckpt_best.pth`: Saved whenever there's an improvement in the specified validation metric.
- `ckpt_latest.pth`: Updated at the end of every epoch.
- `average_model.pth`: Averaged checkpoint, created if the `average_best_models` parameter is set to `True`.

> For more information, check out the [dedicated page](Checkpoints.md).

### Dataloaders

Next, we initialize the PyTorch dataloaders for our datasets:

```python
@@ -602,6 +633,8 @@ train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_wo
val_dataloader = DataLoader(val_dataset, batch_size=16, shuffle=False, num_workers=2)
```

### Training Hyperparameters

And lastly, we need to define the training hyperparameters:

```python
@@ -635,9 +668,9 @@ The above code shows the simplicity of integrating external, user-defined compon
pipeline. We simply plugged instantiations of our custom loss and metric into the hyperparameters dictionary,
and we are ready to go.

## 5. Training
## 6. Training

### 5.A. Training the model
### 6.A. Training the model

We are all set to start training our model. Simply plug in the model, training and validation dataloaders,
and training parameters into the trainer's `train()` function:
@@ -696,7 +729,7 @@ SUMMARY OF EPOCH 5
At the end of each epoch, the different logs and checkpoints are saved in the path defined by `ckpt_root_dir` and
`experiment_name`. Let's see how we can use Tensorboard to track the training process.

### 5.B. Tensorboard logs
### 6.B. Tensorboard logs

To view the experiment's tensorboard logs, type the following command in the terminal from the
experiment's path:
@@ -719,7 +752,7 @@ We can also check the validation set's IoU metric's value:



## 6. Predictions with the trained model
## 7. Predictions with the trained model

Now that we have a trained model we can use it to make predictions on the test set. First, let's instantiate a test
dataset:
33 changes: 17 additions & 16 deletions documentation/source/logs.md
@@ -1,29 +1,31 @@
# Local Logging

SuperGradients automatically logs locally multiple files that can help you explore your experiments results. This includes 1 tensorboard and 3 .txt files.
SuperGradients automatically logs multiple files locally that can help you explore your experiment results.
This includes 1 tensorboard and 3 .txt files.

### Directory Structure Overview:
- **ckpt_root_dir**: The root directory where all experiments are stored.
- **experiment_name**: The specific folder dedicated to your current experiment.
- **run_dir**: Unique identifier for each training run; contains all associated checkpoints and logs.

> For a deeper dive into checkpoints, visit our [detailed guide](Checkpoints.md).

## I. Tensorboard logging
To easily keep track of your experiments, SuperGradients saves your results in `events.out.tfevents` format that can be used by tensorboard.

**What does it include?** This tensorboard includes all of your training and validation metrics but also other information such as learning rate, system metrics (CPU, GPU, ...), and more.

**Where is it saved?** `<ckpt_root_dir>/<experiment_name>/events.out.tfevents.<unique_id>`

**How to launch?**`tensorboard --logdir checkpoint_path/events.out.tfevents.<unique_id>`

**Where is it saved?** `<ckpt_root_dir>/<experiment_name>/<run_dir>/events.out.tfevents.<unique_id>`

**How to launch?** `tensorboard --logdir <ckpt_root_dir>/<experiment_name>/<run_dir>`
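
If you prefer to build the launch command from Python, the sketch below assembles it from the three path components. The helper `tensorboard_command` is hypothetical and not part of SuperGradients. Leaving out `run_dir` points Tensorboard at the whole experiment folder; since Tensorboard scans the subdirectories of `--logdir` for event files, this shows all runs of the experiment side by side.

```python
from pathlib import Path
from typing import List, Optional


def tensorboard_command(
    ckpt_root_dir: str,
    experiment_name: str,
    run_dir: Optional[str] = None,
) -> List[str]:
    """Build the `tensorboard --logdir ...` command as an argument list.

    Hypothetical helper: with `run_dir` set, the command targets one run;
    without it, the command targets every run of the experiment.
    """
    logdir = Path(ckpt_root_dir) / experiment_name
    if run_dir is not None:
        logdir = logdir / run_dir
    return ["tensorboard", "--logdir", str(logdir)]
```

The resulting list can be passed to `subprocess.run(...)`, provided the `tensorboard` executable is available on your `PATH`.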

## II. Experiment logging
In case you cannot launch a tensorboard instance, you can still find a summary of your experiment saved in a readable .txt format.

**What does it include?** The experiment configuration and training/validation metrics.

**Where is it saved?** `<ckpt_root_dir>/<experiment_name>/experiment_logs_<date>.txt`



**Where is it saved?** `<ckpt_root_dir>/<experiment_name>/<run_dir>/experiment_logs_<date>.txt`

## III. Console logging
For better debugging and understanding of past runs, SuperGradients gathers all the print statements and logs into a
@@ -33,7 +35,7 @@ local file, providing you the convenience to review console outputs of any exper

**Where is it saved?**
- Upon importing SuperGradients, console outputs and logs will be stored in `~/sg_logs/console.log`.
- When instantiating the super_gradients.Trainer, all console outputs and logs will be redirected to the experiment folder `<ckpt_root_dir>/<experiment_name>/console_<date>.txt`.
- When instantiating the `super_gradients.Trainer`, all console outputs and logs will be redirected to the experiment folder `<ckpt_root_dir>/<experiment_name>/<run_dir>/console_<date>.txt`.

**How to set log level?** You can filter the logs displayed on the console by setting the environment variable `CONSOLE_LOG_LEVEL=<LOG-LEVEL> # DEBUG/INFO/WARNING/ERROR`

@@ -45,7 +47,7 @@ This means that it includes any log that was under the logging level (`logging.D

**What does it include?** Anything logged with a logger (`logger.log`, `logger.info`, ...), even the filtered logs.

**Where is it saved?** `<ckpt_root_dir>/<experiment_name>/logs_<date>.txt`
**Where is it saved?** `<ckpt_root_dir>/<experiment_name>/<run_dir>/logs_<date>.txt`

**How to set log level?** You can filter the logs saved in the file by setting the environment variable `FILE_LOG_LEVEL=<LOG-LEVEL> # DEBUG/INFO/WARNING/ERROR`
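
Both log levels can also be set from within a Python script rather than in the shell. The sketch below assumes that SuperGradients reads `CONSOLE_LOG_LEVEL` and `FILE_LOG_LEVEL` when it is first imported, which is why the variables are set before the import; if your version reads them at a different point, set them in the shell instead.

```python
import os

# Assumption: both variables are read when super_gradients is first imported,
# so they must be set before that import runs.
os.environ["CONSOLE_LOG_LEVEL"] = "WARNING"  # show only warnings and errors on the console
os.environ["FILE_LOG_LEVEL"] = "DEBUG"       # keep full detail in the log file

# Import SuperGradients only after the variables are in place, for example:
# from super_gradients import Trainer
```

Keeping the console at `WARNING` while the file stays at `DEBUG` gives a quiet terminal without losing detail in the saved logs.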

@@ -54,9 +56,8 @@ This means that it includes any log that was under the logging level (`logging.D
Only when training using a Hydra recipe.

**What does it include?**
```
<ckpt_root_dir>/<experiment_name>
├─ ...
```
<ckpt_root_dir>/<experiment_name>/<run_dir>/
└─ .hydra
├─config.yaml # A single config file that regroups the config files used to run the experiment
├─hydra.yaml # Some Hydra metadata
@@ -65,8 +66,8 @@ Only when training using hydra recipe.


## SUMMARY
```
<ckpt_root_dir>/<experiment_name>
```
<ckpt_root_dir>/<experiment_name>/<run_dir>/
├─ ... (all the model checkpoints)
├─ events.out.tfevents.<unique_id> # Tensorboard artifact
├─ experiment_logs_<date>.txt # Config and metrics related to experiment