Multi gpu PyTorch autolog - creates multiple runs (#5837)
* Adding rank zero decorator to patched_fit method - to log model only once in multi gpu environment

Signed-off-by: Shrinath Suresh <[email protected]>

* Revert "Adding rank zero decorator to patched_fit method - to log model only once in multi gpu environment"

This reverts commit 97ca87e.

Signed-off-by: Shrinath Suresh <[email protected]>

* Updating MNIST example to autolog only once during multi gpu training

Signed-off-by: Shrinath Suresh <[email protected]>

* Adding comments for rank 0 fix and updating main document for pytorch flavor

Signed-off-by: Shrinath Suresh <[email protected]>

* Bert logging fix

Signed-off-by: Shrinath Suresh <[email protected]>

* Addressing review comments

Signed-off-by: Shrinath Suresh <[email protected]>

* Adding new line to the end of the file

Signed-off-by: Shrinath Suresh <[email protected]>

* Lint fixes

Signed-off-by: Shrinath Suresh <[email protected]>

* MNIST protobuf version set to <=3.20.1

Signed-off-by: Shrinath Suresh <[email protected]>

* Pinning protobuf version to <=3.20.1 for all the examples

Signed-off-by: Shrinath Suresh <[email protected]>

* Revert "Pinning protobuf version to <=3.20.1 for all the examples"

This reverts commit 56793f9.

Signed-off-by: Shrinath Suresh <[email protected]>

* Revert "MNIST protobuf version set to <=3.20.1"

This reverts commit 4329992.

Signed-off-by: Shrinath Suresh <[email protected]>
shrinath-suresh authored Jun 7, 2022
1 parent 7ff2452 commit 816e035
Showing 3 changed files with 43 additions and 4 deletions.
4 changes: 4 additions & 0 deletions docs/source/models.rst
@@ -659,6 +659,10 @@ produced by :py:func:`mlflow.pytorch.save_model()` and :py:func:`mlflow.pytorch.
the ``python_function`` flavor, allowing you to load them as generic Python functions for inference
via :py:func:`mlflow.pyfunc.load_model()`.

.. note::
    In multi-GPU training, save the model only from the process with global rank 0. This avoids
    logging multiple copies of the same model.

For more information, see :py:mod:`mlflow.pytorch`.
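
The note above can be illustrated with a small guard around the save call. This is a minimal sketch, not MLflow's or PyTorch Lightning's implementation: the decorator name `only_on_rank_zero` and the `save_model` wrapper are hypothetical, and it assumes the launcher exports the global rank in the `RANK` environment variable (as `torchrun` does).

```python
import functools
import os


def only_on_rank_zero(fn):
    """Run fn only in the process whose global rank is 0; return None elsewhere."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Assumption: the launcher exports RANK. A missing variable is treated
        # as a single-process run, which is rank 0 by definition.
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
        return None

    return wrapper


@only_on_rank_zero
def save_model(model, path):
    # Hypothetical stand-in for a real call such as
    # mlflow.pytorch.save_model(model, path).
    return f"saved {model} to {path}"
```

With this guard, every process can call `save_model(...)` unconditionally, and only the rank 0 process performs the save.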

Scikit-learn (``sklearn``)
21 changes: 19 additions & 2 deletions examples/pytorch/BertNewsClassification/bert_classification.py
@@ -6,6 +6,7 @@
import os
from argparse import ArgumentParser

import logging
import numpy as np
import pandas as pd
import pytorch_lightning as pl
@@ -424,8 +425,6 @@ def configure_optimizers(self):
parser = BertNewsClassifier.add_model_specific_args(parent_parser=parser)
parser = BertDataModule.add_model_specific_args(parent_parser=parser)

mlflow.pytorch.autolog()

args = parser.parse_args()
dict_args = vars(args)

@@ -455,5 +454,23 @@ def configure_optimizers(self):
enable_checkpointing=True,
)

# It is safe to use `mlflow.pytorch.autolog` in DDP training, because the condition below
# enables autolog only in the global rank 0 process.

# For CPU training
if dict_args["gpus"] is None or int(dict_args["gpus"]) == 0:
    mlflow.pytorch.autolog()
elif int(dict_args["gpus"]) >= 1 and trainer.global_rank == 0:
    # In multi-GPU training the training script is invoked once per process.
    # This condition avoids creating multiple copies of the MLflow run.
    # When one or more GPUs are used for training, it is enough to log
    # the model and its parameters from the rank 0 process.
    mlflow.pytorch.autolog()
else:
    # This branch is reached only in multi-GPU training when the global rank is non-zero.
    # Since the parameters are already logged by the global rank 0 process, nothing
    # needs to be logged here.
    logging.info("Active run exists.. ")

trainer.fit(model, dm)
trainer.test(model, datamodule=dm)
22 changes: 20 additions & 2 deletions examples/pytorch/MNIST/mnist_autolog_example.py
@@ -10,6 +10,7 @@
# pylint: disable=abstract-method
import pytorch_lightning as pl
import mlflow.pytorch
import logging
import os
import torch
from argparse import ArgumentParser
@@ -274,8 +275,6 @@ def configure_optimizers(self):
parser = pl.Trainer.add_argparse_args(parent_parser=parser)
parser = LightningMNISTClassifier.add_model_specific_args(parent_parser=parser)

mlflow.pytorch.autolog()

args = parser.parse_args()
dict_args = vars(args)

@@ -303,5 +302,24 @@ def configure_optimizers(self):
trainer = pl.Trainer.from_argparse_args(
args, callbacks=[lr_logger, early_stopping, checkpoint_callback], checkpoint_callback=True
)

# It is safe to use `mlflow.pytorch.autolog` in DDP training, because the condition below
# enables autolog only in the global rank 0 process.

# For CPU training
if dict_args["gpus"] is None or int(dict_args["gpus"]) == 0:
    mlflow.pytorch.autolog()
elif int(dict_args["gpus"]) >= 1 and trainer.global_rank == 0:
    # In multi-GPU training the training script is invoked once per process.
    # This condition avoids creating multiple copies of the MLflow run.
    # When one or more GPUs are used for training, it is enough to log
    # the model and its parameters from the rank 0 process.
    mlflow.pytorch.autolog()
else:
    # This branch is reached only in multi-GPU training when the global rank is non-zero.
    # Since the parameters are already logged by the global rank 0 process, nothing
    # needs to be logged here.
    logging.info("Active run exists.. ")

trainer.fit(model, dm)
trainer.test(datamodule=dm)
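
Both examples repeat the same three-way branch, so the decision can be read as a single predicate. The sketch below is one possible consolidation, not part of this commit: the helper name `should_enable_autolog` is hypothetical. Like the code in the diff, it assumes `gpus` is either `None` or something `int()` accepts, so a device list such as `"0,1"` would raise `ValueError`.

```python
from typing import Optional, Union


def should_enable_autolog(gpus: Optional[Union[int, str]], global_rank: int) -> bool:
    """Return True when this process should call mlflow.pytorch.autolog()."""
    num_gpus = 0 if gpus is None else int(gpus)
    if num_gpus == 0:
        # CPU training: treated as a single logging process, as in the diff.
        return True
    # GPU training: only the global rank 0 process enables autologging.
    return global_rank == 0
```

In either example script the branch would then reduce to checking `should_enable_autolog(dict_args["gpus"], trainer.global_rank)` before calling `mlflow.pytorch.autolog()`.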
