Multi gpu PyTorch autolog - creates multiple runs #5837

Merged: 13 commits into mlflow:master on Jun 7, 2022

Conversation

@shrinath-suresh (Contributor) commented May 9, 2022

What changes are proposed in this pull request?

Fix #5817.

When training in a multi-GPU environment, PyTorch autologging logs the model across multiple runs. The root cause of this problem: when autolog is invoked in a multi-GPU environment, the autolog script runs once per GPU process, which in turn creates multiple runs.

Following are the possible solutions I could think of:

  1. Wrapping the patched_fit method with the rank_zero_only decorator (the changes in this PR)

Pros: The model will be logged only once.
Cons: Multiple empty runs will still be created during multi-GPU training.

  2. Wrapping the autolog method with the rank_zero_only decorator - click here to view the changes

Pros: Only one run will be created in multi-GPU training - this solves the root-cause problem.
Cons: Creates a dependency on the pytorch-lightning import in the mlflow pytorch library - we need to evaluate whether it will cause any import error (as of today, pytorch-lightning is imported only inside the autolog method).

  3. Updating the examples to invoke autolog only on the rank-zero GPU (see the sketch after this list) - click here to view the changes

Pros: No library update needed.
Cons: mlflow.pytorch.autolog() can be invoked only after instantiating the trainer object (the trainer object is needed to check the global rank).
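A minimal sketch of option 3, assuming a PyTorch Lightning workflow; the train function, model, and datamodule names are placeholders, not from this PR:

    import mlflow
    import pytorch_lightning as pl

    def train(model: pl.LightningModule, datamodule: pl.LightningDataModule):
        trainer = pl.Trainer(gpus=2, strategy="ddp", max_epochs=5)
        # Each GPU process executes this script, so enable autologging only
        # on the rank-zero process to end up with a single MLflow run.
        if trainer.global_rank == 0:
            mlflow.pytorch.autolog()
        trainer.fit(model, datamodule=datamodule)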

MLflow team and reviewers, please let us know your opinion.

How is this patch tested?

Autologging tests and existing unit tests

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the next step; otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Adding rank zero decorator to patched_fit method - to log model only once in multi gpu environment

Signed-off-by: Shrinath Suresh <[email protected]>
@github-actions bot added the rn/bug-fix (Mention under Bug Fixes in Changelogs), area/artifacts (Artifact stores and artifact logging), and rn/none (List under Small Changes in Changelogs) labels, and removed the rn/bug-fix label - May 9, 2022
@shrinath-suresh shrinath-suresh changed the title [WIP] PyTorch autolog - multi gpu [WIP] Multi gpu PyTorch autolog - creates multiple runs May 10, 2022
@BenWilson2 (Member) commented:

Since a viable solution exists that neither introduces a core dependency change (as in option 2) nor creates a different sort of issue (as in option 1), providing an update to the examples (with appropriate notes explaining why that block is there) and a documentation update on how to configure autologging functionality for pytorch is probably a good idea.

If you wrap the block for this change in a conditional based on torch.cuda.device_count() and only trigger the alternative logic if the count > 1 (and, of course, the training is in GPU mode, i.e., cuda.is_available()), that should be pretty clear for people who are using multi-gpu pytorch.

Are you willing to update the examples and the documentation on pytorch?
Also, thank you for the thorough investigation into alternatives! It's very appreciated.
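A minimal sketch of the suggested guard; the configure_autolog helper name is hypothetical, since the comment above only describes the conditional:

    import mlflow
    import torch

    def configure_autolog(trainer):  # hypothetical helper, not from the PR
        if torch.cuda.is_available() and torch.cuda.device_count() > 1:
            # Multi-GPU: every GPU process executes the training script, so
            # restrict autologging to rank zero to avoid duplicate runs.
            if trainer.global_rank == 0:
                mlflow.pytorch.autolog()
        else:
            # CPU or single-GPU training runs in a single process.
            mlflow.pytorch.autolog()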

Revert "Adding rank zero decorator to patched_fit method - to log model only once in multi gpu environment"

This reverts commit 97ca87e.

Signed-off-by: Shrinath Suresh <[email protected]>
@shrinath-suresh (Contributor, Author) commented May 16, 2022

@BenWilson2 Thanks for your comments. I have updated the MNIST example with the conditions for CPU and multi-GPU training. Please let me know your thoughts. I will do the same for all the other examples if the changes look fine.

@BenWilson2 (Member) commented:

Added some notes to your MNIST example. Please feel free to file a PR with the changes there and in the other locations (including the main docs)!
Thanks!

if dict_args["gpus"] is None or int(dict_args["gpus"]) == 0:
mlflow.pytorch.autolog()
elif int(dict_args["gpus"]) >= 1 and trainer.global_rank == 0:
# To avoid duplication of mlflow runs when the model
# is trained using multiple gpus
# In case of multi gpu training, the training script in invoked multiple times,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
# In case of multi gpu training, the training script in invoked multiple times,
# In case of multi gpu training, the training script is invoked multiple times,

@shrinath-suresh (Contributor, Author): done

        mlflow.pytorch.autolog()
    else:
        # This condition is met only for multi-gpu training when the global rank is non zero.
        # Since, the parameters are already logged using global rank 0 gpu, it is safe to ignore

@BenWilson2 (Member): nit

Suggested change
-       # Since, the parameters are already logged using global rank 0 gpu, it is safe to ignore
+       # Since the parameters are already logged using global rank 0 gpu, it is safe to ignore

@shrinath-suresh (Contributor, Author): done
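Putting the two hunks together, the example's guard reads roughly as below (a sketch with the nits applied; the setup_autolog wrapper is hypothetical, and dict_args and trainer are assumed to come from the example's argparse setup and its pytorch_lightning.Trainer):

    import mlflow

    def setup_autolog(dict_args, trainer):  # hypothetical wrapper, not from the PR
        if dict_args["gpus"] is None or int(dict_args["gpus"]) == 0:
            # CPU training: a single process, so autolog unconditionally.
            mlflow.pytorch.autolog()
        elif int(dict_args["gpus"]) >= 1 and trainer.global_rank == 0:
            # Multi-gpu training invokes the script once per GPU; autolog
            # only from the rank-zero process to avoid duplicate runs.
            mlflow.pytorch.autolog()
        else:
            # Non-zero ranks: parameters are already logged by rank 0.
            pass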

@BenWilson2 (Member) left a comment:

A few nits to correct. Please be sure to merge master into your branch to get the unit tests passing (the fix for the protobuf failures in the example tests is in master now), and be sure to run black on the examples.
It's looking good, and it is very clear to users how to prevent this unexpected behavior!

@shrinath-suresh shrinath-suresh changed the title [WIP] Multi gpu PyTorch autolog - creates multiple runs Multi gpu PyTorch autolog - creates multiple runs Jun 2, 2022
@BenWilson2 (Member) left a comment:

LGTM! Thank you for the contribution!

@BenWilson2 BenWilson2 merged commit 816e035 into mlflow:master Jun 7, 2022
drsantos89 pushed a commit to drsantos89/mlflow that referenced this pull request Jun 9, 2022

* Adding rank zero decorator to patched_fit method - to log model only once in multi gpu environment
* Revert "Adding rank zero decorator to patched_fit method - to log model only once in multi gpu environment" (reverts commit 97ca87e)
* Updating MNIST example to autolog only once during multi gpu training
* Adding comments for rank 0 fix and updating main document for pytorch flavor
* Bert logging fix
* Addressing review comments
* Adding new line to the end of the file
* Lint fixes
* MNIST protobuf version set to <=3.20.1
* Pinning protobuf version to <=3.20.1 for all the examples
* Revert "Pinning protobuf version to <=3.20.1 for all the examples" (reverts commit 56793f9)
* Revert "MNIST protobuf version set to <=3.20.1" (reverts commit 4329992)

Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Diogo Santos <[email protected]>
drsantos89 added a commit to drsantos89/mlflow that referenced this pull request Jun 10, 2022
drsantos89 pushed a commit to drsantos89/mlflow that referenced this pull request Jun 10, 2022 (same commit list as above)
drsantos89 added a commit to drsantos89/mlflow that referenced this pull request Jun 10, 2022
drsantos89 pushed a commit to drsantos89/mlflow that referenced this pull request Jun 13, 2022 (same commit list as above)
drsantos89 added a commit to drsantos89/mlflow that referenced this pull request Jun 13, 2022
drsantos89 pushed a commit to drsantos89/mlflow that referenced this pull request Jun 14, 2022 (same commit list as above)
drsantos89 added a commit to drsantos89/mlflow that referenced this pull request Jun 14, 2022
drsantos89 pushed a commit to drsantos89/mlflow that referenced this pull request Jun 14, 2022 (same commit list as above)
drsantos89 added a commit to drsantos89/mlflow that referenced this pull request Jun 14, 2022
harupy added a commit that referenced this pull request Jun 14, 2022 (a squashed commit that cherry-picks and then reverts #6032, #5837, #6038, and #6037, among other changes)
@hikushalhere commented Jul 8, 2022

Hello @shrinath-suresh, I am confused about how this change works. The DDP process group is initialized when PL's Trainer.fit() method is called (see here). Calling trainer.global_rank or torch.distributed.get_rank() before calling trainer.fit() throws the exception pasted below. What am I missing?

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
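For context, a minimal sketch of the guard torch.distributed itself requires before the rank can be queried (this is background, not part of the PR):

    import torch.distributed as dist

    # Querying the rank is only safe once the process group exists;
    # calling dist.get_rank() earlier raises the RuntimeError above.
    if dist.is_available() and dist.is_initialized():
        rank = dist.get_rank()
    else:
        rank = 0  # before init_process_group has run, fall back to 0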

postrational pushed a commit to postrational/mlflow that referenced this pull request Jul 27, 2022 (same commit list as above, also signed off by Michal Karzynski)
postrational pushed a commit to postrational/mlflow that referenced this pull request Jul 27, 2022 (the same squashed sync commit as harupy's above)
Labels

area/artifacts: Artifact stores and artifact logging
rn/none: List under Small Changes in Changelogs.
Development

Successfully merging this pull request may close these issues:

Follow-up Question - Mlflow creates multiple logs for distributed training with multiple workers