Multi gpu PyTorch autolog - creates multiple runs #5837

Merged: 13 commits into mlflow:master on Jun 7, 2022

Conversation

@shrinath-suresh (Contributor) commented May 9, 2022

What changes are proposed in this pull request?

Fix #5817.

When training in a multi-GPU environment, PyTorch autologging logs the model across multiple runs. The root cause of this problem: when autolog is invoked in a multi-GPU environment, the autolog script runs once per GPU process, which in turn creates multiple runs.

Following are the possible solutions I could think of:

  1. Wrapping the patched_fit method with the rank_zero_only decorator (the changes in this PR)

Pros: The model will be logged only once.
Cons: Multiple empty runs will still be created during multi-GPU training.

  2. Wrapping the autolog method with the rank_zero_only decorator - click here to view the changes

Pros: Only one run will be created in multi-GPU training - this solves the root-cause problem.
Cons: Creates a dependency on the pytorch-lightning import in the mlflow pytorch library - we need to evaluate whether it will cause any import error (as of today, pytorch-lightning is imported only inside the autolog method).

  3. Updating the examples to invoke autolog only on the rank-zero GPU (see the sketch after this list) - click here to view the changes

Pros: No library update needed.
Cons: mlflow.pytorch.autolog() can be invoked only after instantiating the trainer object (the trainer object is needed to check the global rank).
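A minimal sketch of option 3, assuming a PyTorch Lightning workflow; the train function, model, and datamodule names are placeholders, not from this PR:

    import mlflow
    import pytorch_lightning as pl

    def train(model: pl.LightningModule, datamodule: pl.LightningDataModule):
        trainer = pl.Trainer(gpus=2, strategy="ddp", max_epochs=5)
        # Each GPU process executes this script, so enable autologging only
        # on the rank-zero process to end up with a single MLflow run.
        if trainer.global_rank == 0:
            mlflow.pytorch.autolog()
        trainer.fit(model, datamodule=datamodule)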

MLflow team and reviewers, please let us know your opinion.

How is this patch tested?

Autologging tests and existing unit tests

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the next step; otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Adding rank zero decorator to patched_fit method - to log model only once in multi gpu environment

Signed-off-by: Shrinath Suresh <[email protected]>
@github-actions bot added the rn/bug-fix (Mention under Bug Fixes in Changelogs), area/artifacts (Artifact stores and artifact logging), and rn/none (List under Small Changes in Changelogs) labels, and removed the rn/bug-fix label - May 9, 2022
@shrinath-suresh shrinath-suresh changed the title [WIP] PyTorch autolog - multi gpu [WIP] Multi gpu PyTorch autolog - creates multiple runs May 10, 2022
@BenWilson2 (Member) commented:

Since a viable solution exists that neither introduces a core dependency change (as in option 2) nor creates a different sort of issue (as in option 1), providing an update to the examples (with appropriate notes explaining why that block is there) and a documentation update on how to configure autologging functionality for pytorch is probably a good idea.

If you wrap the block for this change in a conditional based on torch.cuda.device_count() and only trigger the alternative logic if the count > 1 (and, of course, the training is in GPU mode, i.e., cuda.is_available()), that should be pretty clear for people who are using multi-gpu pytorch.

Are you willing to update the examples and the documentation on pytorch?
Also, thank you for the thorough investigation into alternatives! It's very appreciated.
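A minimal sketch of the suggested guard; the configure_autolog helper name is hypothetical, since the comment above only describes the conditional:

    import mlflow
    import torch

    def configure_autolog(trainer):  # hypothetical helper, not from the PR
        if torch.cuda.is_available() and torch.cuda.device_count() > 1:
            # Multi-GPU: every GPU process executes the training script, so
            # restrict autologging to rank zero to avoid duplicate runs.
            if trainer.global_rank == 0:
                mlflow.pytorch.autolog()
        else:
            # CPU or single-GPU training runs in a single process.
            mlflow.pytorch.autolog()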

Revert "Adding rank zero decorator to patched_fit method - to log model only once in multi gpu environment"

This reverts commit 97ca87e.

Signed-off-by: Shrinath Suresh <[email protected]>
@shrinath-suresh (Contributor, Author) commented May 16, 2022

@BenWilson2 Thanks for your comments. I have updated the MNIST example with the conditions for CPU and multi-GPU training. Please let me know your thoughts. I will do the same for all the other examples if the changes look fine.

@BenWilson2 (Member) commented:

Added some notes to your MNIST example. Please feel free to file a PR with the changes there and in the other locations (including the main docs)!
Thanks!

if dict_args["gpus"] is None or int(dict_args["gpus"]) == 0:
mlflow.pytorch.autolog()
elif int(dict_args["gpus"]) >= 1 and trainer.global_rank == 0:
# To avoid duplication of mlflow runs when the model
# is trained using multiple gpus
# In case of multi gpu training, the training script in invoked multiple times,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
# In case of multi gpu training, the training script in invoked multiple times,
# In case of multi gpu training, the training script is invoked multiple times,

@shrinath-suresh (Contributor, Author): done

        mlflow.pytorch.autolog()
    else:
        # This condition is met only for multi-gpu training when the global rank is non zero.
        # Since, the parameters are already logged using global rank 0 gpu, it is safe to ignore

@BenWilson2 (Member): nit

Suggested change
-       # Since, the parameters are already logged using global rank 0 gpu, it is safe to ignore
+       # Since the parameters are already logged using global rank 0 gpu, it is safe to ignore

@shrinath-suresh (Contributor, Author): done
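Putting the two hunks together, the example's guard reads roughly as below (a sketch with the nits applied; the setup_autolog wrapper is hypothetical, and dict_args and trainer are assumed to come from the example's argparse setup and its pytorch_lightning.Trainer):

    import mlflow

    def setup_autolog(dict_args, trainer):  # hypothetical wrapper, not from the PR
        if dict_args["gpus"] is None or int(dict_args["gpus"]) == 0:
            # CPU training: a single process, so autolog unconditionally.
            mlflow.pytorch.autolog()
        elif int(dict_args["gpus"]) >= 1 and trainer.global_rank == 0:
            # Multi-gpu training invokes the script once per GPU; autolog
            # only from the rank-zero process to avoid duplicate runs.
            mlflow.pytorch.autolog()
        else:
            # Non-zero ranks: parameters are already logged by rank 0.
            pass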

@BenWilson2 (Member) left a comment:

A few nits to correct. Please be sure to merge master into your branch to get the unit tests passing (the fix for the protobuf failures in the example tests is in master now), and be sure to run black on the examples.
It's looking good, and it is very clear to users how to prevent this unexpected behavior!

@shrinath-suresh shrinath-suresh changed the title [WIP] Multi gpu PyTorch autolog - creates multiple runs Multi gpu PyTorch autolog - creates multiple runs Jun 2, 2022
@BenWilson2 (Member) left a comment:

LGTM! Thank you for the contribution!

@BenWilson2 BenWilson2 merged commit 816e035 into mlflow:master Jun 7, 2022
drsantos89 pushed a commit to drsantos89/mlflow that referenced this pull request Jun 9, 2022

* Adding rank zero decorator to patched_fit method - to log model only once in multi gpu environment
* Revert "Adding rank zero decorator to patched_fit method - to log model only once in multi gpu environment" (reverts commit 97ca87e)
* Updating MNIST example to autolog only once during multi gpu training
* Adding comments for rank 0 fix and updating main document for pytorch flavor
* Bert logging fix
* Addressing review comments
* Adding new line to the end of the file
* Lint fixes
* MNIST protobuf version set to <=3.20.1
* Pinning protobuf version to <=3.20.1 for all the examples
* Revert "Pinning protobuf version to <=3.20.1 for all the examples" (reverts commit 56793f9)
* Revert "MNIST protobuf version set to <=3.20.1" (reverts commit 4329992)

Signed-off-by: Shrinath Suresh <[email protected]>
Signed-off-by: Diogo Santos <[email protected]>
drsantos89 added a commit to drsantos89/mlflow that referenced this pull request Jun 10, 2022
drsantos89 pushed a commit to drsantos89/mlflow that referenced this pull request Jun 10, 2022 (same commit list as above)
drsantos89 added a commit to drsantos89/mlflow that referenced this pull request Jun 10, 2022
drsantos89 pushed a commit to drsantos89/mlflow that referenced this pull request Jun 13, 2022 (same commit list as above)
drsantos89 added a commit to drsantos89/mlflow that referenced this pull request Jun 13, 2022
drsantos89 pushed a commit to drsantos89/mlflow that referenced this pull request Jun 14, 2022 (same commit list as above)
drsantos89 added a commit to drsantos89/mlflow that referenced this pull request Jun 14, 2022
drsantos89 pushed a commit to drsantos89/mlflow that referenced this pull request Jun 14, 2022 (same commit list as above)
drsantos89 added a commit to drsantos89/mlflow that referenced this pull request Jun 14, 2022
harupy added a commit that referenced this pull request Jun 14, 2022 (a squashed commit that cherry-picks and then reverts #6032, #5837, #6038, and #6037, among other changes)
@hikushalhere commented Jul 8, 2022

Hello @shrinath-suresh, I am confused about how this change works. The DDP process group is initialized when PL's Trainer.fit() method is called (see here). Calling trainer.global_rank or torch.distributed.get_rank() before calling trainer.fit() throws the exception pasted below. What am I missing?

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
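For context, a minimal sketch of the guard torch.distributed itself requires before the rank can be queried (this is background, not part of the PR):

    import torch.distributed as dist

    # Querying the rank is only safe once the process group exists;
    # calling dist.get_rank() earlier raises the RuntimeError above.
    if dist.is_available() and dist.is_initialized():
        rank = dist.get_rank()
    else:
        rank = 0  # before init_process_group has run, fall back to 0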

postrational pushed a commit to postrational/mlflow that referenced this pull request Jul 27, 2022 (same commit list as above, also signed off by Michal Karzynski)
postrational pushed a commit to postrational/mlflow that referenced this pull request Jul 27, 2022 (the same squashed sync commit as harupy's above)
Labels

area/artifacts: Artifact stores and artifact logging
rn/none: List under Small Changes in Changelogs.
Development

Successfully merging this pull request may close these issues:

Follow-up Question - Mlflow creates multiple logs for distributed training with multiple workers