-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Add torch elastic task type #1583
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1583 +/- ##
==========================================
+ Coverage 69.93% 70.12% +0.18%
==========================================
Files 319 328 +9
Lines 29569 29973 +404
Branches 5332 5408 +76
==========================================
+ Hits 20680 21018 +338
- Misses 8370 8419 +49
- Partials 519 536 +17
... and 7 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
@@ -13,8 +13,7 @@ | |||
|
|||
@dataclass | |||
class Elastic(object): | |||
min_replicas: int = 1 | |||
max_replicas: int = 1 | |||
nnodes: typing.Union[int, str] = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use the way torchrun exposes this to the user, e.g. --nnodes 2:3
bcb423f
to
e647383
Compare
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Ketan Umare <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Ketan Umare <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* Unify sqlalchemy Dockerfiles Signed-off-by: eduardo apolinario <[email protected]> * Simplifyte sqlalchemy Dockerfile Signed-off-by: eduardo apolinario <[email protected]> * Set pythonpath to /root Signed-off-by: eduardo apolinario <[email protected]> --------- Signed-off-by: eduardo apolinario <[email protected]> Co-authored-by: eduardo apolinario <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Ketan Umare <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * pyflyte build Signed-off-by: Kevin Su <[email protected]> * nit Signed-off-by: Kevin Su <[email protected]> * nit Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * Support serialize and package Signed-off-by: Kevin Su <[email protected]> * more tests Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * move to plugin Signed-off-by: Kevin Su <[email protected]> * test Signed-off-by: Kevin Su <[email protected]> * nit Signed-off-by: Kevin Su <[email protected]> * test Signed-off-by: Kevin Su <[email protected]> * test Signed-off-by: Kevin Su <[email protected]> * test Signed-off-by: Kevin Su <[email protected]> * test Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * wip Signed-off-by: Kevin Su <[email protected]> * nit Signed-off-by: Kevin Su <[email protected]> * fixed tested Signed-off-by: Kevin Su <[email protected]> * fixed tested Signed-off-by: Kevin Su <[email protected]> * more tests Signed-off-by: Kevin Su <[email protected]> * Add support passing yaml in pyflyte run Signed-off-by: Kevin Su <[email protected]> * lint Signed-off-by: Kevin Su <[email protected]> * lint Signed-off-by: Kevin Su <[email protected]> * nit Signed-off-by: Kevin Su <[email protected]> * nit Signed-off-by: Kevin Su <[email protected]> --------- Signed-off-by: Kevin Su <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
…ror messages (#1582) * improve input type conversion error Signed-off-by: Niels Bantilan <[email protected]> * fix lint Signed-off-by: Niels Bantilan <[email protected]> * fix Signed-off-by: Niels Bantilan <[email protected]> * add tests Signed-off-by: Niels Bantilan <[email protected]> * add tests Signed-off-by: Niels Bantilan <[email protected]> * add rich Signed-off-by: Niels Bantilan <[email protected]> * fix lint Signed-off-by: Niels Bantilan <[email protected]> * remove prototyping script, update loggers Signed-off-by: Niels Bantilan <[email protected]> * update __init__.py Signed-off-by: Niels Bantilan <[email protected]> * update logger Signed-off-by: Niels Bantilan <[email protected]> * update logger Signed-off-by: Niels Bantilan <[email protected]> * fix GE and pandera tests Signed-off-by: Niels Bantilan <[email protected]> * fix lint Signed-off-by: Niels Bantilan <[email protected]> --------- Signed-off-by: Niels Bantilan <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
* Add support nested FlyteFile in pyflyte run Signed-off-by: Kevin Su <[email protected]> * nit Signed-off-by: Kevin Su <[email protected]> * nit Signed-off-by: Kevin Su <[email protected]> --------- Signed-off-by: Kevin Su <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
* Set cache_regions=True in the case of s3fs * Linting Signed-off-by: eduardo apolinario <[email protected]> * Fix tests Signed-off-by: eduardo apolinario <[email protected]> * More linting Signed-off-by: eduardo apolinario <[email protected]> --------- Signed-off-by: eduardo apolinario <[email protected]> Co-authored-by: eduardo apolinario <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
* set default ComSpec when running on Windows Signed-off-by: Kevin Su <[email protected]> * lint Signed-off-by: Kevin Su <[email protected]> * nit Signed-off-by: Kevin Su <[email protected]> --------- Signed-off-by: Kevin Su <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
* add activate-launchplan command to pyflyte Signed-off-by: Artem Petrov <[email protected]> * fix linting Signed-off-by: Artem Petrov <[email protected]> * rework launchplan command Signed-off-by: Artem Petrov <[email protected]> * fix linting Signed-off-by: Artem Petrov <[email protected]> --------- Signed-off-by: Artem Petrov <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
* checkpoint Signed-off-by: Mike Zhong <[email protected]> * Experimental implementation works well. Instead of messing with the class, we utilize the interpolation and context to dynamically generate a wrapper around the desired script. In the wrapper, we cd to the ctx.working_directory, export the env variables (handled by an additional method to convert dict to str), and then pass the arguments to the script Signed-off-by: Mike Zhong <[email protected]> * fix spacing Signed-off-by: Mike Zhong <[email protected]> * remove breakpoint Signed-off-by: Mike Zhong <[email protected]> * Output ctx.working_directory as single output Signed-off-by: Mike Zhong <[email protected]> * more doc strings Signed-off-by: Mike Zhong <[email protected]> * more comments Signed-off-by: Mike Zhong <[email protected]> * Added comments Signed-off-by: Mike Zhong <[email protected]> * Add tests and test files from other branch Signed-off-by: Mike Zhong <[email protected]> * minor fix, have function return the instance rather than create. It seems to get registered when flyte packages your project Signed-off-by: Mike Zhong <[email protected]> * fix tests Signed-off-by: Mike Zhong <[email protected]> * remove set flags not supported by sh Signed-off-by: Mike Zhong <[email protected]> * fix spellcheck lint errors Signed-off-by: Mike Zhong <[email protected]> * don't have a windows equivalent test script so bypassing those tests for now Signed-off-by: Mike Zhong <[email protected]> * Add typing to make_export_string_from_env_dict params Signed-off-by: Mike Zhong <[email protected]> * fixup doc string Signed-off-by: Mike Zhong <[email protected]> * Refactored the new behavior out into a new class. Did not change implementation details at all Signed-off-by: Mike Zhong <[email protected]> * Address linter issues and up test cov Signed-off-by: Mike Zhong <[email protected]> * Skip test on windows, no equivalent script Signed-off-by: Mike Zhong <[email protected]> * Run black and isort, address SC2236 Signed-off-by: Mike Zhong <[email protected]> * Refactored class name to remove _, make utility function require name so multiple uses in a workflow don't break Signed-off-by: Mike Zhong <[email protected]> * fix utility function Signed-off-by: Mike Zhong <[email protected]> * fix utility function call in tests Signed-off-by: Mike Zhong <[email protected]> * Fix isort on test_shell.py Signed-off-by: Mike Zhong <[email protected]> * adding logging settings to papermill plugin Signed-off-by: Calvin Leather <[email protected]> * set level to info logging * improved comments/docs --------- Signed-off-by: Mike Zhong <[email protected]> Signed-off-by: Calvin Leather <[email protected]> Co-authored-by: Mike Zhong <[email protected]> Co-authored-by: Mike Zhong <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
* set lower bound for s3fs Signed-off-by: Kevin Su <[email protected]> * update doc-requirements.txt Signed-off-by: Kevin Su <[email protected]> --------- Signed-off-by: Kevin Su <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
* Add horovod task to mpi plugin Signed-off-by: byhsu <[email protected]> * Remove unused Signed-off-by: byhsu <[email protected]> * add test Signed-off-by: byhsu <[email protected]> * inherit from mpifunctiontask Signed-off-by: byhsu <[email protected]> * fix format Signed-off-by: byhsu <[email protected]> * fix format Signed-off-by: byhsu <[email protected]> --------- Signed-off-by: byhsu <[email protected]> Co-authored-by: byhsu <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: esad <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
) Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.10.0 to 2.11.1. - [Release notes](https://github.com/tensorflow/tensorflow/releases) - [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md) - [Commits](tensorflow/tensorflow@v2.10.0...v2.11.1) --- updated-dependencies: - dependency-name: tensorflow dependency-type: direct:development ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: eduardo apolinario <[email protected]> Co-authored-by: eduardo apolinario <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
* pyflyte run imperative workflow Signed-off-by: Kevin Su <[email protected]> * nit Signed-off-by: Kevin Su <[email protected]> * lint Signed-off-by: Kevin Su <[email protected]> --------- Signed-off-by: Kevin Su <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
* Backfill fix - Backfill was using incorrect arguments - backfill should use the argument that user provides or none at all Signed-off-by: Ketan Umare <[email protected]> * Updated code Signed-off-by: Ketan Umare <[email protected]> * fixed unit test Signed-off-by: Ketan Umare <[email protected]> --------- Signed-off-by: Ketan Umare <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
* Pyflyte prettified Signed-off-by: Ketan Umare <[email protected]> * added dependency Signed-off-by: Ketan Umare <[email protected]> --------- Signed-off-by: Ketan Umare <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
…'elastic'] Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
223c82d
to
379a68b
Compare
Signed-off-by: Fabio Graetz <[email protected]>
Tests are failing because flyteidl needs to be released and updated here. |
TL;DR
This PR brings torch elastic training (
torchrun
) to the pytorch plugin:Type
Are all requirements met?
Complete description
See flyteorg/flyte#3614 for motivation and a more detailed description.
Tracking Issue
flyteorg/flyte#3614
Follow-up issue