Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Add torch elastic task type #1583

Closed
wants to merge 40 commits into from
Closed

Conversation

fg91
Copy link
Member

@fg91 fg91 commented Apr 5, 2023

TL;DR

This PR brings torch elastic training (torchrun) to the pytorch plugin:

from flytekitplugins.kfpytorch import Elastic

@task(
    task_config=Elastic(
        replicas=4,
        nproc_per_node=4,
        ...
    ),
    ...
)
def train(...):
    ...

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

See flyteorg/flyte#3614 for motivation and a more detailed description.

Tracking Issue

flyteorg/flyte#3614

Follow-up issue

@fg91 fg91 mentioned this pull request Apr 5, 2023
8 tasks
@codecov
Copy link

codecov bot commented Apr 5, 2023

Codecov Report

Merging #1583 (d5ab6ac) into master (0063b29) will increase coverage by 0.18%.
The diff coverage is 79.06%.

@@            Coverage Diff             @@
##           master    #1583      +/-   ##
==========================================
+ Coverage   69.93%   70.12%   +0.18%     
==========================================
  Files         319      328       +9     
  Lines       29569    29973     +404     
  Branches     5332     5408      +76     
==========================================
+ Hits        20680    21018     +338     
- Misses       8370     8419      +49     
- Partials      519      536      +17     
Impacted Files Coverage Δ
flytekit/core/task.py 33.33% <0.00%> (-0.76%) ⬇️
flytekit/image_spec/__init__.py 0.00% <0.00%> (ø)
flytekit/loggers.py 0.00% <0.00%> (ø)
flytekit/tools/translator.py 75.17% <ø> (+0.78%) ⬆️
plugins/setup.py 0.00% <ø> (ø)
setup.py 0.00% <ø> (ø)
tests/flytekit/unit/cli/pyflyte/test_backfill.py 100.00% <ø> (ø)
tests/flytekit/unit/core/test_type_engine.py 98.62% <ø> (ø)
flytekit/remote/remote.py 39.93% <20.00%> (-0.14%) ⬇️
flytekit/tools/repo.py 81.66% <33.33%> (-1.24%) ⬇️
... and 41 more

... and 7 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@@ -13,8 +13,7 @@

@dataclass
class Elastic(object):
min_replicas: int = 1
max_replicas: int = 1
nnodes: typing.Union[int, str] = 1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the way torchrun exposes this to the user, e.g. --nnodes 2:3

@fg91 fg91 self-assigned this Apr 23, 2023
@fg91 fg91 added the enhancement New feature or request label Apr 23, 2023
Fabio Grätz and others added 21 commits April 23, 2023 10:44
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Grätz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* Unify sqlalchemy Dockerfiles

Signed-off-by: eduardo apolinario <[email protected]>

* Simplifyte sqlalchemy Dockerfile

Signed-off-by: eduardo apolinario <[email protected]>

* Set pythonpath to /root

Signed-off-by: eduardo apolinario <[email protected]>

---------

Signed-off-by: eduardo apolinario <[email protected]>
Co-authored-by: eduardo apolinario <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* pyflyte build

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* Support serialize and package

Signed-off-by: Kevin Su <[email protected]>

* more tests

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* move to plugin

Signed-off-by: Kevin Su <[email protected]>

* test

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* test

Signed-off-by: Kevin Su <[email protected]>

* test

Signed-off-by: Kevin Su <[email protected]>

* test

Signed-off-by: Kevin Su <[email protected]>

* test

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* wip

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* fixed tested

Signed-off-by: Kevin Su <[email protected]>

* fixed tested

Signed-off-by: Kevin Su <[email protected]>

* more tests

Signed-off-by: Kevin Su <[email protected]>

* Add support passing yaml in pyflyte run

Signed-off-by: Kevin Su <[email protected]>

* lint

Signed-off-by: Kevin Su <[email protected]>

* lint

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

---------

Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
…ror messages (#1582)

* improve input type conversion error

Signed-off-by: Niels Bantilan <[email protected]>

* fix lint

Signed-off-by: Niels Bantilan <[email protected]>

* fix

Signed-off-by: Niels Bantilan <[email protected]>

* add tests

Signed-off-by: Niels Bantilan <[email protected]>

* add tests

Signed-off-by: Niels Bantilan <[email protected]>

* add rich

Signed-off-by: Niels Bantilan <[email protected]>

* fix lint

Signed-off-by: Niels Bantilan <[email protected]>

* remove prototyping script, update loggers

Signed-off-by: Niels Bantilan <[email protected]>

* update __init__.py

Signed-off-by: Niels Bantilan <[email protected]>

* update logger

Signed-off-by: Niels Bantilan <[email protected]>

* update logger

Signed-off-by: Niels Bantilan <[email protected]>

* fix GE and pandera tests

Signed-off-by: Niels Bantilan <[email protected]>

* fix lint

Signed-off-by: Niels Bantilan <[email protected]>

---------

Signed-off-by: Niels Bantilan <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
pingsutw and others added 18 commits April 23, 2023 10:44
* Add support nested FlyteFile in pyflyte run

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

---------

Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* Set cache_regions=True in the case of s3fs

* Linting

Signed-off-by: eduardo apolinario <[email protected]>

* Fix tests

Signed-off-by: eduardo apolinario <[email protected]>

* More linting

Signed-off-by: eduardo apolinario <[email protected]>

---------

Signed-off-by: eduardo apolinario <[email protected]>
Co-authored-by: eduardo apolinario <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* set default ComSpec when running on Windows

Signed-off-by: Kevin Su <[email protected]>

* lint

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

---------

Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* add activate-launchplan command to pyflyte

Signed-off-by: Artem Petrov <[email protected]>

* fix linting

Signed-off-by: Artem Petrov <[email protected]>

* rework launchplan command

Signed-off-by: Artem Petrov <[email protected]>

* fix linting

Signed-off-by: Artem Petrov <[email protected]>

---------

Signed-off-by: Artem Petrov <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* checkpoint

Signed-off-by: Mike Zhong <[email protected]>

* Experimental implementation works well. Instead of messing with the class, we utilize the interpolation and context to dynamically generate a wrapper around the desired script. In the wrapper, we cd to the ctx.working_directory, export the env variables (handled by an additional method to convert dict to str), and then pass the arguments to the script

Signed-off-by: Mike Zhong <[email protected]>

* fix spacing

Signed-off-by: Mike Zhong <[email protected]>

* remove breakpoint

Signed-off-by: Mike Zhong <[email protected]>

* Output ctx.working_directory as single output

Signed-off-by: Mike Zhong <[email protected]>

* more doc strings

Signed-off-by: Mike Zhong <[email protected]>

* more comments

Signed-off-by: Mike Zhong <[email protected]>

* Added comments

Signed-off-by: Mike Zhong <[email protected]>

* Add tests and test files from other branch

Signed-off-by: Mike Zhong <[email protected]>

* minor fix, have function return the instance rather than create. It seems to get registered when flyte packages your project

Signed-off-by: Mike Zhong <[email protected]>

* fix tests

Signed-off-by: Mike Zhong <[email protected]>

* remove set flags not supported by sh

Signed-off-by: Mike Zhong <[email protected]>

* fix spellcheck lint errors

Signed-off-by: Mike Zhong <[email protected]>

* don't have a windows equivalent test script so bypassing those tests for now

Signed-off-by: Mike Zhong <[email protected]>

* Add typing to make_export_string_from_env_dict params

Signed-off-by: Mike Zhong <[email protected]>

* fixup doc string

Signed-off-by: Mike Zhong <[email protected]>

* Refactored the new behavior out into a new class. Did not change implementation details at all

Signed-off-by: Mike Zhong <[email protected]>

* Address linter issues and up test cov

Signed-off-by: Mike Zhong <[email protected]>

* Skip test on windows, no equivalent script

Signed-off-by: Mike Zhong <[email protected]>

* Run black and isort, address SC2236

Signed-off-by: Mike Zhong <[email protected]>

* Refactored class name to remove _, make utility function require name so multiple uses in a workflow don't break

Signed-off-by: Mike Zhong <[email protected]>

* fix utility function

Signed-off-by: Mike Zhong <[email protected]>

* fix utility function call in tests

Signed-off-by: Mike Zhong <[email protected]>

* Fix isort on test_shell.py

Signed-off-by: Mike Zhong <[email protected]>

* adding logging settings to papermill plugin

Signed-off-by: Calvin Leather <[email protected]>

* set level to info logging

* improved comments/docs

---------

Signed-off-by: Mike Zhong <[email protected]>
Signed-off-by: Calvin Leather <[email protected]>
Co-authored-by: Mike Zhong <[email protected]>
Co-authored-by: Mike Zhong <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* set lower bound for s3fs

Signed-off-by: Kevin Su <[email protected]>

* update doc-requirements.txt

Signed-off-by: Kevin Su <[email protected]>

---------

Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* Add horovod task to mpi plugin

Signed-off-by: byhsu <[email protected]>

* Remove unused

Signed-off-by: byhsu <[email protected]>

* add test

Signed-off-by: byhsu <[email protected]>

* inherit from mpifunctiontask

Signed-off-by: byhsu <[email protected]>

* fix format

Signed-off-by: byhsu <[email protected]>

* fix format

Signed-off-by: byhsu <[email protected]>

---------

Signed-off-by: byhsu <[email protected]>
Co-authored-by: byhsu <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
)

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.10.0 to 2.11.1.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](tensorflow/tensorflow@v2.10.0...v2.11.1)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: eduardo apolinario <[email protected]>
Co-authored-by: eduardo apolinario <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* pyflyte run imperative workflow

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* lint

Signed-off-by: Kevin Su <[email protected]>

---------

Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* Backfill fix

- Backfill was using incorrect arguments
- backfill should use the argument that user provides or none at all

Signed-off-by: Ketan Umare <[email protected]>

* Updated code

Signed-off-by: Ketan Umare <[email protected]>

* fixed unit test

Signed-off-by: Ketan Umare <[email protected]>

---------

Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* Pyflyte prettified

Signed-off-by: Ketan Umare <[email protected]>

* added dependency

Signed-off-by: Ketan Umare <[email protected]>

---------

Signed-off-by: Ketan Umare <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
@fg91 fg91 force-pushed the fabio/feat/torch-elastic-plugin branch from 223c82d to 379a68b Compare April 23, 2023 10:45
Signed-off-by: Fabio Graetz <[email protected]>
@fg91
Copy link
Member Author

fg91 commented Apr 23, 2023

Tests are failing because flyteidl needs to be released and updated here.

@fg91 fg91 marked this pull request as ready for review April 23, 2023 11:01
@fg91 fg91 closed this Apr 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants