Feat: Add torchrun plugin #1576
Codecov Report
@@           Coverage Diff           @@
##           master    #1576   +/-   ##
=======================================
  Coverage   69.92%   69.92%
=======================================
  Files         319      319
  Lines       29525    29525
  Branches     5317     5317
=======================================
  Hits        20644    20644
  Misses       8365     8365
  Partials      516      516
        start_method=self.task_config.start_method,
    )

    if self.task_config.start_method == "spawn":
@kumare3 what do you think of this workaround? Unfortunately, we can't just pass self._task_function to elastic_launch since it is not pickleable.
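To illustrate the kind of pickle limitation mentioned above, here is a minimal, standard-library-only sketch. It assumes the failure comes from a decorator rebinding the function's module-level name to a wrapper object, which is one common cause but is not confirmed by this PR. `FakeTask`, `task`, and `is_picklable` are hypothetical names for illustration and are not part of flytekit.

```python
import pickle


class FakeTask:
    """Hypothetical stand-in for a task wrapper; not flytekit's real class."""

    def __init__(self, fn):
        # Keep a reference to the undecorated user function.
        self._task_function = fn


def task(fn):
    # Hypothetical decorator: the name `train` below ends up bound to a
    # FakeTask instance instead of the original function.
    return FakeTask(fn)


@task
def train():
    return "trained"


def is_picklable(obj) -> bool:
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    # pickle stores functions by qualified name. Looking up `train` in the
    # module now yields the wrapper, not the function, so pickling is refused.
    print(is_picklable(train._task_function))
```

Process start methods that serialize the entrypoint, such as "spawn", run into this, which is presumably why the diff above special-cases that start method.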
@kumare3 I also opened draft PRs in flyteidl and flyteplugins. Please note that these are very early WIP prototypes: I wanted to first hack everything together to derisk any potential deal breakers, such as the limitations imposed by pickle that I then stumbled upon (see this comment).

I think this could be part of the existing kfpytorch plugin as an optional install. In flytekit, it could look like this:

    @task(
        task_config=PyTorch(
            num_workers=2,
            elastic_policy=ElasticPolicy(  # <- optional
                n_proc_per_node=...,
                # ...
            ),
        ),
    )
    def train():
        ...

If done this way, most of the existing logic in flyteplugins could be reused. We would only need to check whether the user configured an elastic policy and, if so, add it to the PytorchJob object. I quickly hard-coded this in this draft PR.

One last point: I think it would be amazing if this could start a local process group when running locally. What do you think about this?
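The "check whether an elastic policy is configured and, if so, add it" idea from the comment above can be sketched as follows. This is only an illustration in Python: the real check would live in flyteplugins, and the dataclasses, the `build_job_spec` helper, and the dictionary keys are all hypothetical. Only `num_workers`, `elastic_policy`, and `n_proc_per_node` are names taken from the proposal.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ElasticPolicy:
    # Hypothetical: only this field is named in the proposal.
    n_proc_per_node: int


@dataclass
class PyTorch:
    num_workers: int
    # Optional: None means a regular, non-elastic job.
    elastic_policy: Optional[ElasticPolicy] = None


def build_job_spec(config: PyTorch) -> Dict[str, Any]:
    """Build a hypothetical job spec, adding elastic settings only if set."""
    spec: Dict[str, Any] = {"workers": config.num_workers}
    if config.elastic_policy is not None:
        spec["elastic_policy"] = {
            "n_proc_per_node": config.elastic_policy.n_proc_per_node,
        }
    return spec


if __name__ == "__main__":
    print(build_job_spec(PyTorch(num_workers=2)))
    print(build_job_spec(PyTorch(num_workers=2, elastic_policy=ElasticPolicy(4))))
```

Keeping the policy optional means existing task configurations produce the same spec as before, so the non-elastic code path is untouched.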
Do you mind creating an issue in flyte mp and referencing it in all related PRs? That way, we can have a central place to track the feature.
Closed in favor of #1583
TL;DR
Work in progress
This plugin allows running torch elastic (torchrun) distributed training with Flyte.
Tracking Issue
https://github.com/flyteorg/flyte/issues/