Feat: Warn when doing local torch elastic training with nnodes > 1 #1697

Merged
fg91 merged 1 commit into master from fg91/feat/warn-local-elastic-training
Jun 19, 2023

Conversation

@fg91 (Member) commented Jun 19, 2023

TL;DR

With @task(task_config=Elastic(...)) one can perform training with torch elastic launch (torchrun).
This works both locally and in a cluster with a Kubeflow PyTorchJob.

When executing a workflow locally, i.e. python workflow.py, but setting e.g. Elastic(nnodes=2), the rendezvous of the workers will time out because the workers wait for workers from a nonexistent 2nd node to join.

One would have to set the log level to debug in order to see that torch is waiting for the rendezvous to complete. By default, the workflow appears to do nothing.

In this PR I add a warning log message that informs the user about this.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

I check for an environment variable that is set by the Kubeflow training operator. If it is not set but the user set nnodes>1, the warning is emitted.

One could discuss whether we should automatically switch to nnodes=1 when the environment variables for distributed training have not been set by the training operator, but I found this too intrusive. Warning the user, however, is something we should do.
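
For readers who want to see the shape of the check, here is a minimal sketch of the logic described in this section. It is not the code from the commit: the variable name PET_NNODES is an assumption (the description does not say which variable the training operator sets), and the function name, logger name, and warning text are hypothetical.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger("elastic_sketch")  # hypothetical logger name

# Assumption: a variable that the training operator injects into worker pods.
# The PR description does not name it; PET_NNODES is used here for illustration.
OPERATOR_ENV_VAR = "PET_NNODES"


def warn_if_local_multi_node(nnodes: int, env: Optional[Mapping[str, str]] = None) -> bool:
    """Log a warning when nnodes > 1 but the operator's variable is absent.

    Returns True if the warning was emitted, False otherwise.
    """
    env = os.environ if env is None else env
    dist_env_vars_set = env.get(OPERATOR_ENV_VAR) is not None

    if nnodes > 1 and not dist_env_vars_set:
        # Only warn; nnodes is deliberately left unchanged.
        logger.warning(
            "nnodes is set to %d but the task does not appear to run in a "
            "PyTorchJob. The rendezvous might time out.",
            nnodes,
        )
        return True
    return False
```

The sketch only warns and never overrides nnodes, which mirrors the choice above to avoid silently changing the user's configuration.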

Tracking Issue

NA

Follow-up issue

NA

@fg91 fg91 merged commit 68ac1f5 into master Jun 19, 2023
@fg91 fg91 deleted the fg91/feat/warn-local-elastic-training branch June 19, 2023 17:01