
Add kubeflow cluster environment #7300

Merged
merged 11 commits into from
May 17, 2021

Conversation

neggert
Contributor

@neggert neggert commented Apr 30, 2021

What does this PR do?

Adds a ClusterEnvironment that works with the PyTorchJob operator in Kubeflow.

One open question from our discussion in Slack: is there a way to automatically tell if we're running inside a PyTorchJob? I've looked into this a bit, and I don't think there is. It's pretty easy to tell if we're in Kubernetes (various environment variables), but I haven't found a good way to check whether we're specifically running in a PyTorchJob. We could maybe query the Kubernetes API to find out, but that seems like overkill to me. Looking for feedback here.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks

pep8speaks commented Apr 30, 2021

Hello @neggert! Thanks for updating this PR.

Line 424:17: W503 line break before binary operator

Comment last updated at 2021-05-12 12:41:29 UTC

@codecov

codecov bot commented Apr 30, 2021

Codecov Report

Merging #7300 (04c571c) into master (db54b30) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #7300    +/-   ##
=======================================
- Coverage      92%     88%    -4%     
=======================================
  Files         199     200     +1     
  Lines       13067   13100    +33     
=======================================
- Hits        11997   11470   -527     
- Misses       1070    1630   +560     

@awaelchli
Contributor

awaelchli commented Apr 30, 2021

Hey! Awesome

Just fyi here is where we select the environment:

https://github.com/PyTorchLightning/pytorch-lightning/blob/490cc57809ebeba19003b4101393a8a058217c31/pytorch_lightning/trainer/connectors/accelerator_connector.py#L482

And here is an example of an environment identifying itself via env variables: in TorchElasticEnvironment.is_using_torch_elastic() we check for TorchElastic-specific variables:
https://github.com/PyTorchLightning/pytorch-lightning/blob/490cc57809ebeba19003b4101393a8a058217c31/pytorch_lightning/plugins/environments/torchelastic_environment.py#L29

Now, as you say, we need to detect PyTorchJob. I don't know PyTorchJob, but I searched the repo for os.environ (https://github.com/kubeflow/pytorch-operator/search?q=os.environ) and found a PYTORCHJOB_VERSION env variable. I'm not sure this one is guaranteed to be set, though. At a minimum we could document that we auto-detect only if this variable is present.

(BTW, the coverage check is sometimes red because not all jobs have reported results yet, e.g. the GPU jobs are still running.)

@awaelchli awaelchli added the feature Is an improvement or enhancement label Apr 30, 2021
@awaelchli awaelchli added this to the v1.4 milestone Apr 30, 2021
@awaelchli awaelchli self-assigned this Apr 30, 2021
@neggert
Contributor Author

neggert commented Apr 30, 2021

I'm not seeing PYTORCHJOB_VERSION get set in our Kubeflow install. It looks to me like that variable is used by the Python SDK, but it's not actually injected into the container itself.

Here's the code that injects variables into the pod containers: https://github.com/kubeflow/pytorch-operator/blob/4aeb6503162465766476519339d3285f75ffe03e/pkg/controller.v1/pytorch/pod.go#L259

The only thing I can think would be to look for KUBERNETES_PORT, MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK to be set, but not GROUP_RANK, LOCAL_RANK, or LOCAL_WORLD_SIZE, since the latter would be set if running under TorchElastic. This seems like it would be brittle, but maybe I'm overthinking it.
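The heuristic described above could be sketched roughly as follows. This is purely illustrative: the function name is made up here, and the PR's final detection logic may differ; it only encodes the "required variables present, TorchElastic-only variables absent" idea from this comment.

```python
import os

# Variables the pytorch-operator injects into every PyTorchJob pod,
# per the discussion above.
_REQUIRED = ("KUBERNETES_PORT", "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
# Variables that would only be present under TorchElastic.
_TORCHELASTIC_ONLY = ("GROUP_RANK", "LOCAL_RANK", "LOCAL_WORLD_SIZE")


def looks_like_kubeflow(environ=os.environ) -> bool:
    """Best-effort guess that we are running inside a Kubeflow PyTorchJob
    (hypothetical helper, not the actual PR implementation)."""
    has_required = all(var in environ for var in _REQUIRED)
    has_elastic_only = any(var in environ for var in _TORCHELASTIC_ONLY)
    return has_required and not has_elastic_only
```

As noted, this is brittle: any other launcher on Kubernetes that sets the same rendezvous variables would also match, which is why manual selection remains the safer fallback.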

@awaelchli
Contributor

You think it may be brittle because you suspect there are other cluster managers that could be running with Kubernetes and be sharing the same env variables? I wouldn't know how to answer this but I could ask around. If we are unsure, it may be better to go with the manual selection / through argument parsing.

@neggert
Contributor Author

neggert commented Apr 30, 2021

Yeah, that's pretty much the lines I'm thinking along. If you're comfortable with it, though, I'm happy to make the code change.

@awaelchli
Contributor

Yes, sounds good!

Comment on lines 30 to 68
def creates_children(self) -> bool:
    return True

def master_address(self) -> str:
    return os.environ['MASTER_ADDR']

def master_port(self) -> int:
    return int(os.environ['MASTER_PORT'])

def world_size(self) -> int:
    return int(os.environ['WORLD_SIZE'])

def set_world_size(self, size: int) -> None:
    log.debug("KubeflowEnvironment.set_world_size was called, but setting world size is not allowed. Ignored.")

def global_rank(self) -> int:
    return int(os.environ["RANK"])

def set_global_rank(self, rank: int) -> None:
    log.debug(
        "KubeflowEnvironment.set_global_rank was called, but setting global rank is not allowed. Ignored."
    )

def local_rank(self) -> int:
    return 0

def node_rank(self) -> int:
    return self.global_rank()
Member

@awaelchli just thinking, why are these not properties?

Contributor

Not my decision. They were never properties from the very beginning, and when I started working on these environments it was easier to keep the existing pattern.

I will refactor it, issue tracking here #6303

@neggert neggert force-pushed the kubeflow_environment branch from 84bd318 to cbdde3c Compare May 7, 2021 21:05
use_torchelastic_ddp or
use_kubeflow_ddp or
use_ddp_cpu_kubeflow
):
Contributor Author

This seems like it might be due for a refactor, but I just extended the existing code for now, since I'm not 100% sure of all the edge cases this might be handling.

Contributor

It's getting a bit ridiculous, yes. I have ideas for how to simplify this in the future. But your addition looks solid!
So to summarize: the Kubeflow cluster environment is available with DDPPlugin, DDPSpawnPlugin, and user custom plugins, correct?

Contributor Author

I think this will always auto-select DDPPlugin when in Kubeflow. Per my other comment, I'm not sure DDPSpawnPlugin makes much sense in this environment.

Contributor

@awaelchli awaelchli left a comment


great addition!!

Co-authored-by: Adrian Wälchli <[email protected]>
@neggert
Contributor Author

neggert commented May 11, 2021

Any idea why that pre-commit check is failing? Pre-commit seems to run fine on my end.

Contributor

@ananthsub ananthsub left a comment


Nice!

@awaelchli awaelchli added the ready PRs ready to be merged label May 12, 2021
@carmocca
Contributor

Any idea why that pre-commit check is failing? Pre-commit seems to run find on my end.

Unrelated, opened #7500 to fix it

Labels
feature Is an improvement or enhancement ready PRs ready to be merged
7 participants