Add kubeflow cluster environment #7300
Conversation
Hello @neggert! Thanks for updating this PR.
Comment last updated at 2021-05-12 12:41:29 UTC
Codecov Report
@@ Coverage Diff @@
## master #7300 +/- ##
=======================================
- Coverage 92% 88% -4%
=======================================
Files 199 200 +1
Lines 13067 13100 +33
=======================================
- Hits 11997 11470 -527
- Misses 1070 1630 +560
Hey, awesome! Just FYI, here is where we select the environment, and here is the example of TorchElastic spying on env variables to identify itself. Now, as you say, we need to detect PyTorchJob. I don't know PyTorchJob, but I searched GitHub for an os.environ query and found here https://github.com/kubeflow/pytorch-operator/search?q=os.environ that there is a PYTORCHJOB_VERSION env variable. But I'm not sure if this one is guaranteed to be set. Or at minimum we can document that we autodetect IF this variable is present. (btw, the coverage is sometimes red because not all jobs have reported results yet, e.g. GPU still running)
I'm not seeing it. Here's the code that injects variables into the pod containers: https://github.com/kubeflow/pytorch-operator/blob/4aeb6503162465766476519339d3285f75ffe03e/pkg/controller.v1/pytorch/pod.go#L259 The only thing I can think of would be to look for …
You think it may be brittle because you suspect there are other cluster managers that could be running with Kubernetes and sharing the same env variables? I wouldn't know how to answer this, but I could ask around. If we are unsure, it may be better to go with manual selection through argument parsing.
Yeah, that's pretty much the line I'm thinking along. If you're comfortable with it, though, I'm happy to make the code change.
Yes, sounds good!
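To make the auto-detection idea discussed above concrete, here is a minimal sketch of an environment-variable heuristic. The exact variable set is an assumption (Kubernetes service variables plus the distributed-training variables the operator injects per pod); `detects_kubeflow` is a hypothetical helper, not part of the PR.

```python
import os

# Assumed heuristic: treat the presence of the Kubernetes service variable
# together with the distributed-training variables injected into each pod
# as "running under a Kubeflow PyTorchJob".
_REQUIRED_VARS = ("KUBERNETES_PORT", "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def detects_kubeflow(environ=os.environ) -> bool:
    # Returns True only when every expected variable is present.
    return all(var in environ for var in _REQUIRED_VARS)
```

As the discussion notes, this could misfire for other cluster managers running on Kubernetes that inject the same variables, which is why manual selection remains the safe fallback.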
```python
import logging
import os

from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment

log = logging.getLogger(__name__)


class KubeflowEnvironment(ClusterEnvironment):

    def creates_children(self) -> bool:
        return True

    def master_address(self) -> str:
        return os.environ["MASTER_ADDR"]

    def master_port(self) -> int:
        return int(os.environ["MASTER_PORT"])

    def world_size(self) -> int:
        return int(os.environ["WORLD_SIZE"])

    def set_world_size(self, size: int) -> None:
        log.debug("KubeflowEnvironment.set_world_size was called, but setting world size is not allowed. Ignored.")

    def global_rank(self) -> int:
        return int(os.environ["RANK"])

    def set_global_rank(self, rank: int) -> None:
        log.debug("KubeflowEnvironment.set_global_rank was called, but setting global rank is not allowed. Ignored.")

    def local_rank(self) -> int:
        return 0

    def node_rank(self) -> int:
        return self.global_rank()
```
@awaelchli just wondering, why are these not properties?
Not my decision. They were never properties from the very beginning, and when I started working on these environments it was easier to keep the existing pattern. I will refactor it; issue tracking this here: #6303
```python
        use_torchelastic_ddp or
        use_kubeflow_ddp or
        use_ddp_cpu_kubeflow
    ):
```
This seems like it might be due for a refactor, but I just extended the existing code for now, since I'm not 100% sure of all the edge cases this might be handling.
It's getting a bit ridiculous, yes. I have ideas for how to simplify this in the future. But your addition looks solid!
So to summarize: The Kubeflow cluster environment is available with: DDPPlugin, DDPSpawnPlugin, user custom plugin, correct?
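Purely as an illustration of the kind of simplification hinted at above (this is not the actual refactor, and `select_ddp_flavor` is a hypothetical helper), the chained `use_*` booleans could become an ordered, first-match-wins dispatch:

```python
# Hypothetical sketch: collapse the growing "use_torchelastic_ddp or
# use_kubeflow_ddp or ..." chain into an ordered lookup.
def select_ddp_flavor(flags: dict) -> str:
    order = ("torchelastic_ddp", "kubeflow_ddp", "ddp_cpu_kubeflow")
    for name in order:
        if flags.get(name):
            return name
    return "ddp"  # fall back to plain DDP


print(select_ddp_flavor({"kubeflow_ddp": True}))  # → kubeflow_ddp
```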
I think this will always auto-select DDPPlugin when in Kubeflow. Per my other comment, I'm not sure DDPSpawnPlugin makes much sense in this environment.
great addition!!
Co-authored-by: Adrian Wälchli <[email protected]>
Any idea why that pre-commit check is failing? Pre-commit seems to run fine on my end.
Nice!
Unrelated; opened #7500 to fix it.
What does this PR do?
Adds a ClusterEnvironment that works with the PyTorchJob operator in Kubeflow.

One open question from our discussion in Slack: is there a way to automatically tell if we're running inside a PyTorchJob? I've looked into this a bit, and I don't think there is. It's pretty easy to tell if we're in Kubernetes (various environment variables), but I haven't found a good way to check whether we're specifically running in a PyTorchJob. We could maybe query the Kubernetes API to find out, but that seems like overkill to me. Looking for feedback here.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃