nodeLabels from ResourceFlavor not added as Node-Selector values to Kubeflow PytorchJobs #1407
Comments
@tenzen-y have you observed this?
I just double-checked and we have integration tests for this. Can you share your PyTorchJob yaml?
@alculquicondor can't share the full yaml, but the master and worker pytorch pods both have requests specified. See here
Can you share the status of the Workload object associated with your PyTorchJob?
If you're looking for a particular stanza, let me know and I can pull it.
I actually wanted to see the flavor assignments... But it looks like the flavors got something assigned, so no problem on that side. Definitely some problem passing the flavor information into the PyTorchJob.
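(For anyone following along, a rough, hedged sketch of where the flavor assignments appear on the Workload; the namespace is an assumption and the exact field layout may differ by Kueue version.)
$ kubectl get workloads -n default -o yaml
# Relevant stanza, roughly:
#   status:
#     admission:
#       clusterQueue: cluster-queue
#       podSetAssignments:
#       - name: master
#         flavors:
#           cpu: default-flavor
#           memory: default-flavor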
Do all resources (cpu, memory, etc.) share the same flavor or are they different ones? Are both master and worker pods missing the node labels?
Yep, single flavor. Everything that's redacted is just the single flavor name. Both the PyTorchJob master and worker pods are missing the node label.
No, I haven't seen this issue before.
/assign @tenzen-y
@tenzen-y any luck replicating this behavior on your end?
I'm trying to reproduce this. Please give me some time.
@jrleslie I couldn't reproduce this issue. If you submit a PyTorchJob with a nodeSelector but without a Kueue label, do the pods have nodeSelectors?
Also, which version of the training-operator do you use?
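(For reference, a minimal sketch of that check; the job name, image, and requests are placeholders, not taken from this thread. It is a PyTorchJob with a nodeSelector set directly on the pod template and no kueue.x-k8s.io/queue-name label, to see whether the selector survives independently of Kueue.)
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: nodeselector-check          # hypothetical name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            kubernetes.io/arch: arm64   # set directly, not via a ResourceFlavor
          containers:
          - name: pytorch               # training-operator expects this container name
            image: <your-pytorch-image> # placeholder
            resources:
              requests:
                cpu: "1"
                memory: 1Gi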
Oh, I can reproduce this issue with the following steps:
# 1. Create a new cluster
$ kind create cluster --image kindest/node:v1.24.13
# 2. Deploy Kueue
$ kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.5.1/manifests.yaml
# 3. Set up single clusterqueue
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/kueue/main/site/static/examples/admin/single-clusterqueue-setup.yaml
# 4. Add nodeLabels to the ResourceFlavor
$ cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
spec:
  nodeLabels:
    kubernetes.io/arch: arm64
EOF
# 5. Deploy training-operator
$ kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
# 6. Submit a PyTorchJob
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/kueue/main/site/static/examples/jobs/sample-pytorchjob.yaml

@jrleslie Once you use
However, I could not reproduce this issue in any of our unit/integration tests :(
We're using v1.7.0 for the training-operator.
What is this a reference to? Is it a docker image or a config that needs to be set somewhere? Can't find it anywhere.
This is a patch image for the kueue-controller-manager. You can verify that this patch image resolves the issue by changing the kueue-controller-manager deployment's image to this patch image.
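(In case it helps, a hedged sketch of how to swap the image; the patch image reference itself was posted above and isn't repeated here, and the container name manager and namespace kueue-system are assumptions based on the default manifests.)
$ kubectl set image -n kueue-system deployment/kueue-controller-manager manager=<patch-image>
$ kubectl rollout status -n kueue-system deployment/kueue-controller-manager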
Thank you @tenzen-y. Will test and let you know shortly.
Oh, those reproduction steps are invalid. If I use the following steps, no error occurs:
# 1. Create a new cluster
$ kind create cluster --image kindest/node:v1.24.13
# 2. Deploy training-operator
# ################################################################################## #
# [IMPORTANT] We need to deploy training-operator before deploying kueue [IMPORTANT] #
# ################################################################################## #
$ kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
# 3. Deploy Kueue
$ kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.5.1/manifests.yaml
# 4. Set up single clusterqueue
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/kueue/main/site/static/examples/admin/single-clusterqueue-setup.yaml
# 5. Add nodeLabels to the ResourceFlavor
$ cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
spec:
  nodeLabels:
    kubernetes.io/arch: arm64
EOF
# 6. Submit a PyTorchJob
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/kueue/main/site/static/examples/jobs/sample-pytorchjob.yaml
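(A hedged way to confirm step 6 worked; the label shown is the one training-operator normally sets on its pods, and the job name should be adjusted to match the sample manifest.)
$ kubectl get pods -l training.kubeflow.org/job-name=<pytorchjob-name> \
    -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.spec.nodeSelector}{"\n"}{end}'
# Each master/worker pod should print the kubernetes.io/arch=arm64 selector.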
@tenzen-y Is it possible there's a diff between the Kueue Helm chart and the raw manifests you used in your steps above? You're correct that your steps show it working, which I validated. But when I attempt to use the Helm chart, I see the same behavior where no Node-Selectors are added to the PyTorchJob pods. Would you be able to test using the v0.5.1 Helm chart?
@jrleslie the key question is whether you installed the Kubeflow training-operator (more specifically, the CRDs) before installing Kueue.
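(A quick, hedged way to check that; the CRD name is the standard one registered by the training-operator.)
$ kubectl get crd pytorchjobs.kubeflow.org
# If this CRD did not exist when the Kueue controller started, the kubeflow.org/pytorchjob
# integration cannot hook in; restarting the kueue-controller-manager after installing the
# training-operator is one way to rule that out.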
Yep - the Kubeflow training-operator and its CRDs are always installed prior to Kueue. I've tested this on a local kind cluster using @tenzen-y's steps above and on our EC2-based clusters, and I replicated the behavior on both. When I replace the manifest install step with an installation using the Kueue Helm chart, the node selectors are never applied to the PyTorch jobs.
Can you share the first ~20 lines of the logs when the kueue-manager first starts? Any errors?
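(Something along these lines should grab them; the namespace and deployment names assume the default install.)
$ kubectl logs -n kueue-system deployment/kueue-controller-manager | head -n 20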
There are two errors related to the certs that I see popping up in the log.
Not sure if related, but other problems related to the cert-rotator are being looked at in #1445 (comment). That said, I don't see how that could be related to the PyTorch reconciler. And why wouldn't batch Jobs be affected?
Do you see any log line saying
Also, are you creating your PyTorch jobs with the suspend field set? I have the feeling that these jobs are just bypassing Kueue altogether.
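(A hedged way to check whether the jobs are even going through Kueue; the job name is a placeholder. A job managed by Kueue should carry the queue-name label and should have been created suspended until admission.)
$ kubectl get pytorchjob <name> --show-labels                       # look for kueue.x-k8s.io/queue-name
$ kubectl get pytorchjob <name> -o jsonpath='{.spec.runPolicy.suspend}'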
I found that the Helm chart doesn't have the webhookConfigurations for the Kubeflow jobs. I will create a PR.
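(For anyone hitting this before the fix lands, a hedged way to see whether the installed webhook configuration covers kubeflow.org kinds; the object name matches the release manifests, and a Helm install may use a different name.)
$ kubectl get mutatingwebhookconfiguration kueue-mutating-webhook-configuration -o yaml | grep -i kubeflow
# No output here means PyTorchJobs are never mutated/suspended by Kueue on creation.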
uhm... I hope we can make the webhook files synchronization part of the checks, but I remember it was hard as we needed some overrides. Worth thinking about after a hotfix.
I agree. We should add a script in follow-ups.
Created: #1460
@jrleslie can you test the helm installation using this branch? https://github.com/kubernetes-sigs/kueue/tree/release-0.5/charts/kueue We can maybe do an official release right after New Year's.
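(A hedged sketch of installing from that branch; the clone path, release name, and namespace are assumptions.)
$ git clone --branch release-0.5 https://github.com/kubernetes-sigs/kueue.git
$ helm upgrade --install kueue ./kueue/charts/kueue -n kueue-system --create-namespace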
Yep - I can test on my side. Give me a few minutes.
@alculquicondor @tenzen-y I think that may have resolved it. I ran the test scenario above end-to-end on both a local kind cluster and our EC2-based cluster, and things look good with the node-selectors. I was originally seeing some inconsistencies with the CRDs on the EC2-based cluster, but I think that may have been caused by a race condition where CRDs were still being removed while running test scenarios too quickly back-to-back.
/close
@alculquicondor: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
It's great to hear. Thanks.
What happened:
When submitting a Kubeflow PyTorch job to a LocalQueue, the pods associated with the PyTorchJob never get the Node-Selectors values specified via nodeLabels in the ResourceFlavor; the Node-Selectors are set to <none>. FWIW, I tested this with a batch/v1 Job and the Node-Selectors were applied correctly.
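(For illustration, this is roughly how it shows up; the pod name is hypothetical.)
$ kubectl describe pod pytorch-simple-master-0 | grep "Node-Selectors"
Node-Selectors:  <none>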
What you expected to happen:
The PyTorch job gets the proper Node-Selectors added to the associated master and worker pods.
How to reproduce it (as minimally and precisely as possible):
Submit a Kubeflow PyTorch job to a LocalQueue whose associated ResourceFlavor has nodeLabels set.
Anything else we need to know?:
kueue-manager-config has the following enabled
Environment:
Kubernetes version (use kubectl version): Client Version: v1.24.13, Server Version: v1.24.13
Kueue version (use git describe --tags --dirty --always): v0.5.1-1-g69c236f-dirty
Cloud provider or hardware configuration: AWS
OS (e.g. cat /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others: