[20.05] Change k8s runner to use k8s assigned uid to track jobs rather than name #10315
Conversation
Nice job eliminating a chunk of code. This looks good to me. I think it would be good to squash and fix up the commit history to a single commit so it's easier to cherry-pick etc.
I have tested this with a large workflow and it appears to function correctly. Unfortunately, it does not resolve #10245.
Force-pushed from 7e522ff to f626e6c
LGTM!
job.create()
job_id = job.labels.get(JOB_ID_LABEL, False)
if not job_id:
    ## Recover uid label because it wasn't set
The commented code can be removed? (should also fix linting failure).
I am not sure if we should retain this in case k8s stops adding the controller-uid label. I think the labelling is a side effect of a feature where k8s auto-populates the Job labels from the Pod labels when they are not specified. The Pod is actually the object that k8s tags with the controller-uid label.
So the k8s job itself is not tracked? That would rule out automatic retries within k8s (i.e. a node went down in the middle of execution, or you lost the connection to it, which can happen on spot deployments).
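For context, automatic retries in k8s are driven by fields on the Job spec itself; a minimal sketch with illustrative values (the image and command are placeholders):

```python
# Fragment of a Job body showing the fields that control k8s-side retries.
job_body = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-job"},
    "spec": {
        "backoffLimit": 3,  # k8s re-creates failed pods up to 3 times
        "template": {
            "spec": {
                "containers": [{"name": "main", "image": "busybox",
                                "command": ["true"]}],
                "restartPolicy": "Never",  # retries happen at the Job level
            }
        },
    },
}
```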
ahhh, no, you do get job_id there... I'm not sure I understand then.
When you schedule a Job in k8s, it adds a controller-uid label with the Job's uid to the Pod spec, and it then generates the selector for the Job's pods based on this label. Because the Job labels are omitted, k8s then copies all Pod labels, including the controller-uid label it just added, onto the Job. I am exploiting this to allow selecting the Job by its uid. If k8s decided they didn't like this side effect of copying the generated label and removed it, then we would have to uncomment the code that recovers it.
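For illustration, a minimal pykube sketch of selecting a Job through that propagated label; the label key, config loading, and namespace are assumptions for the example, not the PR's exact code:

```python
import pykube

# Assumed label key: k8s adds controller-uid to the pod template, and the
# side effect described above copies it onto the Job itself.
JOB_ID_LABEL = "controller-uid"

api = pykube.HTTPClient(pykube.KubeConfig.from_env())

def find_job_by_uid(uid, namespace="default"):
    # Label selectors are the query interface pykube exposes here.
    return list(pykube.Job.objects(api).filter(
        namespace=namespace,
        selector={JOB_ID_LABEL: uid},
    ))
```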
It's not a limitation in pykube as far as I can see; it's a limitation in the k8s API (see this: kubernetes/kubernetes#76870 (comment)).
The k8s resource is located based on the object name, not the uid: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/
As the docs state, the uid is meant to differentiate between historical occurrences of objects with the same name, not to act as a resource locator.
Also, jobs from multiple Galaxy instances in the same namespace do not collide because of this:
#k8s_galaxy_instance_id: my-instance
Pykube does not expose an interface to select on anything but labels. Selecting on object fields is supported by the k8s API, though: https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/
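At the HTTP level, a field selector is just a fieldSelector query parameter on the list endpoint; a hedged sketch against the raw API, assuming the usual in-cluster server address and service-account paths:

```python
import requests

API_SERVER = "https://kubernetes.default.svc"  # in-cluster default
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"

with open(f"{SA_DIR}/token") as f:
    token = f.read()

# List Jobs whose metadata.name matches, via a field selector.
resp = requests.get(
    f"{API_SERVER}/apis/batch/v1/namespaces/default/jobs",
    headers={"Authorization": f"Bearer {token}"},
    params={"fieldSelector": "metadata.name=my-job"},
    verify=f"{SA_DIR}/ca.crt",
)
resp.raise_for_status()
jobs = resp.json()["items"]
```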
metadata.uid is not a valid field selector though: https://stackoverflow.com/questions/59442065/how-to-use-kubernetes-fieldselector-to-query-ownerreferences
I was trying to point out that metadata.name is not a selector that pykube supports.
@innovate-invent metadata.name does work as expected in pykube but has a different query syntax: https://github.com/galaxyproject/galaxy/pull/10307/files#diff-9deffce866ba96965f0de486716f0c4dR65
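A sketch of that name-based query, assuming a pykube version whose filter() accepts a field_selector argument (the linked diff is the authoritative form):

```python
import pykube

api = pykube.HTTPClient(pykube.KubeConfig.from_env())

# Note the field_selector keyword, as opposed to the selector= keyword
# used for label queries.
jobs = pykube.Job.objects(api).filter(
    namespace="default",
    field_selector={"metadata.name": "galaxy-abc123"},
)
```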
So I've gone ahead and cherry-picked your commits and included them in the original PR here: #10307, but with two modifications:
- The name is used as the selector.
- The name is generated by k8s (using your idea of generateName) but with a custom prefix.
These two changes should give us the best of both worlds, I think: k8s generates and manages a unique name, we can still use the name as the selector, and we avoid the complications of relying on default label propagation or being unable to set custom labels on a job. Let me know what you think.
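A minimal sketch of that approach; the galaxy- prefix and pod spec are illustrative:

```python
import pykube

api = pykube.HTTPClient(pykube.KubeConfig.from_env())

job = pykube.Job(api, {
    "apiVersion": "batch/v1",
    "kind": "Job",
    # generateName asks the API server to append a random suffix
    # (e.g. "galaxy-7x9kq"), so uniqueness is handled by k8s.
    "metadata": {"generateName": "galaxy-"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "main", "image": "busybox",
                                "command": ["true"]}],
                "restartPolicy": "Never",
            }
        }
    },
})
job.create()

# The server's response populates metadata, so the generated name is
# available immediately and can serve as the selector from then on.
job_name = job.obj["metadata"]["name"]
```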
We're not going to make such extensive changes to existing releases, but we're trying to get #10407 into 20.09.
This is a competing PR to #10307 for solving the issue of colliding job names.
It uses the uid generated by k8s to track the job, rather than any name generated by Galaxy. This guarantees there are no collisions.
Hopefully this will resolve #10245.
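To make the mechanism concrete, a hedged sketch of uid-based tracking with pykube; the prefix, pod spec, and config loading are illustrative assumptions, not the PR's exact code:

```python
import pykube

api = pykube.HTTPClient(pykube.KubeConfig.from_env())

job = pykube.Job(api, {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"generateName": "gxy-"},  # let k8s pick a unique name
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "main", "image": "busybox",
                                "command": ["true"]}],
                "restartPolicy": "Never",
            }
        }
    },
})
job.create()

# k8s assigns a cluster-unique uid in the create() response; that uid,
# not any Galaxy-generated name, is what gets recorded for tracking.
job_uid = job.obj["metadata"]["uid"]

# The job is later re-located via the controller-uid label that k8s
# propagates (see the discussion above), so identically named jobs from
# other instances cannot be confused with this one.
matches = pykube.Job.objects(api).filter(
    namespace="default",
    selector={"controller-uid": job_uid},
)
```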