
[20.05] Change k8s runner to use k8s assigned uid to track jobs rather than name #10315

Closed

Conversation

@innovate-invent (Contributor)

This is a PR competing with #10307 to solve the issue of colliding job names.

It uses the uid generated by k8s to track the job, rather than any name generated by Galaxy. This guarantees no collision.

Hopefully this will resolve #10245.
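
(For illustration only: a minimal sketch of the idea, not the PR's actual code, assuming pykube-ng and a reachable cluster. The JOB_ID_LABEL constant mirrors the one in the diff discussed below; the name prefix, namespace, and container are placeholder values.)

```python
import pykube

# Label key that k8s itself attaches to the Job's pods (and, via label
# propagation, to the Job) with the Job's uid as the value.
JOB_ID_LABEL = "controller-uid"

api = pykube.HTTPClient(pykube.KubeConfig.from_env())

job = pykube.Job(api, {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        # Let k8s generate the unique name; Galaxy never derives one itself.
        "generateName": "gxy-",   # placeholder prefix
        "namespace": "default",   # placeholder namespace
    },
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "main", "image": "busybox", "command": ["true"]},
                ],
                "restartPolicy": "Never",
            },
        },
    },
})
job.create()

# After creation, read back the uid-based label and use it as the sole
# handle for tracking the Job from then on.
job.reload()
job_id = job.labels.get(JOB_ID_LABEL)
tracked = pykube.Job.objects(api).filter(
    namespace="default",
    selector={JOB_ID_LABEL: job_id},
)
```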

@nuwang (Member) left a comment

Nice job eliminating a chunk of code. This looks good to me. I think it would be good to squash and fix up the commit history into a single commit so it's easier to cherry-pick, etc.

@innovate-invent (Contributor, Author)

I have tested this with a large workflow and it appears to function correctly. Unfortunately it does not resolve #10245.

@nuwang (Member) left a comment

LGTM!

job.create()
job_id = job.labels.get(JOB_ID_LABEL, False)
if not job_id:
    ## Recover uid label because it wasn't set
Member

Can the commented code be removed? (That should also fix the linting failure.)

@innovate-invent (Contributor, Author), Oct 1, 2020

I am not sure; we may want to retain this in case k8s stops adding the controller-uid tag. I think the tagging is a side effect of a feature where k8s auto-populates the Job labels with the Pod labels when they are not specified. The Pod is actually the object being tagged by k8s with the controller-uid label.

kubernetes/kubernetes#32110

Member

So the k8s Job itself is not tracked? That would rule out automatic retries within k8s (i.e., if a node goes down in the middle of execution, or you lose the connection to it, which could happen on spot deployments).

Member

ahhh, no, you do get job_id there... I'm not sure I understand then.

@innovate-invent (Contributor, Author), Oct 2, 2020

When you schedule a Job in k8s, it adds a controller-uid label, set to the Job's uid, to the Pod spec. It then generates the selector for the Job's pods based on this label. Because the Job labels are omitted, k8s then copies all Pod labels onto the Job, including the controller-uid label it just added. I am exploiting this to allow selecting the Job by its uid. If k8s decided they didn't like this side effect of copying the generated label and removed it, then we would have to uncomment the code that recovers it.
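
(For illustration: a minimal sketch, not the PR's code, of what that commented-out recovery path could look like if k8s ever stopped copying the generated label onto the Job. `recover_job_id` is a hypothetical helper and JOB_ID_LABEL is the constant from the diff above; pykube-ng is assumed.)

```python
JOB_ID_LABEL = "controller-uid"

def recover_job_id(job):
    """Hypothetical fallback for a pykube Job whose controller-uid label
    was not propagated: read the uid from the Job's own metadata and
    attach it as a label so later label-based lookups still work."""
    job.reload()                                # refresh server-side state, including metadata.uid
    job_id = job.labels.get(JOB_ID_LABEL)
    if not job_id:
        job_id = job.obj["metadata"]["uid"]     # uid assigned by k8s at creation
        job.labels[JOB_ID_LABEL] = job_id       # job.labels is backed by job.obj["metadata"]["labels"]
        job.update()                            # PATCH the label back onto the Job
    return job_id
```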

@nuwang (Member), Oct 3, 2020

It's not a limitation in pykube as far as I can see - it's a limitation in the k8s API (see this: kubernetes/kubernetes#76870 (comment)).
A k8s resource is located by the object name, not the uid: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/
As the docs state, the uid is meant to differentiate between historical occurrences of two objects with the same name, not to serve as a resource locator.

Also, jobs from multiple Galaxy instances in the same namespace do not collide, because of this:

#k8s_galaxy_instance_id: my-instance

@innovate-invent (Contributor, Author)

Pykube does not forward the interface to select on anything but labels. The interface to select on object fields is supported by k8s: https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/


@innovate-invent (Contributor, Author)

I was trying to point out that metadata.name is not a selector that pykube supports.

Member

@innovate-invent metadata.name does work as expected in pykube, but it has a different query syntax: https://github.com/galaxyproject/galaxy/pull/10307/files#diff-9deffce866ba96965f0de486716f0c4dR65
So I've gone ahead and cherry-picked your commits and included them in the original PR, #10307, with two modifications:

  1. The name is used as the selector.
  2. The name is generated by k8s (using your idea of generateName) but with a custom prefix.

I think these two changes should give us the best of both worlds: k8s generates and manages a unique name, we can still use the name as the selector, and we avoid the complications of relying on default label propagation or of not being able to set custom labels on a job. Let me know what you think.
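
(For illustration: a minimal sketch of that lookup-by-name syntax in pykube-ng, not the code in #10307. The job name and namespace are placeholders, and a Job created via generateName, as in the earlier sketch, is assumed to exist already.)

```python
import pykube

api = pykube.HTTPClient(pykube.KubeConfig.from_env())

# metadata.name is queried through a field selector rather than a label selector.
found = pykube.Job.objects(api).filter(
    namespace="default",                            # placeholder namespace
    field_selector={"metadata.name": "gxy-abc12"},  # placeholder generated name
)
for job in found:
    print(job.name, job.obj.get("status", {}))
```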

@nsoranzo nsoranzo changed the title Change k8s runner to use k8s assigned uid to track jobs rather than name [20.05] Change k8s runner to use k8s assigned uid to track jobs rather than name Oct 7, 2020
@mvdbeek (Member) commented Oct 14, 2020

We're not going to make such extensive changes to existing releases, but we're trying to get #10407 into 20.09.

@mvdbeek mvdbeek closed this Oct 14, 2020