automatically remove user registry secrets #333

Closed · rokroskar opened this issue Jun 10, 2020 · 6 comments · Fixed by #435

Labels: devops, enhancement (New feature or request)

Comments

@rokroskar (Member)

In #327 we introduced a mechanism to use the user's own authentication token for pulling images from private repositories. We want to minimize the amount of time these credentials are left lying around, so we need some mechanism for removing them.

Some ideas:

  • remove secrets older than x minutes/hours (sketched below)
  • use a naming scheme that creates one secret per session launch; the secret can then be cleaned up immediately after the container has finished spawning
  • others?
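
As a rough sketch of the age-based idea, a small cull script with the Python kubernetes client could look like the following; the `renku` namespace and the `component=image-pull-secret` label selector are illustrative assumptions, not something that exists today:

```python
# Hypothetical age-based cull of image-pull secrets; the namespace and
# the label selector are illustrative assumptions.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

MAX_AGE = timedelta(minutes=30)  # the "x minutes/hours" from the idea above
now = datetime.now(timezone.utc)

secrets = v1.list_namespaced_secret(
    "renku", label_selector="component=image-pull-secret"
)
for secret in secrets.items:
    if now - secret.metadata.creation_timestamp > MAX_AGE:
        v1.delete_namespaced_secret(secret.metadata.name, "renku")
```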
@olevski (Member) commented Oct 12, 2020

@ableuler after our brief discussion today, here are our options to tackle this:

A. Add a preStop hook on the pod that runs the jupyterhub server to delete the token when the pod shuts down

  • Pros:
    • Only a small amount of code is needed; it can be as simple as a call to the k8s API from the pod to delete the specific secret (see the sketch after this list)
    • It happens automatically exactly when we need it, regardless of how the pod is destroyed (whether because of inactivity or because the user shut it down through the UI)
  • Cons:
    • The kubernetes API is not accessible from within the pod under the new egress policy on the user pods (feat: restrict user pod egress #430)
    • Even if the k8s API were reachable, we would need to inject a service account with the proper permissions into the pod so that it could authenticate with the k8s API and delete the image pull secret. By default no such service account is injected, and one does not currently exist. To make this fully operational I think we would have to create a service account and role binding for every user. The problem is that we do not have a separate namespace per user, so we cannot create a Role that gives a user access only to their own secrets. Unless we decide to put every user's jupyterhub pods in a separate namespace, I think this option is not possible.
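
For illustration only, since the cons above rule this out in practice: the preStop cleanup could be a script along these lines, assuming the pod had a suitably scoped service account and knew its secret's name via an environment variable (both of which are hypothetical):

```python
# Illustrative only: what the preStop cleanup could look like if the pod
# had API access and a suitably scoped service account (neither exists
# today, per the cons above).
import os

from kubernetes import client, config

config.load_incluster_config()  # assumes a mounted service account token
v1 = client.CoreV1Api()

# Hypothetical env vars that the spawner would inject at pod creation.
secret_name = os.environ["IMAGE_PULL_SECRET_NAME"]
namespace = os.environ["POD_NAMESPACE"]

v1.delete_namespaced_secret(secret_name, namespace)
```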

B. Add a cronjob that looks through the secrets and deletes old ones that are not tied to a running jupyterhub pod

  • Pros:
    • One cronjob takes care of all users in a specific deployment
    • The above-mentioned issues with changing the egress policy, injecting a service account into the pod and adding Roles are fully avoided, because the cronjob does not operate from inside the jupyterhub user pod.
  • Cons:
    • There will be some delay before unused secrets are deleted. E.g. the cronjob could run every hour and delete secrets that are older than X hours and have no actively running pod associated with them, so after a user's jupyterhub pod is deleted it will take some time for that user's image pull secret to be removed (see the sketch after this list).
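
A sketch of what that cleanup pass could look like with the Python kubernetes client; the namespace, the label names and the one-hour cutoff are all illustrative assumptions:

```python
# Illustrative cleanup pass for option B: delete image-pull secrets that
# are older than a cutoff AND have no pending or running pod left that
# might still need them.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

NAMESPACE = "renku"            # assumed deployment namespace
MAX_AGE = timedelta(hours=1)   # the "X hours" threshold mentioned above
now = datetime.now(timezone.utc)

secrets = v1.list_namespaced_secret(
    NAMESPACE, label_selector="component=image-pull-secret"
)
for secret in secrets.items:
    if now - secret.metadata.creation_timestamp <= MAX_AGE:
        continue
    # Hypothetical "username" label shared by the secret and the user's pod.
    user = (secret.metadata.labels or {}).get("username")
    if user:
        pods = v1.list_namespaced_pod(
            NAMESPACE, label_selector=f"username={user}"
        )
        # Keep the secret while a pod is still starting up or running;
        # the credentials may not have been used yet.
        if any(p.status.phase in ("Pending", "Running") for p in pods.items):
            continue
    v1.delete_namespaced_secret(secret.metadata.name, NAMESPACE)
```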

Let me know what you think. Hopefully I did not miss any major considerations in writing this up. I tried to access the k8s API from a running pod, and the requests would time out as long as the egress networkPolicy from #430 was active. When I delete that egress policy, I can successfully reach the k8s API.

I am not sure if this warrants a further/wider discussion about how much access to the k8s API we give to users and how we control/restrict this.

@ableuler (Contributor)

Thanks @olevski for laying out the options. I think option B is better. I don't mind secrets being around a bit longer than needed as long as they are eventually cleaned up.

@rokroskar (Member, Author)

Under option A there is also the possibility of the secret hanging around far longer than necessary: it is only needed at launch time, and the time between launch and the preStop hook being executed could be days or even weeks. A cron job running at ~30-minute intervals, on the other hand, means there is always a clear worst-case window.

@ableuler (Contributor)

@rokroskar We actually discussed this: we definitely need the secret until the pod has been assigned to a node (which can take a while when resources are insufficient). Do you know whether k8s will currently reschedule a user pod on a different node in case of a node failure?

@rokroskar (Member, Author)

Right, so the secret culling process has to check whether the pod that might need the secret is actually up and running, to avoid removing the secret before the credentials have actually been used.

AFAIK if the node fails the pod is gone.

@ableuler (Contributor)

> AFAIK if the node fails the pod is gone.

True - I actually hope so, because all the ephemeral disk space would be gone anyway...
