
Get rid of NFS dependency for in-cluster building #1798

Closed
edvald opened this issue Apr 17, 2020 · 5 comments · Fixed by #2372
Labels: enhancement, priority:high, stale

Comments

edvald (Collaborator) commented Apr 17, 2020

Background

One of the most frustrating issues with our remote building feature is the reliance on shareable (ReadWriteMany) volumes, which currently works through NFS, or optionally other RWX-capable volume provisioners such as EFS. This has turned out to impose a significant maintenance burden and cause operational issues for our users.

The reason we currently have this requirement is that in order to sync code from the user's local build staging directory, we rsync to an in-cluster volume that needs to be mountable by the in-cluster builder, whether that is an in-cluster Docker daemon or Kaniko.

At the time we didn't see an alternative that would be reasonably performant. We also had a performance goal in mind that may simply not be as important anymore. The raw efficiency of rsync is appealing, but in many (even most) cases syncing the code is only a small part of the overall build time.

Proposed solution

We can avoid this requirement altogether at the cost of some added code complexity in the actual build flows, in turn avoiding the complexity of managing finicky storage providers.

We still rsync over to the cluster, but instead of mounting a shared sync volume directly from the builder, we modify the flow so that a simpler RWO volume suffices.
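
Concretely, the storage requirement change boils down to the PVC access mode. A minimal sketch using the @kubernetes/client-node types; the claim name and storage size here are placeholders, not Garden's actual chart values:

```typescript
// Sketch only: illustrates the RWX -> RWO change at the PVC level.
// The claim name and storage size are placeholders, not Garden's actual chart values.
import { V1PersistentVolumeClaim } from "@kubernetes/client-node"

const buildSyncPvc: V1PersistentVolumeClaim = {
  apiVersion: "v1",
  kind: "PersistentVolumeClaim",
  metadata: { name: "garden-build-sync" },
  spec: {
    // Previously this needed to be ["ReadWriteMany"], i.e. NFS/EFS or another RWX provisioner.
    // With the proposed flow, a plain RWO volume from the cluster's default storage class is enough.
    accessModes: ["ReadWriteOnce"],
    resources: { requests: { storage: "10Gi" } },
  },
}
```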

Build flow

Depending on the build mode, we do the following:

  • cluster-docker
    • We add an rsync container to the cluster-docker Pod spec, and instead of referencing a shared PVC for the build sync volume, we have an RWO PVC that the main docker container and the rsync container share.
    • When building, we do the exact same thing as previously, but instead of syncing to a separate build sync Pod, we sync to the rsync container in the cluster-docker Pod, before executing the build.
  • Kaniko
    • We still create a build sync Deployment and a PVC, but the PVC can now be RWO, and we only have one sync Pod, much like the in-cluster registry.
    • We still sync to the build-sync Pod ahead of the build.
    • In the Kaniko build Pod, we create an init container that rsyncs from the build-sync Pod to the container filesystem, replacing the shared build-sync PVC reference with a shared emptyDir volume within the Pod (see the sketch after this list).
    • The Kaniko build Pod does what it did previously.
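
To make the flow above more concrete, here is a rough sketch of the proposed Pod wiring using the @kubernetes/client-node types. The images, rsync address, paths, and flags are illustrative assumptions, not the final implementation:

```typescript
import { V1Pod } from "@kubernetes/client-node"

// Kaniko flow: an init container pulls the build context from the (now RWO-backed)
// build-sync Pod into an emptyDir, which the kaniko container then builds from.
// Image names, the rsync address, and paths below are placeholders.
const kanikoBuildPod: V1Pod = {
  apiVersion: "v1",
  kind: "Pod",
  metadata: { name: "kaniko-my-module" },
  spec: {
    volumes: [{ name: "build-context", emptyDir: {} }],
    initContainers: [
      {
        name: "sync-build-context",
        image: "some-rsync-image", // placeholder
        command: ["rsync", "-rlpt", "rsync://build-sync-v2/volume/my-module/", "/context/"],
        volumeMounts: [{ name: "build-context", mountPath: "/context" }],
      },
    ],
    containers: [
      {
        name: "kaniko",
        image: "gcr.io/kaniko-project/executor:latest",
        args: [
          "--context=dir:///context",
          "--dockerfile=/context/Dockerfile",
          "--destination=registry.example.com/my-module:v-abc123", // placeholder destination
        ],
        volumeMounts: [{ name: "build-context", mountPath: "/context" }],
      },
    ],
    restartPolicy: "Never",
  },
}

// cluster-docker flow: the docker daemon Pod gains an rsync sidecar, and both containers
// share a plain RWO PVC instead of the old NFS-backed shared volume. Fragment only.
const dockerDaemonContainers = [
  {
    name: "docker-daemon",
    image: "docker:dind",
    volumeMounts: [{ name: "garden-build-sync", mountPath: "/garden-build-sync" }],
  },
  {
    name: "rsync",
    image: "some-rsync-image", // placeholder; receives the build sync from the Garden client
    volumeMounts: [{ name: "garden-build-sync", mountPath: "/garden-build-sync" }],
  },
]
```

The key point is that the only volume the Kaniko Pod mounts is the Pod-local emptyDir, so no RWX-capable provisioner is needed anywhere in the flow.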

Migration

We change the name of the build-sync service to build-sync-v2, and docker-daemon to docker-daemon-v2 (and the helm release names accordingly). This is to avoid conflicts during rollout, since both function differently from the prior versions and a single deployment cannot serve both old and new client versions simultaneously.

Users need to be instructed to remove the old build-sync and docker-daemon deployments and volumes manually, as well as the NFS provisioner, when their team has updated to the new version. Or uninstall and re-init completely, of course. We can print out a message to this effect in the cluster-init command.

The current storage.sync parameters still apply to the new build-sync-v2 volume without any changes.

We still always install build-sync-v2, even though it isn't necessary when using the cluster-docker build mode, in order to avoid headaches around using cluster-docker and kaniko in different scenarios on the same cluster.

If the docker daemon is installed, the cleanup-cluster-registry script now also executes through the rsync container in the docker-daemon Pod, in addition to the build-sync-v2 Pod.

Benefits

  1. Far simpler requirements for in-cluster building.
  2. Fewer bugs and support headaches.
  3. Less operational overhead for customers.
  4. New Kaniko flow can be made to work without even re-running cluster-init (falling back to older build-sync volume if it's available).

Drawbacks

  1. The two build modes change a bit and have somewhat different flows after the transition.
  2. The transition is slightly cumbersome: users have to be instructed to clean up the older system services manually, and may need to run old and new services in parallel for a bit.

Prioritization

I'd say sooner is better for this, since this causes annoying operational issues for users, and support issues for the Garden team by extension.

DanielKillenberger (Contributor) commented

Has there been any progress on this issue?

We're still having quite a few operational headaches to do with the NFS in our building flows: having to clear images, then having the cleanup command fail, which leads to us having to remove and remount volumes manually, plus a lot more minor problems that come up in between, etc.

If that dependency could be mitigated, I think it would help greatly.

edvald added the priority:medium label and removed the priority:high label Jul 2, 2020
edvald (Collaborator, Author) commented Jul 2, 2020

We unfortunately had to defer this until post 0.12, but now that 0.12 is out, I'm going to see again about getting this done.

edvald added the priority:high label and removed the priority:medium label Jul 2, 2020

stale bot commented Aug 31, 2020

This issue has been automatically marked as stale because it hasn't had any activity in 60 days. It will be closed in 14 days if no further activity occurs (e.g. changing labels, comments, commits, etc.). Please feel free to tag a maintainer and ask them to remove the label if you think it doesn't apply. Thank you for submitting this issue and helping make Garden a better product!

stale bot added the stale label Aug 31, 2020
stale bot closed this as completed Sep 14, 2020
edvald reopened this Sep 14, 2020
stale bot removed the stale label Sep 14, 2020

stale bot commented Dec 13, 2020

This issue has been automatically marked as stale because it hasn't had any activity in 90 days. It will be closed in 14 days if no further activity occurs (e.g. changing labels, comments, commits, etc.). Please feel free to tag a maintainer and ask them to remove the label if you think it doesn't apply. Thank you for submitting this issue and helping make Garden a better product!

stale bot added the stale label Dec 13, 2020
edvald removed the stale label Dec 13, 2020

stale bot commented Mar 16, 2021

This issue has been automatically marked as stale because it hasn't had any activity in 90 days. It will be closed in 14 days if no further activity occurs (e.g. changing labels, comments, commits, etc.). Please feel free to tag a maintainer and ask them to remove the label if you think it doesn't apply. Thank you for submitting this issue and helping make Garden a better product!

stale bot added the stale label Mar 16, 2021
stale bot closed this as completed Mar 31, 2021