
optimize k8s.gcr.io backup script #666

Closed · listx opened this issue Mar 18, 2020 · 6 comments

@listx (Contributor) commented Mar 18, 2020

We are exceeding quota for GCR by doing a simple copy of ~30K images in each of the 3 regions (for a total of ~90K image copies).

We have spoken with the GCR team, and there is currently no way of granting special quota privileges to particular GCP projects. While simply increasing the quota might be worth pursuing, I think we can just work on optimizing the backup scripts. There are a couple of reasons for this:

  • We currently do a simple gcrane cp -r <prod> <prod-backup>/<timestamp>, and while this works, it is slow (on the order of hours to complete), even if there are 0 changes since the last backup (a rough sketch of this copy follows the list).
  • The prod GCR (k8s-artifacts-prod) will be immutable (the promoter does not allow mutations), so it is essentially an ever-growing registry. Given this, we can make the backup add only the delta of new images since the last snapshot, as previously backed-up copies will never change or be removed.
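
For concreteness, here is a rough sketch of the current full copy; the project names, regions, and timestamp layout are placeholders rather than the real values used by the backup job:

    # Sketch of the current per-region full copy described above.
    # Project names, regions, and the snapshot prefix are placeholders.
    PROD_PROJECT="k8s-artifacts-prod"
    BACKUP_PROJECT="k8s-artifacts-prod-backup"   # hypothetical backup project
    SNAPSHOT="$(date -u +%Y/%m/%d)"              # hypothetical prefix layout
    for region in us eu asia; do
      gcrane cp -r "${region}.gcr.io/${PROD_PROJECT}" \
                   "${region}.gcr.io/${BACKUP_PROJECT}/${SNAPSHOT}"
    done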

I'll come up with a more detailed design soon.

/assign @listx

@stp-ip (Member) commented Mar 19, 2020

First suggestion would be to back up only a single region and add a separate sync check. We should be checking that all regions are in sync anyway.
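
As a rough illustration of such a sync check (the project name is a placeholder, and a real check would also recurse into nested repositories and compare digests, not just repository names):

    # Sketch: compare the repository listings of the regional mirrors.
    PROJECT="k8s-artifacts-prod"   # placeholder for the real prod project
    list_repos() {
      # List repositories under $1 with the registry host stripped, so that
      # listings from different regions can be diffed directly.
      gcloud container images list --repository="$1" --format='get(name)' |
        sed 's|^[^/]*/||' | sort
    }
    for region in eu asia; do
      diff <(list_repos "us.gcr.io/${PROJECT}") \
           <(list_repos "${region}.gcr.io/${PROJECT}") \
        || echo "us and ${region} are out of sync"
    done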

Is copying into the same prefix/timestamp faster?

We could copy into the same location with gcrane cp -r and set a long-term retention policy to prevent modifications. That goes against the GCR team's recommendation not to assume or work with anything in GCS directly, but AFAIK we already do this for prod, so we could do it here as well.
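
For reference, the retention idea would look roughly like this on the GCS buckets that back GCR; the project name is illustrative, and this is exactly the kind of reliance on GCS internals mentioned above:

    # Sketch: lock the GCS buckets backing GCR against modification.
    # GCR stores image data in buckets named [REGION.]artifacts.PROJECT.appspot.com.
    PROJECT="k8s-artifacts-prod-backup"   # hypothetical backup project
    for bucket in "artifacts.${PROJECT}.appspot.com" \
                  "us.artifacts.${PROJECT}.appspot.com" \
                  "eu.artifacts.${PROJECT}.appspot.com" \
                  "asia.artifacts.${PROJECT}.appspot.com"; do
      gsutil retention set 10y "gs://${bucket}"
    done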

Splitting or stretching the backup window has issues: gcrane doesn't support rate limits AFAIK, and when we run into quotas we block updates to, say, prow.

@listx (Contributor, Author) commented Mar 19, 2020

> First suggestion would be to back up only a single region and add a separate sync check. We should be checking that all regions are in sync anyway.

> Is copying into the same prefix/timestamp faster?

How do you mean, exactly? (You can take a look here for some examples of timestamped prefixes.)

> We could copy into the same location with gcrane cp -r and set a long-term retention policy to prevent modifications. That goes against the GCR team's recommendation not to assume or work with anything in GCS directly, but AFAIK we already do this for prod, so we could do it here as well.

I'd rather not deal with GCS (implementation) details. We have been advised a number of times by the GCR team to treat GCS as opaque.

> Splitting or stretching the backup window has issues: gcrane doesn't support rate limits AFAIK, and when we run into quotas we block updates to, say, prow.

I'm working on a design with some ideas; will share soon. Stay tuned.

@listx (Contributor, Author) commented Mar 20, 2020

> Is copying into the same prefix/timestamp faster?

Sorry, I think I understand your comment now. Last time I checked, gcrane does not work any faster even if you copy into the same prefix. It takes a long time (hours) to traverse the ~30K images and realize that nothing needs to be copied over. This is different from the promoter, which runs much faster here because it aggressively ignores images that have already been promoted.
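
The delta approach from the issue description avoids that full traversal: list what is already in the backup, diff against prod, and copy only the missing digests. A minimal sketch under a few assumptions (placeholder project names, a single level of repositories, and gcrane cp accepting a digest reference on the destination):

    # Sketch of a delta backup: copy only digests present in prod but
    # missing from the backup. Project names are placeholders.
    PROD="gcr.io/k8s-artifacts-prod"
    BACKUP="gcr.io/k8s-artifacts-prod-backup"

    list_digests() {
      # Emit "repo@sha256:..." for every image under $1, relative to $1.
      # (Only one repository level; a real script would recurse.)
      local root="$1"
      gcloud container images list --repository="${root}" --format='get(name)' |
        while read -r repo; do
          gcloud container images list-tags "${repo}" --format='get(digest)' |
            sed "s|^|${repo}@|"
        done | sed "s|^${root}/||" | sort
    }

    # Digests in prod that are not yet in the backup.
    comm -23 <(list_digests "${PROD}") <(list_digests "${BACKUP}") |
      while read -r missing; do
        # Assumes gcrane cp accepts a digest reference for the destination.
        gcrane cp "${PROD}/${missing}" "${BACKUP}/${missing}"
      done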

@listx (Contributor, Author) commented Mar 20, 2020

listx mentioned this issue Mar 20, 2020
@listx (Contributor, Author) commented Apr 3, 2020

This was fixed by #677

/close

@k8s-ci-robot (Contributor)

@listx: Closing this issue.

In response to this:

> This was fixed by #677
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
