
optimize k8s.gcr.io backup script #666

Closed · listx opened this issue Mar 18, 2020 · 6 comments

@listx (Contributor) commented Mar 18, 2020

We are exceeding quota for GCR by doing a simple copy of ~30K images in each of the 3 regions (for a total of ~90K image copies).

We have spoken with the GCR team, and there is currently no way of granting special quota privileges to particular GCP projects. While simply increasing the quota might be worth pursuing, I think we can just work on optimizing the backup scripts. There are a couple of reasons for this:

  • We currently do a simple gcrane cp -r <prod> <prod-backup>/<timestamp>, and while this works, it is slow (on the order of hours to complete), even if there are 0 changes since the last backup (a rough sketch of this copy follows the list).
  • The prod GCR (k8s-artifacts-prod) will be immutable (the promoter does not allow mutations), so it is essentially an ever-growing registry. Given this, we can make the backup add only the delta of new images since the last snapshot, as previously backed-up copies will never change or be removed.
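
For concreteness, here is a rough sketch of the current full copy; the project names, regions, and timestamp layout are placeholders rather than the real values used by the backup job:

    # Sketch of the current per-region full copy described above.
    # Project names, regions, and the snapshot prefix are placeholders.
    PROD_PROJECT="k8s-artifacts-prod"
    BACKUP_PROJECT="k8s-artifacts-prod-backup"   # hypothetical backup project
    SNAPSHOT="$(date -u +%Y/%m/%d)"              # hypothetical prefix layout
    for region in us eu asia; do
      gcrane cp -r "${region}.gcr.io/${PROD_PROJECT}" \
                   "${region}.gcr.io/${BACKUP_PROJECT}/${SNAPSHOT}"
    done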

I'll come up with a more detailed design soon.

/assign @listx

@stp-ip (Member) commented Mar 19, 2020

First suggestion would be to back up only a single region and add a separate sync check. We should be checking that all regions are in sync anyway.
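
As a rough illustration of such a sync check (the project name is a placeholder, and a real check would also recurse into nested repositories and compare digests, not just repository names):

    # Sketch: compare the repository listings of the regional mirrors.
    PROJECT="k8s-artifacts-prod"   # placeholder for the real prod project
    list_repos() {
      # List repositories under $1 with the registry host stripped, so that
      # listings from different regions can be diffed directly.
      gcloud container images list --repository="$1" --format='get(name)' |
        sed 's|^[^/]*/||' | sort
    }
    for region in eu asia; do
      diff <(list_repos "us.gcr.io/${PROJECT}") \
           <(list_repos "${region}.gcr.io/${PROJECT}") \
        || echo "us and ${region} are out of sync"
    done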

Is copying into the same prefix/timestamp faster?

We could copy into the same location with gcrane cp -r and set a long-term retention policy to prevent modifications. That goes against the GCR team's recommendation not to assume or work with anything in GCS directly, but AFAIK we already do this for prod, so we could do it here as well.
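
For reference, the retention idea would look roughly like this on the GCS buckets that back GCR; the project name is illustrative, and this is exactly the kind of reliance on GCS internals mentioned above:

    # Sketch: lock the GCS buckets backing GCR against modification.
    # GCR stores image data in buckets named [REGION.]artifacts.PROJECT.appspot.com.
    PROJECT="k8s-artifacts-prod-backup"   # hypothetical backup project
    for bucket in "artifacts.${PROJECT}.appspot.com" \
                  "us.artifacts.${PROJECT}.appspot.com" \
                  "eu.artifacts.${PROJECT}.appspot.com" \
                  "asia.artifacts.${PROJECT}.appspot.com"; do
      gsutil retention set 10y "gs://${bucket}"
    done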

Splitting or stretching the backup window has issues: gcrane doesn't support rate limits AFAIK, and when we run into quotas we block updates to, say, prow.

@listx (Contributor, Author) commented Mar 19, 2020

> First suggestion would be to back up only a single region and add a separate sync check. We should be checking that all regions are in sync anyway.

> Is copying into the same prefix/timestamp faster?

How do you mean, exactly? (You can take a look here for some examples of timestamped prefixes.)

> We could copy into the same location with gcrane cp -r and set a long-term retention policy to prevent modifications. That goes against the GCR team's recommendation not to assume or work with anything in GCS directly, but AFAIK we already do this for prod, so we could do it here as well.

I'd rather not deal with GCS (implementation) details. We have been advised a number of times by the GCR team to treat GCS as opaque.

> Splitting or stretching the backup window has issues: gcrane doesn't support rate limits AFAIK, and when we run into quotas we block updates to, say, prow.

I'm working on a design with some ideas; will share soon. Stay tuned.

@listx (Contributor, Author) commented Mar 20, 2020

> Is copying into the same prefix/timestamp faster?

Sorry, I think I understand your comment now. Last time I checked, gcrane does not work any faster even if you copy into the same prefix. It takes a long time (hours) to traverse the ~30K images and realize that nothing needs to be copied over. This is different from the promoter, which runs much faster here because it aggressively ignores images that have already been promoted.
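
The delta approach from the issue description avoids that full traversal: list what is already in the backup, diff against prod, and copy only the missing digests. A minimal sketch under a few assumptions (placeholder project names, a single level of repositories, and gcrane cp accepting a digest reference on the destination):

    # Sketch of a delta backup: copy only digests present in prod but
    # missing from the backup. Project names are placeholders.
    PROD="gcr.io/k8s-artifacts-prod"
    BACKUP="gcr.io/k8s-artifacts-prod-backup"

    list_digests() {
      # Emit "repo@sha256:..." for every image under $1, relative to $1.
      # (Only one repository level; a real script would recurse.)
      local root="$1"
      gcloud container images list --repository="${root}" --format='get(name)' |
        while read -r repo; do
          gcloud container images list-tags "${repo}" --format='get(digest)' |
            sed "s|^|${repo}@|"
        done | sed "s|^${root}/||" | sort
    }

    # Digests in prod that are not yet in the backup.
    comm -23 <(list_digests "${PROD}") <(list_digests "${BACKUP}") |
      while read -r missing; do
        # Assumes gcrane cp accepts a digest reference for the destination.
        gcrane cp "${PROD}/${missing}" "${BACKUP}/${missing}"
      done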

@listx (Contributor, Author) commented Mar 20, 2020

listx mentioned this issue Mar 20, 2020
@listx (Contributor, Author) commented Apr 3, 2020

This was fixed by #677

/close

@k8s-ci-robot (Contributor)

@listx: Closing this issue.

In response to this:

> This was fixed by #677
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
