Optimize backups #677
Conversation
Looks like the secret was deleted (error from the
I'll take this setback as an opportunity to move to Workload Identity.
I am going with kubernetes/test-infra#16883 first to see how this Workload Identity stuff plays out in a simpler context. The context there is simpler because we already actuated the WI KSA empowerment stuff in #655. For us to use WI here, I would first have to add another special case for the promoter GCP SA (specifically,

Once kubernetes/test-infra#16883 is merged, the other changes described above will follow. After all that, the steps to re-enable backups are as follows: (1) create another workload identity empowerment PR like #655 for
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: listx, spiffxp

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
@spiffxp I need to wait for the workload identity stuff to get submitted first, namely #710 and kubernetes/test-infra#17048. At that point, I can use the

This PR is blocked by
/retest
This is because we will be using workload identity for these jobs, so the jobs will start out already authenticated as the GCP service accounts.
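For illustration only (this is not part of the job definition), a minimal sketch of how a pod can confirm which GCP service account it is running as under Workload Identity, by asking the GKE metadata server:

```bash
# Under Workload Identity, the GKE metadata server answers this request with
# the email of the GCP service account bound to the pod's KSA, e.g.
# k8s-infra-gcr-promoter@k8s-artifacts-prod-bak.iam.gserviceaccount.com.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
```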
/test pull-k8sio-backup
@jonjohnsonjr I've moved

For example I can see in the

but it is failing with:

I guess
For more context, the gcrane version was built from 3d03ed9b1ca2ad5d78d43832e8e46adc31d2b961 (master HEAD).
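For reference, a minimal sketch of the kind of recursive copy the backup job runs with gcrane (my own illustration, not the exact script; repository names are assumptions based on this thread):

```bash
# Recursively copy every image under the prod repository into the backup
# repository; gcrane skips blobs and manifests that already exist in the
# target, which is why a mostly pre-populated backup run is close to a NOP.
gcrane cp -r us.gcr.io/k8s-artifacts-prod us.gcr.io/k8s-artifacts-prod-bak
```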
/lgtm
/wip Still working on resolving the auth issues post-WI...!
If gcrane fails to delete the images, this loop might run forever until the job times out. Instead, fail after 5 attempts.
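A minimal sketch of the capped retry, assuming a hypothetical `delete_images` helper that wraps the actual gcrane call:

```bash
# Retry the gcrane-based deletion at most 5 times instead of looping until
# the job times out; delete_images is a hypothetical stand-in for the real call.
max_attempts=5
for attempt in $(seq 1 "${max_attempts}"); do
  if delete_images; then
    break
  fi
  if [[ "${attempt}" -eq "${max_attempts}" ]]; then
    echo >&2 "giving up after ${max_attempts} attempts"
    exit 1
  fi
  echo >&2 "attempt ${attempt} failed; retrying..."
done
```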
Woohoo, it passed! FTR the auth was not working for gcrane because we didn't do

/hold cancel
/lgtm
This gives `k8s-prow.svc.id.goog[test-pods/k8s-infra-gcr-promoter-bak]` access to authenticate as `$(svc_acct_email "${PRODBAK_PROJECT}" "${PROMOTER_SVCACCT}")`, which currently resolves to `k8s-infra-gcr-promoter@k8s-artifacts-prod-bak.iam.gserviceaccount.com`. This is a preparatory step before we can re-introduce the backup job that was optimized in kubernetes#677.
This empowers the `k8s-infra-gcr-promoter-bak` KSA in the `test-pods` K8s namespace in the `k8s-prow` GCP Project (where the Prow trusted cluster lives) to authenticate as `k8s-infra-gcr-promoter@k8s-artifacts-prod-bak.iam.gserviceaccount.com`. The `k8s-infra-gcr-promoter-bak` KSA does not exist yet and will be created when we re-introduce the backup job (pulled on 2020-03-18 due to quota issues). The backup job itself was optimized in kubernetes#677.
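A minimal sketch of the kind of IAM binding such an empowerment creates (the exact helper used in this repo may differ; the names below are taken from the commit messages above):

```bash
# Allow the KSA test-pods/k8s-infra-gcr-promoter-bak in the k8s-prow workload
# identity pool to impersonate the backup promoter GCP service account.
gcloud iam service-accounts add-iam-policy-binding \
  k8s-infra-gcr-promoter@k8s-artifacts-prod-bak.iam.gserviceaccount.com \
  --project k8s-artifacts-prod-bak \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:k8s-prow.svc.id.goog[test-pods/k8s-infra-gcr-promoter-bak]"
```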
This reverts commit a6a2655. The backup job has received some optimizations in kubernetes/k8s.io#677. In addition, the k8s-artifacts-prod-bak GCR has been manually pre-populated with all ~30K images in k8s-artifacts-prod for all ASIA, EU, and US regions, which will result in jobs taking just minutes to run (as subsequent runs are mostly NOP runs). For more discussion on the backup job, please see https://docs.google.com/document/d/11eiosJvm2xEVUhPRU3-luANxxTPL5FqQdJXVrLPImyQ/edit?usp=sharing.
Just following up on the work items:
This was done here: kubernetes/test-infra#17150
The first successful run, which backed up a handful of images, is here: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-k8sio-backup/1248214661472456704 This took 19 minutes, which isn't too bad considering it covers 3 regions, serially, so ~6-7 minutes per region. We could speed this up by running the 3 copies in parallel (using something like GNU parallel, as sketched below), but I don't think it's worth it (we would most likely have to install it at the beginning of the job, because I doubt it's in the image we use).
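For illustration only (the job keeps the copies serial), a sketch of what parallelizing the per-region copies could look like, assuming GNU parallel were available in the image; region and repository names are assumptions based on this thread:

```bash
# Copy the three regional repositories concurrently instead of serially.
parallel \
  'gcrane cp -r {}.gcr.io/k8s-artifacts-prod {}.gcr.io/k8s-artifacts-prod-bak' \
  ::: asia eu us
```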
According to this comment thread, a NOP copy would only use ~3K requests to list all images for 1 region. If we ran the job hourly, that would be a minimum of 3K * 3 regions, or ~9K requests per hour, or 24 (hours) * 9K, or ~216K requests to GCR per day. These are both well under the quota limits here (~50K requests per 10 minutes, or 1000K requests per day). However, the request count still feels heavy to me to be doing this hourly. I think it's OK to leave it at 12h intervals for now.
This implements the optimizations described in kubernetes/release#270 (comment) and #666.
/hold
/wip