Optimize backups #677
Conversation
Looks like the secret was deleted (error from the
I'll take this setback as an opportunity to move to Workload Identity.
I am going with kubernetes/test-infra#16883 first to see how this Workload Identity stuff plays out in a simpler context. The context there is simpler because we already actuated the WI KSA empowerment stuff in #655. For us to use WI here, I would first have to add another special case for the promoter GCP SA (specifically,

Once kubernetes/test-infra#16883 is merged, the other changes described above will follow. After all that, the steps to re-enable backups are as follows: (1) create another workload identity empowerment PR like #655 for
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: listx, spiffxp

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
@spiffxp I need to wait for the workload identity stuff to get submitted first, namely #710 and kubernetes/test-infra#17048. At that point, I can use the

This PR is blocked by
/retest
This is because we will be using workload identity for these jobs, so the jobs will start out already authenticated as the GCP service accounts.
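For illustration only (this is not part of the job definition), a minimal sketch of how a pod can confirm which GCP service account it is running as under Workload Identity, by asking the GKE metadata server:

```bash
# Under Workload Identity, the GKE metadata server answers this request with
# the email of the GCP service account bound to the pod's KSA, e.g.
# k8s-infra-gcr-promoter@k8s-artifacts-prod-bak.iam.gserviceaccount.com.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
```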
/test pull-k8sio-backup
@jonjohnsonjr I've moved

For example I can see in the

but it is failing with:

I guess
For more context, the gcrane version was built from 3d03ed9b1ca2ad5d78d43832e8e46adc31d2b961 (master HEAD).
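For reference, a minimal sketch of the kind of recursive copy the backup job runs with gcrane (my own illustration, not the exact script; repository names are assumptions based on this thread):

```bash
# Recursively copy every image under the prod repository into the backup
# repository; gcrane skips blobs and manifests that already exist in the
# target, which is why a mostly pre-populated backup run is close to a NOP.
gcrane cp -r us.gcr.io/k8s-artifacts-prod us.gcr.io/k8s-artifacts-prod-bak
```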
/lgtm
/wip Still working on resolving the auth issues post-WI...!
If gcrane fails to delete the images, this loop might run forever until the job times out. Instead, fail after 5 attempts.
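A minimal sketch of the capped retry, assuming a hypothetical `delete_images` helper that wraps the actual gcrane call:

```bash
# Retry the gcrane-based deletion at most 5 times instead of looping until
# the job times out; delete_images is a hypothetical stand-in for the real call.
max_attempts=5
for attempt in $(seq 1 "${max_attempts}"); do
  if delete_images; then
    break
  fi
  if [[ "${attempt}" -eq "${max_attempts}" ]]; then
    echo >&2 "giving up after ${max_attempts} attempts"
    exit 1
  fi
  echo >&2 "attempt ${attempt} failed; retrying..."
done
```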
Woohoo, it passed! FTR the auth was not working for gcrane because we didn't do

/hold cancel
/lgtm
This gives `k8s-prow.svc.id.goog[test-pods/k8s-infra-gcr-promoter-bak]` access to authenticate as `$(svc_acct_email "${PRODBAK_PROJECT}" "${PROMOTER_SVCACCT}")`, which currently resolves to `k8s-infra-gcr-promoter@k8s-artifacts-prod-bak.iam.gserviceaccount.com`. This is a preparatory step before we can re-introduce the backup job that was optimized in kubernetes#677.
This empowers the `k8s-infra-gcr-promoter-bak` KSA in the `test-pods` K8s namespace in the `k8s-prow` GCP Project (where the Prow trusted cluster lives) to authenticate as `k8s-infra-gcr-promoter@k8s-artifacts-prod-bak.iam.gserviceaccount.com`. The `k8s-infra-gcr-promoter-bak` KSA does not exist yet and will be created when we re-introduce the backup job (pulled on 2020-03-18 due to quota issues). The backup job itself was optimized in kubernetes#677.
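A minimal sketch of the kind of IAM binding such an empowerment creates (the exact helper used in this repo may differ; the names below are taken from the commit messages above):

```bash
# Allow the KSA test-pods/k8s-infra-gcr-promoter-bak in the k8s-prow workload
# identity pool to impersonate the backup promoter GCP service account.
gcloud iam service-accounts add-iam-policy-binding \
  k8s-infra-gcr-promoter@k8s-artifacts-prod-bak.iam.gserviceaccount.com \
  --project k8s-artifacts-prod-bak \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:k8s-prow.svc.id.goog[test-pods/k8s-infra-gcr-promoter-bak]"
```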
This reverts commit a6a2655. The backup job has received some optimizations in kubernetes/k8s.io#677. In addition, the k8s-artifacts-prod-bak GCR has been manually pre-populated with all ~30K images in k8s-artifacts-prod for all ASIA, EU, and US regions, which will result in jobs taking just minutes to run (as subsequent runs are mostly NOP runs). For more discussion on the backup job, please see https://docs.google.com/document/d/11eiosJvm2xEVUhPRU3-luANxxTPL5FqQdJXVrLPImyQ/edit?usp=sharing.
Just following up on the work items:
This was done here: kubernetes/test-infra#17150
The first successful run, which backed up a handful of images, is here: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-k8sio-backup/1248214661472456704 This took 19 minutes, which isn't too bad considering it covers 3 regions, serially, so ~6-7 minutes per region. We could speed this up by running the 3 copies in parallel (using something like GNU parallel, as sketched below), but I don't think it's worth it (we would most likely have to install it at the beginning of the job, because I doubt it's in the image we use).
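For illustration only (the job keeps the copies serial), a sketch of what parallelizing the per-region copies could look like, assuming GNU parallel were available in the image; region and repository names are assumptions based on this thread:

```bash
# Copy the three regional repositories concurrently instead of serially.
parallel \
  'gcrane cp -r {}.gcr.io/k8s-artifacts-prod {}.gcr.io/k8s-artifacts-prod-bak' \
  ::: asia eu us
```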
According to this comment thread, a NOP copy would only use ~3K requests to list all images for 1 region. If we ran the job hourly, that would be a minimum of 3K * 3 regions, or ~9K requests per hour, or 24 (hours) * 9K, or ~216K requests to GCR per day. These are both well under the quota limits here (~50K requests per 10 minutes, or 1000K requests per day). However, the request count still feels heavy to me to be doing this hourly. I think it's OK to leave it at 12h intervals for now.
This implements the optimizations described in kubernetes/release#270 (comment) and #666.
/hold
/wip