cleanup_ci doesn't clean up auto-deploy IAM policy members #543
Auto-deploy IAM policy members are not being cleaned up; thus we hit the quota limit every week and have to clean it up manually.
Comments
@gabrielwen can you give a link to an example of auto-deploy IAM policy members that should be GC'd? |
An example looks like this: this might not necessarily need to be purged, as it might not be expired. That said, I think for auto-deployed accounts we should simply check whether the deployment is still there and delete the accounts if it is not.
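A minimal sketch of that idea, assuming the auto-deployed clusters are Deployment Manager deployments (the project name and the email-to-deployment mapping here are hypothetical, not from cleanup_ci.py):

```python
import googleapiclient.discovery
import googleapiclient.errors

PROJECT = "kubeflow-ci-deployment"  # assumed project

def deployment_exists(dm, name):
    """Return True if the Deployment Manager deployment still exists."""
    try:
        dm.deployments().get(project=PROJECT, deployment=name).execute()
        return True
    except googleapiclient.errors.HttpError as err:
        if err.resp.status == 404:
            return False
        raise

def delete_if_orphaned(email, deployment_name):
    """Delete an auto-deploy service account once its deployment is gone."""
    dm = googleapiclient.discovery.build("deploymentmanager", "v2")
    iam = googleapiclient.discovery.build("iam", "v1")
    if not deployment_exists(dm, deployment_name):
        iam.projects().serviceAccounts().delete(
            name=f"projects/{PROJECT}/serviceAccounts/{email}").execute()
```
|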
Thanks. I have more questions about the context of the kf-master auto-deployment, because I think the cleanup script is mainly for cleaning up short-lived testing resources.
Is the goal of kf-master to periodically deploy master? How often?
Can kf-master completely delete the previous deployment, including all service accounts, and then deploy the new version?
|
Cleanup should include IAM bindings in project "kubeflow-ci-deployment". |
We are hitting the limit again; see kubeflow/pipelines#712. Here is the current IAM policy. It looks like the memberships that aren't being GC'd are of the form:
|
The logic to clean up service accounts is here: testing/py/kubeflow/testing/cleanup_ci.py, line 782 (at aabd75f).
The logic is to trim unused bindings for any service accounts that don't exist. Here's the logic to GC service accounts: testing/py/kubeflow/testing/cleanup_ci.py, line 782 (at aabd75f).
My suspicion is that the regex for the AutoDeploy patterns is not matching these names, so the emails are skipped: testing/py/kubeflow/testing/cleanup_ci.py, line 23 (at aabd75f).
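To illustrate the suspicion, here is a simplified sketch of the suspected control flow, not the actual source; the pattern below is purely illustrative:

```python
import re

# Hypothetical auto-deploy pattern; purely illustrative.
PATTERNS = [re.compile(r"^kf-.*-admin@.*$")]

def members_to_trim(members, existing_emails):
    """Members whose bindings get trimmed under the suspected buggy logic."""
    trimmed = []
    for member in members:
        if not member.startswith("serviceAccount:"):
            continue
        email = member.split(":", 1)[1]
        # Suspected bug: an email that doesn't match any pattern is skipped
        # here, even if the underlying service account no longer exists.
        if not any(p.match(email) for p in PATTERNS):
            continue
        if email not in existing_emails:
            trimmed.append(member)
    return trimmed
```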
|
Here's a list of service accounts currently in project kubeflow-ci-deployment, and here are the policy bindings: kubeflow.ci.deployment.policy.txt. It looks like the policy includes bindings for service accounts which don't exist and should have been GC'd, e.g.
So it looks like there might be a bug in the cleanup logic, which should be removing bindings for nonexistent service accounts. |
Here are the logs from the most recent run of the cleanup CI script. It looks like the code to clean up bindings is invoked and no exceptions are reported, but it also isn't printing out any info that would help us debug. |
I think the bug is here: testing/py/kubeflow/testing/cleanup_ci.py, line 770 (at aabd75f).
I don't think we need or want to check whether the pattern matches: if a service account no longer exists, that should be sufficient reason to remove its bindings.
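A minimal sketch of that simpler criterion (assumed, not the merged change itself): existence is the only test, and restricting to the project's own account suffix keeps service-agent accounts created by GCP services out of scope:

```python
from googleapiclient import discovery

def prune_missing_accounts(project, policy):
    """Drop binding members for project service accounts that no longer exist."""
    iam = discovery.build("iam", "v1")
    resp = iam.projects().serviceAccounts().list(
        name=f"projects/{project}").execute()  # pagination omitted for brevity
    existing = {a["email"] for a in resp.get("accounts", [])}
    # Only touch accounts owned by this project; accounts created by GCP
    # services live under other domains and are left alone.
    suffix = f"@{project}.iam.gserviceaccount.com"

    for binding in policy.get("bindings", []):
        binding["members"] = [
            m for m in binding["members"]
            if not (m.startswith("serviceAccount:")
                    and m.split(":", 1)[1].endswith(suffix)
                    and m.split(":", 1)[1] not in existing)
        ]
    return policy
```
|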
* We are not properly GC'ing policy bindings for deleted service accounts.
* The problem is that we only consider service accounts matching a certain regex, and that regex isn't matching the service accounts for our auto-deployed clusters.
* Using a regex should be unnecessary: if a service account doesn't exist, that should be a sufficient criterion for deleting its policy bindings.
Related to: kubeflow#543
Here are the logs from a one-off run; it looks like a bunch of old bindings were pruned. Here's the current policy: kubeflow.ci.deployment.policy.txt. I think we can close this once #566 is merged. |
* Fix cleanup logic for IAM policy bindings.
* We are not properly GC'ing policy bindings for deleted service accounts.
* The problem is that we only consider service accounts matching a certain regex, and that regex isn't matching the service accounts for our auto-deployed clusters.
* Using a regex should be unnecessary: if a service account doesn't exist, that should be a sufficient criterion for deleting its policy bindings.
* Fix typo.
* Fix syntax issue.
Related to: #543
Looks like I was overzealous, and we deleted bindings for the service accounts created by GCP services. We will need to restore the IAM policies. Here's the policy for kubeflow-ci. The policy for kubeflow-ci-deployment is available in the previous comment. |
Going to restore the IAM policy for kubeflow-ci. Here's the current policy before the restore:
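One way to do the restore programmatically, as a hedged sketch (the thread doesn't say how it was done; gcloud projects set-iam-policy would work equally well, and the saved-policy file name is hypothetical):

```python
import json
from googleapiclient import discovery

crm = discovery.build("cloudresourcemanager", "v1")

# Load the previously exported policy (hypothetical file name).
with open("kubeflow-ci.policy.saved.json") as f:
    saved = json.load(f)

# Fetch the live policy for a fresh etag, then write the saved bindings
# back; setIamPolicy rejects the write if the etag is stale.
current = crm.projects().getIamPolicy(resource="kubeflow-ci", body={}).execute()
crm.projects().setIamPolicy(
    resource="kubeflow-ci",
    body={"policy": {"bindings": saved["bindings"], "etag": current["etag"]}},
).execute()
```
|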
I rolled back the policy for kubeflow-ci; here's the current policy. |
I rolled back the policy for kubeflow-ci-deployment. Prior to the rollback the policy was [attachment]; I set the policy to [attachment]. This was necessary to restore the policy bindings that grant code running in the test cluster in kubeflow-ci access to the clusters in kubeflow-ci-deployment. This is the script I used to generate the new policy from the policy in [attachment].
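The script itself wasn't captured above; this is a hedged sketch of one way to generate the merged policy, taking the union of members per role so restored bindings come back without dropping anything granted since (the file arguments are hypothetical):

```python
import json
import sys

def merge_policies(current, saved):
    """Union the members of each role from both policies."""
    by_role = {b["role"]: set(b["members"]) for b in current.get("bindings", [])}
    for b in saved.get("bindings", []):
        by_role.setdefault(b["role"], set()).update(b["members"])
    current["bindings"] = [
        {"role": role, "members": sorted(members)}
        for role, members in sorted(by_role.items())
    ]
    return current  # keeps the live policy's etag for set-iam-policy

if __name__ == "__main__":
    with open(sys.argv[1]) as f:  # live policy, exported as JSON
        current = json.load(f)
    with open(sys.argv[2]) as f:  # saved policy to restore
        saved = json.load(f)
    print(json.dumps(merge_policies(current, saved), indent=2))
```
|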
As a result of the deleted bindings, tests would fail with permissions errors like this one:
Hopefully this is fixed now. The test for kubeflow/manifests#707 is now running again, so we will see. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions. |
This issue has been closed due to inactivity. |