
cleanup_ci doesn't clean up auto-deploy IAM policy members #543

Closed
gabrielwen opened this issue Dec 12, 2019 · 18 comments

@gabrielwen (Contributor)

As a result, we hit the quota limit every week and have to clean it up manually.

@zhenghuiwang (Contributor)

@gabrielwen can you give a link to an example of auto-deploy IAM policy members that should be GCed?

@gabrielwen (Contributor, Author)

An example looks like this: kf-vmaster-1211-099-user@kubeflow-ci-deployment.iam.gserviceaccount.com

This particular account might not need to be purged, since it might not have expired yet. That said, I think that for auto-deployed accounts we should simply check whether the corresponding deployment still exists, and delete the accounts if it does not.
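A minimal sketch of the existence check proposed above, assuming the auto-deployed accounts follow the <deployment>-user@<project>.iam.gserviceaccount.com naming seen in the example and that the deployments live in Deployment Manager (both are assumptions, not confirmed in this thread):

import googleapiclient.discovery

def gc_orphaned_auto_deploy_accounts(project, dry_run=True):
  """Delete auto-deploy service accounts whose deployment is gone."""
  dm = googleapiclient.discovery.build("deploymentmanager", "v2")
  iam = googleapiclient.discovery.build("iam", "v1")

  # Collect the names of deployments that still exist.
  deployments = set()
  request = dm.deployments().list(project=project)
  while request is not None:
    response = request.execute()
    for d in response.get("deployments", []):
      deployments.add(d["name"])
    request = dm.deployments().list_next(request, response)

  # Delete auto-deploy service accounts whose deployment no longer exists.
  accounts = iam.projects().serviceAccounts().list(
      name="projects/" + project).execute()
  for account in accounts.get("accounts", []):
    email = account["email"]
    if "kf-vmaster-" not in email:
      continue  # only touch auto-deployed accounts
    # Assumed convention: kf-vmaster-1211-099-user@... -> kf-vmaster-1211-099
    deployment = email.split("@")[0].rsplit("-user", 1)[0]
    if deployment not in deployments:
      print("deleting" if not dry_run else "would delete", email)
      if not dry_run:
        iam.projects().serviceAccounts().delete(
            name=account["name"]).execute()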

@zhenghuiwang (Contributor) commented Dec 13, 2019 via email

Thanks. I have a few more questions about the context of the kf-master auto deployment, because I think the cleanup script is mainly for cleaning up short-lived testing resources. Is the goal of 'kf-master' to periodically deploy master? How often? Can 'kf-master' completely delete the previous deployment, along with all of its service accounts, and then deploy the new version?


@kunmingg (Contributor)

Cleanup should include IAM bindings in project "kubeflow-ci-deployment".
@jlewi @richardsliu

@jlewi (Contributor) commented Jan 21, 2020

We are hitting the limit again; see kubeflow/pipelines#712.

Here is the current IAM policy:
kubeflow.ci.deployment.policy.txt

It looks like the memberships that aren't being GC'd are of the form:

  - serviceAccount:kf-vmaster-0121-b11-user@kubeflow-ci-deployment.iam.gserviceaccount.com
  - serviceAccount:kf-vmaster-0121-ec5-user@kubeflow-ci-deployment.iam.gserviceaccount.com
  - serviceAccount:kf-vmaster-0121-fc4-user@kubeflow-ci-deployment.iam.gserviceaccount.com

@jlewi (Contributor) commented Jan 21, 2020

The logic to clean up service account bindings is here:

def cleanup_service_account_bindings(args):

The logic is to trim unused bindings for any service accounts that don't exist.
So the bindings likely aren't deleted because the service accounts aren't GC'd.
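As a rough sketch (not the repo's actual function), that trimming amounts to something like this: fetch the policy, compute the set of service accounts that still exist, and drop members that refer to accounts that don't:

import googleapiclient.discovery

def trim_dead_sa_bindings(project, dry_run=True):
  crm = googleapiclient.discovery.build("cloudresourcemanager", "v1")
  iam = googleapiclient.discovery.build("iam", "v1")

  # Emails of service accounts that still exist in the project.
  existing = set()
  accounts = iam.projects().serviceAccounts().list(
      name="projects/" + project).execute()
  for account in accounts.get("accounts", []):
    existing.add(account["email"])

  suffix = "@{0}.iam.gserviceaccount.com".format(project)
  policy = crm.projects().getIamPolicy(resource=project, body={}).execute()
  for binding in policy.get("bindings", []):
    kept = []
    for member in binding.get("members", []):
      kind, _, email = member.partition(":")
      # Only prune the project's own accounts; leave Google-managed
      # service agents (which live in other domains) alone.
      if (kind == "serviceAccount" and email.endswith(suffix)
          and email not in existing):
        print("pruning", member, "from", binding["role"])
        continue
      kept.append(member)
    binding["members"] = kept

  if not dry_run:
    crm.projects().setIamPolicy(
        resource=project, body={"policy": policy}).execute()

Restricting the prune to the project's own @<project>.iam.gserviceaccount.com domain matters: removing bindings for Google-managed service agents is exactly the failure mode hit later in this thread.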

Here's the logic to GC service accounts

def cleanup_service_accounts(args):

My suspicion is that the regex for the AutoDeploy patterns is not matching these names and so the emails are skipped.

AUTO_DEPLOY_PATTERNS = [re.compile(r".*kf-vmaster-(?!n\d\d)")]
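A quick REPL check of that suspicion, using the leftover emails from the policy dump above:

import re

AUTO_DEPLOY_PATTERNS = [re.compile(r".*kf-vmaster-(?!n\d\d)")]

leftovers = [
    "kf-vmaster-0121-b11-user@kubeflow-ci-deployment.iam.gserviceaccount.com",
    "kf-vmaster-0121-ec5-user@kubeflow-ci-deployment.iam.gserviceaccount.com",
]

for email in leftovers:
  matched = any(p.match(email) for p in AUTO_DEPLOY_PATTERNS)
  print(email, "->", "matched" if matched else "not matched (binding kept)")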

@jlewi (Contributor) commented Jan 21, 2020

Here's a list of service accounts currently in project: kubeflow-ci-deployment

serviceaccounts.list.txt

And here's the policy bindings

kubeflow.ci.deployment.policy.txt

It looks like the policy includes bindings for service accounts that don't exist and should have been GC'd.

e.g.

  - serviceAccount:kf-vmaster-0102-bef-user@kubeflow-ci-deployment.iam.gserviceaccount.com
  - serviceAccount:kf-vmaster-0102-daf-user@kubeflow-ci-deployment.iam.gserviceaccount.com
  - serviceAccount:kf-vmaster-0102-dcb-user@kubeflow-ci-deployment.iam.gserviceaccount.com
  - serviceAccount:kf-vmaster-0102-f2e-user@kubeflow-ci-deployment.iam.gserviceaccount.com

So it looks like there might be a bug in the cleanup logic that should be removing bindings for nonexistent service accounts.
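The comparison can be reproduced roughly like this, assuming the policy attachment is JSON (e.g. from gcloud projects get-iam-policy --format=json) and the account list is plain text; both file formats are assumptions based on the attachments:

import json
import re

with open("kubeflow.ci.deployment.policy.txt") as f:
  policy = json.load(f)

with open("serviceaccounts.list.txt") as f:
  existing = set(re.findall(
      r"[\w\-.+]+@[\w\-.]+\.gserviceaccount\.com", f.read()))

for binding in policy.get("bindings", []):
  for member in binding.get("members", []):
    kind, _, email = member.partition(":")
    if kind == "serviceAccount" and email not in existing:
      print("orphaned:", email, "role:", binding["role"])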

@jlewi (Contributor) commented Jan 21, 2020

Here are the logs from the most recent run of the cleanup_ci script.
cleanup-ci-kubeflow-ci-deployment-1579629600-zlrtq.log.txt

It looks like the code to clean up bindings is invoked and no exceptions are reported, but it also isn't printing any info that would help us debug.

@jlewi (Contributor) commented Jan 21, 2020

I think the bug is here:

if (not is_match(accountEmail)) or (accountEmail in accounts):

I don't think we need or want the is_match check here. That check is likely not matching the service accounts for our auto-deploy infra. Checking that the service account still exists should be sufficient.
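In other words, the proposed fix reduces the skip condition to the existence check alone. A sketch, assuming accounts is the set of service-account emails that still exist:

# Sketch of the proposed fix: existence alone decides whether bindings
# are kept; no is_match() pattern check.
def should_keep_bindings(accountEmail, accounts):
  return accountEmail in accounts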

jlewi pushed a commit to jlewi/testing that referenced this issue Jan 21, 2020
* We are not properly GC'ing policy bindings for deleted service accounts.

* The problem is that we only consider service accounts matching a certain
  regex and that regex isn't matching service accounts for our auto-deployed
  clusters.

* Using a regex should be unnecessary. If a service account doesn't exist
  that should be a sufficient criterion that the policy bindings should be
  deleted.

Related to: kubeflow#543
@jlewi (Contributor) commented Jan 21, 2020

Here are the logs from a one-off run.
oneoff-run.txt

It looks like a bunch of old bindings were GC'd.

Here's the current policy.

kubeflow.ci.deployment.policy.txt

Looks like a bunch of bindings were pruned.

I think we can close this once #566 is merged.

k8s-ci-robot pushed a commit that referenced this issue Jan 21, 2020
* Fix cleanup logic for IAM policy bindings

* We are not properly GC'ing policy bindings for deleted service accounts.

* The problem is that we only consider service accounts matching a certain
  regex and that regex isn't matching service accounts for our auto-deployed
  clusters.

* Using a regex should be unnecessary. If a service account doesn't exist
  that should be a sufficient criterion that the policy bindings should be
  deleted.

Related to: #543

* Fix typo.

* Fix syntax issue.
@jlewi (Contributor) commented Jan 21, 2020

Looks like I was overzealous and we deleted bindings for the service accounts created by GCP services.

We will need to restore the IAM policies.

Here's the policy for kubeflow-ci
kubeflow.ci.policy.txt

The policy for kubeflow-ci-deployment is available in the previous comment
#543 (comment)

@jlewi (Contributor) commented Jan 21, 2020

Going to restore the IAM policy for kubeflow-ci.

Here's the current policy before the restore:
kubeflow-ci.20200121-155208.yaml.txt
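For reference, a snapshot-before-restore flow like the one implied by the timestamped attachments can be done with gcloud; this is a sketch, and the known-good filename is hypothetical:

import subprocess
import time

project = "kubeflow-ci"
snapshot = "{0}.{1}.yaml".format(project, time.strftime("%Y%m%d-%H%M%S"))

# Snapshot the current policy before touching it
# (gcloud projects get-iam-policy emits YAML by default).
with open(snapshot, "w") as f:
  subprocess.run(["gcloud", "projects", "get-iam-policy", project],
                 stdout=f, check=True)

# Restore from a known-good policy file (hypothetical filename).
subprocess.run(["gcloud", "projects", "set-iam-policy", project,
                "kubeflow-ci.known-good.yaml"], check=True)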

@jlewi (Contributor) commented Jan 21, 2020

I rolled back the policy for kubeflow-ci.

Here's the current policy:
kubeflow-ci.20200121-155750.yaml.txt

@jlewi (Contributor) commented Jan 22, 2020

I rolled back the policy for kubeflow-ci-deployment.

Prior to the rollback, the policy was:
kubeflow-ci-deployment.20200121-173215.yaml.txt

I set the policy to:
kubeflow.ci.deployment.policy.restored.txt

This was necessary to restore the policy bindings that grant code running in the test cluster in kubeflow-ci access to the clusters in kubeflow-ci-deployment.

This is the script I used to generate the new policy from the policy in
#543 (comment)

https://github.com/jlewi/testing/blob/cleanup_bindings_restore/py/kubeflow/testing/fix_policy.py
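The linked fix_policy.py is the authoritative version; as a rough sketch, that kind of restore merges the members of a known-good policy into the current one, role by role:

import yaml

def merge_policies(current_path, known_good_path, out_path):
  with open(current_path) as f:
    current = yaml.safe_load(f)
  with open(known_good_path) as f:
    known_good = yaml.safe_load(f)

  # Index the current bindings by role, then union in the
  # known-good members.
  by_role = {b["role"]: b for b in current.get("bindings", [])}
  for binding in known_good.get("bindings", []):
    target = by_role.setdefault(
        binding["role"], {"role": binding["role"], "members": []})
    for member in binding.get("members", []):
      if member not in target["members"]:
        target["members"].append(member)

  current["bindings"] = sorted(by_role.values(), key=lambda b: b["role"])
  with open(out_path, "w") as f:
    yaml.safe_dump(current, f)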

@jlewi (Contributor) commented Jan 22, 2020

As a result of #543 (comment), tests would fail with permission errors like this one:

util.py                     72 INFO     time="2020-01-21T21:07:48Z" level=info msg="Creating deployment kfctl-23b8-storage" filename="gcp/gcp.go:462"
util.py                     72 INFO     Error: failed to apply:  (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp:  (kubeflow.error): Code 500 with message: gcp apply could not update deployment manager Error could not update storage-kubeflow.yaml; Insert deployment error: googleapi: Error 403: Required 'deploymentmanager.deployments.create' permission for 'projects/kubeflow-ci-deployment/global/deployments/kfctl-23b8-storage', forbidden

From
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_manifests/707/kubeflow-manifests-presubmit/1219724656517320704/

Hopefully this is fixed now. The test for kubeflow/manifests#707 is running again, so we will see.

stale bot commented Apr 21, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

stale bot commented Apr 28, 2020

This issue has been closed due to inactivity.

stale bot closed this as completed Apr 28, 2020