forked from kubeflow/pipelines
Commit
Add a playbook for dealing with cleaning up the test infrastructure. (kubeflow#338)

* Add a playbook and describe how to deal with the CI infrastructure running out of GCP quota.
* The cron/batch job for the CI system should not be pinned to checkout the code at PR 300; we should be using master.
* We are seeing socket errors contacting the DM service so add some retries and in the event of permanent failure try to keep going.

Related to: kubeflow#337
1 parent c6295ec, commit e08595f
Showing 3 changed files with 74 additions and 10 deletions.
@@ -0,0 +1,52 @@
# Kubeflow Test Infrastructure Playbook

This is a playbook for build cops to help deal with problems with the CI infrastructure.

## GCP Quota errors

1. List regional quotas to see which quotas are running hot

```
gcloud compute regions describe --project=kubeflow-ci ${REGION}
```
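
The output of the command above lists every quota, which can be hard to scan. As a minimal sketch, the helper below (`hot_quotas`, a hypothetical name not in the original playbook) filters rows of `metric,limit,usage` and prints only the quotas at or above a usage threshold. The 80% default and the `gcloud` format expression in the usage comment are assumptions.

```shell
# Print quotas whose usage is at or above a percentage threshold.
# stdin: CSV rows of "metric,limit,usage"; $1: threshold in percent (default 80, an assumed value).
hot_quotas() {
  awk -F, -v pct="${1:-80}" '
    # Skip rows with a zero limit to avoid dividing by zero.
    $2 > 0 && ($3 * 100 / $2) >= pct {
      printf "%s %.0f%%\n", $1, $3 * 100 / $2
    }'
}

# Assumed usage (needs gcloud credentials for the project):
# gcloud compute regions describe "${REGION}" --project=kubeflow-ci \
#   --flatten="quotas[]" \
#   --format="csv[no-heading](quotas.metric,quotas.limit,quotas.usage)" | hot_quotas 80
```

Keeping the filter separate from the `gcloud` call means it can be checked against canned input before running it on the live project.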

1. Check if we are leaking Kubeflow deployments and this is causing us to run out of quota.

```
gcloud --project=kubeflow-ci --format="table(name,createTime:sort=1,location,status)" container clusters list
gcloud --project=kubeflow-ci deployment-manager deployments list --format="table(name,insertTime:sort=1)"
```

* Deployments created by the E2E tests should be GC'd after roughly 2 hours
* So if there are resources older than roughly 2 hours, it indicates that there is a problem with garbage collection
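
The age check described above can be sketched as a small filter. The helper name `stale_resources` is hypothetical, the 2 hour default comes from the playbook text, and the script assumes GNU `date` (for `-d` parsing of RFC 3339 timestamps). The `value(...)` format in the usage comment is also an assumption about how to get plain `name timestamp` rows.

```shell
# Print names of resources older than a cutoff.
# stdin: "<name> <RFC 3339 timestamp>" per line.
# $1: maximum age in hours (default 2).
# $2: current time in epoch seconds (defaults to now; settable for testing).
stale_resources() {
  max_hours="${1:-2}"
  now="${2:-$(date +%s)}"
  while read -r name created; do
    # Skip rows whose timestamp cannot be parsed (GNU date assumed).
    created_epoch=$(date -d "${created}" +%s) || continue
    age=$(( now - created_epoch ))
    if [ "${age}" -gt $(( max_hours * 3600 )) ]; then
      echo "${name}"
    fi
  done
}

# Assumed usage (needs gcloud credentials for the project):
# gcloud --project=kubeflow-ci container clusters list \
#   --format="value(name,createTime)" | stale_resources 2
# gcloud --project=kubeflow-ci deployment-manager deployments list \
#   --format="value(name,insertTime)" | stale_resources 2
```

Any name this prints is a candidate leak that garbage collection should already have removed.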

1. Check if the cron job to GC resources is running in the test cluster

```
kubectl get cronjobs
NAME         SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cleanup-ci   0 */2 * * *   False     0        <none>          14m
```

* The cron job is defined in [cleanup-ci-cron.jsonnet](https://github.com/kubeflow/testing/blob/master/test-infra/ks_app/components/cleanup-ci-cron.jsonnet)

* If the cron job is not configured then start it.

1. Look for recent runs of the cron job and figure out whether they are running successfully

```
kubectl get jobs | grep cleanup-ci
```

* Jobs triggered by cron will match the glob pattern `cleanup-ci-??????????`

* Check that the job ran successfully

* The pods associated with the job can be fetched via labels

```
kubectl logs -l job-name=${JOBNAME}
```
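
To separate cron-triggered jobs from manually started ones, the ten-character suffix pattern above can be written as an anchored regular expression. This is a minimal sketch: the helper name `cron_triggered_jobs` is hypothetical, and the `kubectl` invocations in the usage comment are assumptions about how to list bare job names.

```shell
# Keep only job names of the form "cleanup-ci-" followed by exactly 10 characters.
# stdin: one job name per line.
cron_triggered_jobs() {
  grep -E '^cleanup-ci-.{10}$'
}

# Assumed usage (needs kubectl access to the test cluster):
# kubectl get jobs --no-headers -o custom-columns=NAME:.metadata.name \
#   | cron_triggered_jobs \
#   | while read -r JOBNAME; do
#       echo "== ${JOBNAME}"
#       kubectl logs -l job-name="${JOBNAME}"
#     done
```

Anchoring the pattern avoids matching jobs that merely contain `cleanup-ci` somewhere in their name.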