Commit

Add a playbook for cleaning up the test infrastructure. (kubeflow#338)

* Add a playbook and describe how to deal with the CI infrastructure running
  out of GCP quota.

* The cron/batch job for the CI system should not be pinned to check out
  the code at PR 300; we should be using master.

* We are seeing socket errors contacting the Deployment Manager service, so add
  some retries and, in the event of a permanent failure, try to keep going.

Related to: kubeflow#337
jlewi authored and k8s-ci-robot committed Mar 27, 2019
1 parent c6295ec commit e08595f
Showing 3 changed files with 74 additions and 10 deletions.
52 changes: 52 additions & 0 deletions playbook.md
@@ -0,0 +1,52 @@
# Kubeflow Test Infrastructure Playbook

This is a playbook to help build cops deal with problems in the CI infrastructure.


## GCP Quota errors

1. List regional quotas to see which quotas are running hot

```
gcloud compute regions describe --project=kubeflow-ci ${REGION}
```
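The describe output lists each quota as a `metric`/`limit`/`usage` triple. A small helper (hypothetical, assuming the JSON shape returned by adding `--format=json` to the command above) can flag the quotas running hot:

```python
def hot_quotas(region, threshold=0.8):
    """Return (metric, usage, limit) tuples for quotas at or above the
    given usage ratio.

    region: parsed JSON from
      gcloud compute regions describe ${REGION} --project=kubeflow-ci --format=json
    """
    hot = []
    for quota in region.get("quotas", []):
        limit = quota.get("limit", 0)
        # Skip quotas with no limit to avoid dividing by zero.
        if limit and quota.get("usage", 0) / limit >= threshold:
            hot.append((quota["metric"], quota["usage"], limit))
    return hot
```

Piping the command's JSON output through this helper narrows the list to the quotas worth investigating first.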

1. Check whether leaked Kubeflow deployments are causing us to run out of quota.

```
gcloud --project=kubeflow-ci --format="table(name,createTime:sort=1,location,status)" container clusters list
gcloud --project=kubeflow-ci deployment-manager deployments list --format="table(name,insertTime:sort=1)"
```

* Deployments created by the E2E tests should be GC'd after O(2) hours
* So resources older than O(2) hours indicate a problem with garbage collection
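The age rule above can be sketched as a small helper (a stdlib approximation of the repo's `getAge`; the exact timestamp format is an assumption based on the RFC 3339 `Z`-suffixed times GCP returns):

```python
import datetime

# Assumed UTC timestamp format, e.g. "2019-03-27T10:15:30Z".
RFC3339 = "%Y-%m-%dT%H:%M:%SZ"

def is_expired(insert_time, now=None, max_age=datetime.timedelta(hours=2)):
    """True if a resource created at insert_time is older than max_age."""
    created = datetime.datetime.strptime(insert_time, RFC3339)
    if now is None:
        now = datetime.datetime.utcnow()
    return now - created > max_age
```

Anything in the `createTime`/`insertTime` columns for which this returns True is a GC leak candidate.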

1. Check if the cron job to GC resources is running in the test cluster

```
kubectl get cronjobs
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
cleanup-ci 0 */2 * * * False 0 <none> 14m
```

* The cron job is defined in [cleanup-ci-cron.jsonnet](https://github.com/kubeflow/testing/blob/master/test-infra/ks_app/components/cleanup-ci-cron.jsonnet)

* If the cron job is not configured then start it.


1. Look for recent runs of the cron job and figure out whether they are running successfully

```
kubectl get jobs | grep cleanup-ci
```

* Jobs triggered by cron will have names matching the glob pattern `cleanup-ci-??????????`

* Check that the job ran successfully

* The pods associated with the job can be fetched via labels

```
kubectl logs -l job-name=${JOBNAME}
```
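The `cleanup-ci-??????????` pattern above is glob-style, each `?` matching one character of the Unix-timestamp suffix Kubernetes appends to cron-triggered jobs; a quick filter using Python's `fnmatch` (the helper name is illustrative):

```python
import fnmatch

# Glob pattern for jobs spawned by the cleanup-ci CronJob; the suffix is
# assumed to be the 10-digit Unix timestamp of the scheduled run.
PATTERN = "cleanup-ci-??????????"

def cron_triggered(job_names):
    """Filter job names down to the cron-triggered cleanup jobs."""
    return [name for name in job_names if fnmatch.fnmatch(name, PATTERN)]
```

Feeding it the output of `kubectl get jobs` strips out manually created jobs whose names merely start with `cleanup-ci`.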
24 changes: 22 additions & 2 deletions py/kubeflow/testing/cleanup_ci.py
@@ -5,6 +5,8 @@
import logging
import os
import re
import retrying
import socket
import subprocess
import tempfile
import yaml
@@ -43,6 +45,11 @@ def is_match(name, patterns=None):

return False

def is_retryable_exception(exception):
"""Return True if we consider the exception retryable"""
# Socket errors look like temporary problems connecting to GCP.
return isinstance(exception, socket.error)

def cleanup_workflows(args):
# We need to load the kube config so that we can have credentials to
# talk to the APIServer.
@@ -347,6 +354,11 @@ def getAge(tsInRFC3339):
age = datetime.datetime.utcnow() - insert_time_utc
return age

@retrying.retry(stop_max_attempt_number=5,
                retry_on_exception=is_retryable_exception)
def execute_rpc(rpc):
"""Execute a Google RPC request with retries."""
return rpc.execute()

def cleanup_deployments(args): # pylint: disable=too-many-statements,too-many-branches
if not args.delete_script:
@@ -382,8 +394,16 @@ def cleanup_deployments(args): # pylint: disable=too-many-statements,too-many-br
else:
manifest_url = d["manifest"]
manifest_name = manifest_url.split("/")[-1]
manifest = manifests_client.get(
project=args.project, deployment=name, manifest=manifest_name).execute()

rpc = manifests_client.get(project=args.project,
deployment=name,
manifest=manifest_name)
try:
manifest = execute_rpc(rpc)
except socket.error as e:
logging.error("socket error prevented getting manifest %s", e)
# Try to continue with deletion rather than aborting.
continue

# Create a temporary directory to store the deployment.
manifest_dir = tempfile.mkdtemp(prefix="tmp" + name)
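The retry behavior the commit wires up through the `retrying` package can be sketched with a minimal stdlib stand-in (a simplified illustration, not the `retrying` API itself):

```python
import functools
import socket

def retry_on_socket_error(max_attempts=5):
    """Retry a callable when it raises socket.error, up to max_attempts tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except socket.error:
                    if attempt == max_attempts:
                        # Permanent failure: re-raise and let the caller
                        # decide whether to skip this resource and continue.
                        raise
        return wrapper
    return decorator
```

In `cleanup_deployments`, the caller catches the final `socket.error` and `continue`s to the next deployment rather than aborting the whole sweep.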
8 changes: 0 additions & 8 deletions test-infra/ks_app/components/cleanup-ci.libsonnet
@@ -44,14 +44,6 @@
name: "REPO_NAME",
value: "testing",
},
{
// TODO(jlewi): Stop setting PULL_NUMBER once the PR is merged.
// We had to set the PR number because when we initially created the
// job we had some changes to cleanup_ci.py that were part of the PR
// committing the job.
name: "PULL_NUMBER",
value: "300",
},
{
name: "PYTHONPATH",
value: "/src/kubeflow/testing/py",
