
Run all PRs on GH Actions VMs #4033

Merged

merged 9 commits into master on Feb 12, 2020

Conversation

kleimkuhler
Contributor

Motivation

Currently all pushes to the master branch, tags, and Linkerd org member PRs run
the kind_integration_host job on the same Packet host.

This means that parallel jobs spin up KinD clusters with unique names and
sandbox the tests so that they do not clash.

This is problematic for a few reasons:

  • There is a limit on the number of jobs we can run in parallel due to
    resource constraints.
  • Workflow cancellation and re-runs conflict when the cancelled run deletes
    its namespaces while the running one expects them to be present.
  • Running multiple KinD clusters has proven flaky, resulting in
    inconsistent timeouts and docker errors.

Solution

This change moves all KinD integration testing to GH Actions VMs. This is
currently what forked repository workflows do.

There is no longer a docker_pull job, as its responsibilities have been moved
into one of the kind_integration_tests steps.

The renamed kind_integration_tests job is responsible for all PR
workflows and has steps specific to forked and non-forked repositories.

Non-forked repository PRs

The Packet host is still used for building docker images as leveraging docker
layer caching is still valuable--a build can be as fast as 30 seconds compared
to ~12 minutes.

Loading the docker images into the KinD cluster on the GH Action VM is done by
saving the Packet host docker images as image archives, and loading those
directly into the local KinD cluster.
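In practice this amounts to streaming a `docker save` from the Packet host to the VM, then loading the archive with `kind load image-archive`. A hedged sketch of such a step — the step name, image name, and `DOCKER_ADDRESS` variable are illustrative assumptions, not the PR's exact contents:

```yaml
# Illustrative sketch only -- the real step names and image list differ.
- name: Load images into the local KinD cluster
  if: github.event_name != 'pull_request' || !github.event.pull_request.head.repo.fork
  run: |
    # Stream an image archive from the Packet host's docker daemon...
    ssh "$DOCKER_ADDRESS" docker save "gcr.io/linkerd-io/controller:$TAG" > controller.tar
    # ...and load it directly into the KinD cluster running on this VM.
    kind load image-archive controller.tar
```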

Forked repository PRs

docker_build has been sped up slightly by sending docker save processes to
the background.
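The backgrounding trick is plain shell: launch each `docker save` with `&` and `wait` for all of them before proceeding. A minimal runnable sketch, with a cheap placeholder standing in for the real `docker save` invocations (image names here are illustrative):

```shell
#!/usr/bin/env bash
set -eu

# Placeholder image list; the real workflow saves the linkerd images
# built during docker_build.
images="cli-bin controller proxy"
outdir=$(mktemp -d)

for img in $images; do
  # The real workflow would run something like (an assumption, not a quote):
  #   docker save "gcr.io/linkerd-io/$img:$TAG" -o "$img.tar" &
  # A placeholder subshell stands in so the control flow runs anywhere.
  ( sleep 0.1; echo "saved" > "$outdir/$img.tar" ) &
done

# Block until every backgrounded save finishes; without `wait` the job
# could move on before the archives are fully written.
wait

ls "$outdir"
```

The speedup comes purely from overlapping the saves; `wait` keeps the step correct by not letting the job advance early.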

Docker layer caching cannot be leveraged since there are no SSH secrets
available, so the artifact-upload/artifact-download actions introduced in
#TODO are still used.

Cleanup

This PR also includes some general cleanup such as:

  • Some job names have been renamed to better reflect their purpose or match
    the current naming pattern.
  • Environment variables are now set early in a job as a separate step when
    they were previously exported multiple times.
  • Indentation was really bothering me because it switches back and forth
    throughout the workflow file, so lists are now indented.

Signed-off-by: Kevin Leimkuhler <[email protected]>
@kleimkuhler kleimkuhler self-assigned this Feb 10, 2020
@kleimkuhler kleimkuhler requested review from alpeb and siggy February 10, 2020 16:08
Member

@alpeb alpeb left a comment


Thanks @kleimkuhler I'll be testing this thoroughly 😉
The indentation changes make the diff a little hard to read. Where was it inconsistent before? I've usually seen list elements in k8s manifests and elsewhere formatted without an extra indentation...

.github/workflows/workflow.yml
if: github.event_name != 'pull_request' || !github.event.pull_request.head.repo.fork
env:
PROXY_INIT_IMAGE_NAME: gcr.io/linkerd-io/proxy-init:v1.3.1
PROMETHEUS_IMAGE_NAME: prom/prometheus:v2.11.1
Member


We've been on prom v2.15.2 for a while; not sure how this has been working without updating this tag...

Contributor Author


Hm yea I'm not sure. I can update it, but maybe as a separate PR? I'd like to keep a version bump out of this if possible.

Member


This image name is for seeding the kind cluster docker caches prior to Linkerd installation. It makes CI (and Linkerd's bootup) faster. It should be kept in-sync with Linkerd, but worst case it just slows things down.

@kleimkuhler
Contributor Author

@alpeb When lists are the value of a dictionary I've tended to see them indented just like any other value would be. Another reason I changed this is that I've had to un-indent new lists in order to match what we currently have. Lastly, pretty much any YAML documentation I look at indents lists as well.

@alpeb
Member

alpeb commented Feb 10, 2020

@kleimkuhler it must be a kubernetes thing then. Yaml coming out of kubectl get -oyaml doesn't use indentation for lists, nor have we in our test golden files or Helm templates. I've also put together this snippet that uses https://github.com/kubernetes-sigs/yaml, the same lib that k8s and we rely on.

@kleimkuhler
Contributor Author

@alpeb Okay, interesting; yea, I can remove the indentation then.

Signed-off-by: Kevin Leimkuhler <[email protected]>
@kleimkuhler kleimkuhler requested a review from alpeb February 10, 2020 20:45
Comment on lines +208 to +209
. bin/_tag.sh
echo ::set-env name=TAG::$(CI_FORCE_CLEAN=1 bin/root-tag)
Member


It's possible to declare env vars at the top level that would be shared across all jobs. That might avoid repeating this in docker_build, kind_integration_tests and docker_push.

Member


But it might be better to try that after the workflow.yml split, so we don't have to add this back

Contributor Author


Yep, in the workflow.yml split I'd like to add a similar step for other jobs that set environment variables multiple times.

I'm not sure this can be set at the top level though since it relies on having checked out the code already. Unless you're suggesting that a step (after actions/checkout) can set a workflow level environment variable?

Member


Unless you're suggesting that a step (after actions/checkout) can set a workflow level environment variable?

Yea, I know you can declare a global env var, but I'm not sure if it's modifiable from within a job.

Contributor Author


Yea, that could be helpful to set early in a workflow. After the workflow files are split I can see if there is a more global ::set-env option, but I don't see one right now.
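For context, the per-job step under discussion looks roughly like the following; the step name is an assumption, while the two `run:` lines are quoted from the diff. `::set-env` was the GitHub Actions mechanism at the time (it has since been deprecated in favor of writing to `$GITHUB_ENV`):

```yaml
# The repeated per-job pattern. A top-level `env:` block cannot execute
# `bin/root-tag`, which is why each job re-declares this step after checkout.
- name: Set environment variables
  run: |
    . bin/_tag.sh
    echo ::set-env name=TAG::$(CI_FORCE_CLEAN=1 bin/root-tag)
```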

Signed-off-by: Kevin Leimkuhler <[email protected]>
Signed-off-by: Kevin Leimkuhler <[email protected]>
Signed-off-by: Kevin Leimkuhler <[email protected]>
@kleimkuhler
Contributor Author

This renamed `Go Format` to `Go format` and `Validate go deps` to `Go dependencies`, so those two required checks will not actually run. The required checks will need to be updated again.

@kleimkuhler kleimkuhler force-pushed the kleimkuhler/load-from-host branch from 4f4efd6 to 01d6889 on February 11, 2020 22:03
Member

@siggy siggy left a comment


this is awesome. a couple nits then 🚢 👍

@kleimkuhler
Contributor Author

I added back the docker_pull job that runs in addition to docker_build.

When I changed the version of proxy-init and prometheus images, the docker save .. command failed in kind_integration_tests because the new prometheus version was not pulled down on the packet host.

That job is still helpful to run!

@alpeb
Member

alpeb commented Feb 12, 2020

@kleimkuhler It appears kind_integration_tests is not running in forks because it now requires docker_pull, which only runs on master. Perhaps it would make sense to forgo docker_pull and instead put the "Ensure Packet has the correct proxy-init and prometheus versions" step right before kind_integration_tests's "Load cli-bin image into local docker images" step?

@alpeb
Member

alpeb commented Feb 12, 2020

... OTOH I'm not sure if pulling those images is really needed after all. The docker save commands will fail if the images are not there, which only happens right after our weekly docker cache cleanup, and they'll be automatically pulled and available for the following builds. How about just logging the failure and not having the build step fail?

@kleimkuhler
Contributor Author

@alpeb Yea, I agree. I changed the docker save .. commands to output STDERR to STDOUT but not fail the script. It should be good to go again.
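The change described here can be sketched in plain shell: redirect the save's STDERR to STDOUT and swallow the non-zero exit so a `set -e` script logs the failure but keeps going. A runnable sketch with a stand-in for `docker save` (the function body simulates the post-cache-cleanup case where the image is missing):

```shell
#!/usr/bin/env bash
set -eu

# Stand-in for `docker save "$1" -o "$1.tar"`; simulates the missing-image
# failure that occurs right after the weekly docker cache cleanup.
save_image() {
  echo "Error: No such image: $1" >&2
  return 1
}

# Send STDERR to STDOUT and tolerate the non-zero exit, so the failure is
# visible in the logs but the `set -e` script does not abort.
save_image "prom/prometheus:v2.15.2" 2>&1 || true

status="continued"
echo "docker save failed; the image will be pulled on the next build"
```

The `|| true` guard is what keeps the step from failing; without it, `set -e` would terminate the job at the first missing image.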

Member

@alpeb alpeb left a comment


Thanks @kleimkuhler! This looks great to me and it's testing fine in both PRs and master 🏆

@olix0r olix0r merged commit a460ada into master Feb 12, 2020
@olix0r olix0r deleted the kleimkuhler/load-from-host branch February 12, 2020 22:38
kleimkuhler added a commit that referenced this pull request Feb 13, 2020
Depends on #4033

## Motivation

If any job fails in the current GH Actions workflow, a re-run on the same
commit SHA requires re-running *all* jobs--regardless of whether the job
already passed in the previous run.

This can be problematic when dealing with flakiness in the integration tests.

If a test fails due to flakiness in `cloud_integration_tests`, all the unit
tests, static checks, and `kind_integration_tests` are re-run which leads to
lots of waiting and dealing with the possibility of flakiness in earlier jobs.

With this change, individual workflows can now be re-run without triggering
all other jobs to complete again first.

## Solution

`workflow.yml` is now split into:
- `static_checks.yml`
- `unit_tests.yml`
- `kind_integration.yml`
- `cloud_integration.yml`

### Workflows

`static_checks.yml` performs checks related to dependencies, linting, and
formatting.

`unit_tests.yml` performs the Go and JS unit tests.

`kind_integration.yml` builds the images (on Packet or the GH Action VM) and
runs the integration tests on a KinD cluster. This workflow continues to run
for **all** PRs and pushes to `master` and tags.

`cloud_integration.yml` builds the images only on Packet. This is because
forked repositories do not need to trigger this workflow. It then creates a
unique GKE cluster and runs the integration tests on the cluster.

### The actual flow of work..

A forked repository or non-forked repository opening a PR triggers:
- `static_checks`
- `unit_tests`
- `kind_integration_tests`

These workflows all run in parallel and are individually re-runnable.

A push to `master` or tags triggers:
- `static_checks`
- `unit_tests`
- `kind_integration_tests`
- `cloud_integration_tests`

These workflows also all run in parallel, including the `docker_build` step of
both integration test workflows. This has not conflicted in testing, as it
takes place on the same Packet host and makes good use of docker layer
caching.
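A sketch of how the split files could scope their triggers so `kind_integration.yml` runs everywhere while `cloud_integration.yml` never fires for forks — illustrative, not the exact trigger blocks from the PR:

```yaml
# kind_integration.yml -- all PRs plus pushes to master and tags (illustrative)
on:
  pull_request: {}
  push:
    branches: [master]
    tags: ['*']
```

`cloud_integration.yml` would drop the `pull_request` trigger and keep only the `push` block, which is what keeps forked repositories from triggering it.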

Signed-off-by: Kevin Leimkuhler <[email protected]>