ci(cache): only hash top-level yarn.lock for cache resolution #10035

mcous · 2022-04-21T17:44:46Z

Overview

We've been seeing periodic flakiness in cache upload from CI with the following failure:

Error: .github/workflows/app-test-build-deploy.yaml (Line: 120, Col: 16):
Error: The template is not valid. .github/workflows/app-test-build-deploy.yaml (Line: 120, Col: 16): hashFiles('**/yarn.lock') couldn't finish within 120 seconds.

Upon inspection on my own machine, **/yarn.lock matches a bunch of files after make setup-js is run, most of which have nothing to do with our dev environment:

❯ ls **/yarn.lock
app-shell/node_modules/combined-stream/yarn.lock
app-shell/node_modules/react-input-autosize/yarn.lock
app-shell/node_modules/uri-js/yarn.lock
node_modules/@cypress/request/node_modules/form-data/yarn.lock
node_modules/better-opn/yarn.lock
node_modules/browserify-zlib/yarn.lock
node_modules/combined-stream/yarn.lock
node_modules/copy-to-clipboard/yarn.lock
node_modules/detective-postcss/yarn.lock
node_modules/file-exists-dazinatorfork/yarn.lock
node_modules/microevent.ts/yarn.lock
node_modules/minimalcss/yarn.lock
node_modules/postcss-discard-overridden/yarn.lock
node_modules/react-input-autosize/yarn.lock
node_modules/request/node_modules/form-data/yarn.lock
node_modules/socks-proxy-agent/yarn.lock
node_modules/sumchecker/yarn.lock
node_modules/tiny-emitter/yarn.lock
node_modules/uri-js/yarn.lock
node_modules/worker-rpc/yarn.lock
yarn.lock

Except for our own yarn.lock, all those other lock files seem to be stuff those npm packages have accidentally published. They're not used by yarn when resolving/installing our development environment.

Unless hashFiles is smart enough to skip node_modules, we could be wasting CI time including calculating the hash of all of these files. There's a good chance node_modules isn't in play when the cache key is calculated (though GitHub Actions seems to be re-hashing in the post-cache step). Even so, though, the globstar would cause some file tree traversal, which could be bad on macOS Ci runners that are known to occasionally have slow filesystem access.

This PR replaces hashFiles('**/yarn.lock') with hashFiles('yarn.lock'), which I hope could avoid hitting this 2 minute hash timeout.

Updates after publishing the PR

Upon investigation of previous CI runs and those triggered by this PR, this is my working theory of the timeline of the failure:

Cache step runs
1. Workflow YAML file is read, hashFiles is called
2. node_modules is not yet present, so only top-level yarn.lock gets hashed, producing the correct hash key
3. Cache restored
CI runs
Post-cache step runs
1. Workflow YAML file is read again for some reason, hashFiles is called again
2. Post-cache checks that the cache was hit in step (1), declines to upload new cache
3. Newly computed cache key is thrown away because it wasn't actually needed

Evidence / example runs:

The cache keys produced by this PR are identical to edge, indicating that the initial file hashing was only grabbing the single yarn.lock
edge post-cache on macOS takes ~1 minute simply to skip the upload due to cache hit
- Since this job isn't doing anything, and the failure we've been seeing was due to hashFiles timeout, it must be running hashFiles in the post-cache step, even though it isn't necessary
- https://github.com/Opentrons/opentrons/runs/6103940404?check_suite_focus=true
ci_hash-top-yarn-lock post-cache on macOS takes 0s to do the same thing (skip cache upload due to cache hit)
- https://github.com/Opentrons/opentrons/runs/6116965518?check_suite_focus=true

Changelog

ci(cache): only hash top-level yarn.lock for cache resolution

Review requests

CI is green
My "facts" above are actually true

Risk assessment

There are two failure modes that I can think of:

We bust the cache too infrequently, leading to builds pulling in state caches (more likely, given the change)
We bust the cache too much, resulting in longer CI times

(1) can result in some pretty pernicious stuff, so we should be extra sure that my assessment of the situation is correct.

codecov · 2022-04-21T17:48:35Z

Codecov Report

Merging #10035 (5bcdddf) into edge (1cdd8d3) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             edge   #10035   +/-   ##
=======================================
  Coverage   74.68%   74.68%           
=======================================
  Files        2108     2108           
  Lines       55336    55336           
  Branches     5620     5620           
=======================================
  Hits        41326    41326           
  Misses      12872    12872           
  Partials     1138     1138

Flag	Coverage Δ
app	`71.42% <ø> (ø)`
components	`52.80% <ø> (ø)`
labware-library	`49.74% <ø> (ø)`
protocol-designer	`44.88% <ø> (ø)`
react-api-client	`80.76% <ø> (ø)`
shared-data	`85.02% <ø> (ø)`
step-generation	`86.49% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

b-cooper

Changes look good and the logic makes sense to me 🦞 Thanks for looking into this!

ci(cache): only hash top-level yarn.lock for cache resolution

5bcdddf

mcous added the chore label Apr 21, 2022

mcous requested a review from a team as a code owner April 21, 2022 17:44

b-cooper approved these changes Apr 21, 2022

View reviewed changes

mcous merged commit 76cab37 into edge Apr 21, 2022

mcous deleted the ci_hash-top-yarn-lock branch April 21, 2022 21:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci(cache): only hash top-level yarn.lock for cache resolution #10035

ci(cache): only hash top-level yarn.lock for cache resolution #10035

mcous commented Apr 21, 2022 •

edited

Loading

codecov bot commented Apr 21, 2022 •

edited

Loading

b-cooper left a comment

ci(cache): only hash top-level yarn.lock for cache resolution #10035

ci(cache): only hash top-level yarn.lock for cache resolution #10035

Conversation

mcous commented Apr 21, 2022 • edited Loading

Overview

Updates after publishing the PR

Changelog

Review requests

Risk assessment

codecov bot commented Apr 21, 2022 • edited Loading

Codecov Report

b-cooper left a comment

Choose a reason for hiding this comment

mcous commented Apr 21, 2022 •

edited

Loading

codecov bot commented Apr 21, 2022 •

edited

Loading