A modest ccache proposal #54558

akrieger · 2022-01-18T21:17:06Z

Summary

None

Purpose of change

The current ccache configuration sorta mostly works, but there's been headaches around ccache internal eviction because of the artificially low limit we set to avoid gha cache eviction. ccache internally shards its cache into 16 buckets, and applies a limit of /16 on size and file counts. As a result, a 1GB ccache limit means a single bucket can only hold 64MB of objects. For the basic build, the PCH file by itself is 60MB, so it gets trivially evicted if anything else shards into its bucket. We see this manifesting as pessimal ccache performance today. Somehow we need to raise the ccache cache limits without also creating eviction issues for our useful gha cache artifacts.

Describe the solution

This is a fusion of earlier ideas when brainstorming how to fix the ccache issues. A key insight I didn't have before is all PR runs rebase onto master before running, which means it's desirable to have as fresh artifacts from master as possible.

Configure CCACHE_SLOPPINESS in the Makefile to include include_file_ctime,include_file_mtime which is categorically useful and could be it's own PR, but we're here now.
Configure master builds to run the matrix completely parallel, but skipping running tests. The idea is to write to cache as fast as possible after a merge. This is done by using an out-of-matrix separate job to set some job outputs which are values to pass to the matrix job configuration, because of the limited power of github action expression syntax.
Reconfigure the gha ccache cache key. Revert the cache key to using a timestamp including hours/minutes, which basically ensures every run will have a unique cachekey. Insert ${{ github.ref_name }} into the primary cache key. This is so when a PR writes to cache, it writes to ccache-PR-<config>, and master writes to ccache-master-<config>. This is important, because lastly, we set restore-keys for all runs to lookup ccache-master-<config>-, which means that when a PR misses it'll only restore from master caches, and not its own internal caches. Why is that important? Because:
PRs will ~~eat their children~~ clear ccache after building. That way when the PR run uploads to cache, it'll upload an empty bundle with a miniscule number of bytes in it, likely avoiding evicting any useful caches. Because of the restore-keys hack from above, this will never get 'hit' and thus will be first on the chopping block for eviction down the line, for however few bytes it'll reclaim.
Double the ccache size limits. This should avoid ccache internally evicting the pch file it generates during the same build, which means it'll hit next time and not invalidate the whole build. Depending on performance this will probably give us good leeway for increasing the sizes further because we don't care if master builds evict older master builds.
And bonus extra, tweak the ccache environment vars, the most important being setting CCACHE_NOCOMPRESS: false globally. This only practically matters for the macOS build which would use zstd internally, but the zstd settings we use mean that compressing the entire ccache archive at once in the actions/cache upload step will use less total gha cache space, on the order of 33% less (150->100mb). In the future we could experiment with setting this to true globally, but older ccache's on the ubuntu runners use zlib instead of zstd and will perform far worse.

Describe alternatives you've considered

Disabling PCH for at least the basic build, the one with the most problems. It's an orthogonal solution and doesn't globally address the daily ccache increasing-staleness and continual potential for eviction of the master caches from a slew of PRs coming in at once.

Testing

Pushed to my own master at https://github.com/akrieger/Cataclysm-DDA/actions/runs/1714292760?check_suite_focus=true, which ran all the builds in parallel and uploaded them all.
Create a PR against my own repo at https://github.com/akrieger/Cataclysm-DDA/actions/runs/1714524501?check_suite_focus=true, which ran the general matrix serially and had full cache hits on the basic build (still took 10 minutes due to linking) and on the clang12 build right after (only took 44 seconds for whole build step). Verified the basic build uploaded a zero size cache (4kb, actually).
Verified macOS ccache compression behavior & action cache upload size at https://github.com/akrieger/Cataclysm-DDA/runs/4858709134?check_suite_focus=true

Additional context

The gcc11 ASAN build is still cursed. It is still suffering from 100% cache misses. I'm confident this is not related to cache size issues because the builds which upload to cache are not evicting any artifacts in a clean build.

[It] should be the subtitle for ccache: still no cure for linking
-- kevingranade

https://en.wikipedia.org/wiki/A_Modest_Proposal

…mtime`

.github/workflows/matrix.yml

kevingranade · 2022-01-19T02:05:13Z

I was unclear about what I meant by, "with the generation count thing" on discord.

What I meant was:
In master builds
grab the timestamp as usual.
Use that timestamp as part of the cache key. Restore-keys is just missing the timestamp.
NEW write timestamp as late as possible in the build because there's a race condition to an artifact with actions/upload-artifact like at

Cataclysm-DDA/.github/workflows/matrix.yml

Line 268 in 7468747

- uses: actions/upload-artifact@v2

In pr builds
Get the most recent artifact with a timestamp using action-doemwnload-artifact as seen in

Cataclysm-DDA/.github/workflows/test_labeler.yml

Line 25 in 7468747

- name: Download success artifact from basic build

Use that timestamp in the primary key, restore-keys is just missing the timestamp
There's a race condition where master writes an artifact before uploading its cache archive, so we still need to wipe the cache to prevent evictions, but it should be super rare.

akrieger · 2022-01-19T02:15:34Z

I don't understand the race condition 'to an artifact' you are describing. Even so, I don't see any races here. It's intentionally structured to always fail the primary cache lookup, on purpose, and rely on restore-keys to hit the most recent master cache artifact available as per the documentation of how restore keys works with multiple matching entries.

restore-keys always goes to master because it's hard coded to use master, even on PRs.

.github/workflows/matrix.yml

see: CleverRaven/Cataclysm-DDA#54558 Co-authored-by: Andrew Krieger <[email protected]>

* ci: remove EXTRA_TEST_OPTS not used ATM * ci: install SDL dependencies only for tiles build see: CleverRaven/Cataclysm-DDA#55088 Co-authored-by: Brett Dong <[email protected]> * ci: order title first for readability * ci: limit resource usages on repeated pushes see: - CleverRaven/Cataclysm-DDA#54306 - CleverRaven/Cataclysm-DDA#54617 Co-authored-by: Kevin Granade <[email protected]> * ci: add `ccache_limit` and `ccache_key` * ci: correctly cache ccache see: - CleverRaven/Cataclysm-DDA#54078 - CleverRaven/Cataclysm-DDA#64553 Co-authored-by: Kevin Granade <[email protected]> Co-authored-by: Brett Dong <[email protected]> * ci: ccache stats and cleanup see: CleverRaven/Cataclysm-DDA#64553 Co-authored-by: Brett Dong <[email protected]> * ci: write ccache from upload only see: CleverRaven/Cataclysm-DDA#54558 Co-authored-by: Andrew Krieger <[email protected]> * ci: better performance ccache options see: CleverRaven/Cataclysm-DDA@aaffb31 Co-authored-by: Andrew Krieger <[email protected]> * ci: build only and test only script see: CleverRaven/Cataclysm-DDA#54435 Co-authored-by: Brett Dong <[email protected]> Co-authored-by: Alexey <[email protected]> * ci: add some line breaks my eyes hurt * ci: skip via either code or data see: - CleverRaven/Cataclysm-DDA@5707742 - CleverRaven/Cataclysm-DDA@f6ec25f Co-authored-by: Kevin Granade <[email protected]> Co-authored-by: Andrew Krieger <[email protected]> * ci: separate build and test step see: CleverRaven/Cataclysm-DDA#54435 Co-authored-by: Brett Dong <[email protected]> * ci: bump actions/upload-artifact to v3 see: CleverRaven/Cataclysm-DDA#64962 Co-authored-by: casswedson <[email protected]> * ci: ignore upload fail after build fails see: CleverRaven/Cataclysm-DDA@b7dc997 Co-authored-by: John Bytheway <[email protected]> * ci: emit success artifact * fix: use num_jobs correctly * Update .github/workflows/matrix.yml Co-authored-by: Olanti <[email protected]> * Update .github/workflows/matrix.yml Co-authored-by: Olanti <[email protected]> * ci: remove object-creator * ci: add run-on-draft flag * ci: match languages in translation test * ci: use `upload-artifact` * Update build-scripts/gha_compile_only.sh Co-authored-by: Olanti <[email protected]> * test: fix catch2 tag format * test: mark overmap lab generation test as flaky * test: mark ranged aiming test as flaky * test: use anonymous namespace * test: adjust grenade lethality to match BN value * ci: remove gh push trigger it didn't work * ci: reduce cache limit github cache limit is 10GB * Revert "test: mark overmap lab generation test as flaky" This reverts commit 71f6840. * ci: does it work? * ci: remove `FRAMEWORK` option * ci: fix restore key --------- Co-authored-by: Brett Dong <[email protected]> Co-authored-by: Kevin Granade <[email protected]> Co-authored-by: Andrew Krieger <[email protected]> Co-authored-by: Alexey <[email protected]> Co-authored-by: casswedson <[email protected]> Co-authored-by: John Bytheway <[email protected]> Co-authored-by: Olanti <[email protected]>

github-actions bot added json-styled JSON lint passed, label assigned by github actions astyled astyled PR, label is assigned by github actions labels Jan 18, 2022

akrieger and others added 4 commits January 18, 2022 14:15

Extend ccache sloppiness to include `include_file_ctime,include_file_…

8c03882

…mtime`

Master general build matrix should be fully parallel but skip tests.

f6ec25f

Write ccache from master builds only, always fetch from master on PRs

35a7895

Tweak ccache behavior for better performance.

aaffb31

Maleclypse added the Code: Build Issues regarding different builds and build environments label Jan 18, 2022

akrieger force-pushed the modest_ccache_proposal branch from b0c8966 to aaffb31 Compare January 18, 2022 22:40

kevingranade reviewed Jan 19, 2022

View reviewed changes

.github/workflows/matrix.yml Show resolved Hide resolved

akrieger commented Jan 19, 2022

View reviewed changes

.github/workflows/matrix.yml Show resolved Hide resolved

kevingranade merged commit 840220f into CleverRaven:master Jan 19, 2022

github-actions bot added the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 19, 2022

akrieger deleted the modest_ccache_proposal branch January 19, 2022 17:52

scarf005 added a commit to scarf005/Cataclysm-BN that referenced this pull request Sep 25, 2023

ci: write ccache from upload only

8aee06b

see: CleverRaven/Cataclysm-DDA#54558 Co-authored-by: Andrew Krieger <[email protected]>

scarf005 added a commit to scarf005/Cataclysm-BN that referenced this pull request Sep 27, 2023

ci: write ccache from upload only

ef209e7

see: CleverRaven/Cataclysm-DDA#54558 Co-authored-by: Andrew Krieger <[email protected]>

scarf005 added a commit to scarf005/Cataclysm-BN that referenced this pull request Oct 1, 2023

ci: write ccache from upload only

e4d1308

see: CleverRaven/Cataclysm-DDA#54558 Co-authored-by: Andrew Krieger <[email protected]>

scarf005 added a commit to scarf005/Cataclysm-BN that referenced this pull request Oct 3, 2023

ci: write ccache from upload only

214ecc6

see: CleverRaven/Cataclysm-DDA#54558 Co-authored-by: Andrew Krieger <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A modest ccache proposal #54558

A modest ccache proposal #54558

akrieger commented Jan 18, 2022 •

edited

Loading

kevingranade commented Jan 19, 2022

akrieger commented Jan 19, 2022 •

edited

Loading

A modest ccache proposal #54558

A modest ccache proposal #54558

Conversation

akrieger commented Jan 18, 2022 • edited Loading

Summary

Purpose of change

Describe the solution

Describe alternatives you've considered

Testing

Additional context

kevingranade commented Jan 19, 2022

akrieger commented Jan 19, 2022 • edited Loading

akrieger commented Jan 18, 2022 •

edited

Loading

akrieger commented Jan 19, 2022 •

edited

Loading