Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A modest ccache proposal #54558

Merged
merged 4 commits into from
Jan 19, 2022

Conversation

akrieger
Copy link
Member

@akrieger akrieger commented Jan 18, 2022

Summary

None

Purpose of change

The current ccache configuration sorta mostly works, but there's been headaches around ccache internal eviction because of the artificially low limit we set to avoid gha cache eviction. ccache internally shards its cache into 16 buckets, and applies a limit of /16 on size and file counts. As a result, a 1GB ccache limit means a single bucket can only hold 64MB of objects. For the basic build, the PCH file by itself is 60MB, so it gets trivially evicted if anything else shards into its bucket. We see this manifesting as pessimal ccache performance today. Somehow we need to raise the ccache cache limits without also creating eviction issues for our useful gha cache artifacts.

Describe the solution

This is a fusion of earlier ideas when brainstorming how to fix the ccache issues. A key insight I didn't have before is all PR runs rebase onto master before running, which means it's desirable to have as fresh artifacts from master as possible.

  1. Configure CCACHE_SLOPPINESS in the Makefile to include include_file_ctime,include_file_mtime which is categorically useful and could be it's own PR, but we're here now.
  2. Configure master builds to run the matrix completely parallel, but skipping running tests. The idea is to write to cache as fast as possible after a merge. This is done by using an out-of-matrix separate job to set some job outputs which are values to pass to the matrix job configuration, because of the limited power of github action expression syntax.
  3. Reconfigure the gha ccache cache key. Revert the cache key to using a timestamp including hours/minutes, which basically ensures every run will have a unique cachekey. Insert ${{ github.ref_name }} into the primary cache key. This is so when a PR writes to cache, it writes to ccache-PR-<config>, and master writes to ccache-master-<config>. This is important, because lastly, we set restore-keys for all runs to lookup ccache-master-<config>-, which means that when a PR misses it'll only restore from master caches, and not its own internal caches. Why is that important? Because:
  4. PRs will eat their children clear ccache after building. That way when the PR run uploads to cache, it'll upload an empty bundle with a miniscule number of bytes in it, likely avoiding evicting any useful caches. Because of the restore-keys hack from above, this will never get 'hit' and thus will be first on the chopping block for eviction down the line, for however few bytes it'll reclaim.
  5. Double the ccache size limits. This should avoid ccache internally evicting the pch file it generates during the same build, which means it'll hit next time and not invalidate the whole build. Depending on performance this will probably give us good leeway for increasing the sizes further because we don't care if master builds evict older master builds.
  6. And bonus extra, tweak the ccache environment vars, the most important being setting CCACHE_NOCOMPRESS: false globally. This only practically matters for the macOS build which would use zstd internally, but the zstd settings we use mean that compressing the entire ccache archive at once in the actions/cache upload step will use less total gha cache space, on the order of 33% less (150->100mb). In the future we could experiment with setting this to true globally, but older ccache's on the ubuntu runners use zlib instead of zstd and will perform far worse.

Describe alternatives you've considered

  • Disabling PCH for at least the basic build, the one with the most problems. It's an orthogonal solution and doesn't globally address the daily ccache increasing-staleness and continual potential for eviction of the master caches from a slew of PRs coming in at once.

Testing

Additional context

The gcc11 ASAN build is still cursed. It is still suffering from 100% cache misses. I'm confident this is not related to cache size issues because the builds which upload to cache are not evicting any artifacts in a clean build.

[It] should be the subtitle for ccache: still no cure for linking
-- kevingranade

https://en.wikipedia.org/wiki/A_Modest_Proposal

@github-actions github-actions bot added json-styled JSON lint passed, label assigned by github actions astyled astyled PR, label is assigned by github actions labels Jan 18, 2022
@Maleclypse Maleclypse added the Code: Build Issues regarding different builds and build environments label Jan 18, 2022
@akrieger akrieger force-pushed the modest_ccache_proposal branch from b0c8966 to aaffb31 Compare January 18, 2022 22:40
@kevingranade
Copy link
Member

I was unclear about what I meant by, "with the generation count thing" on discord.

What I meant was:
In master builds
grab the timestamp as usual.
Use that timestamp as part of the cache key. Restore-keys is just missing the timestamp.
NEW write timestamp as late as possible in the build because there's a race condition to an artifact with actions/upload-artifact like at

- uses: actions/upload-artifact@v2

In pr builds
Get the most recent artifact with a timestamp using action-doemwnload-artifact as seen in

- name: Download success artifact from basic build

Use that timestamp in the primary key, restore-keys is just missing the timestamp
There's a race condition where master writes an artifact before uploading its cache archive, so we still need to wipe the cache to prevent evictions, but it should be super rare.

@akrieger
Copy link
Member Author

akrieger commented Jan 19, 2022

I don't understand the race condition 'to an artifact' you are describing. Even so, I don't see any races here. It's intentionally structured to always fail the primary cache lookup, on purpose, and rely on restore-keys to hit the most recent master cache artifact available as per the documentation of how restore keys works with multiple matching entries.

restore-keys always goes to master because it's hard coded to use master, even on PRs.

@kevingranade kevingranade merged commit 840220f into CleverRaven:master Jan 19, 2022
@github-actions github-actions bot added the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 19, 2022
@akrieger akrieger deleted the modest_ccache_proposal branch January 19, 2022 17:52
scarf005 added a commit to scarf005/Cataclysm-BN that referenced this pull request Sep 25, 2023
scarf005 added a commit to scarf005/Cataclysm-BN that referenced this pull request Sep 27, 2023
scarf005 added a commit to scarf005/Cataclysm-BN that referenced this pull request Oct 1, 2023
scarf005 added a commit to scarf005/Cataclysm-BN that referenced this pull request Oct 3, 2023
github-merge-queue bot pushed a commit to cataclysmbnteam/Cataclysm-BN that referenced this pull request Oct 13, 2023
* ci: remove EXTRA_TEST_OPTS

not used ATM

* ci: install SDL dependencies only for tiles build

see: CleverRaven/Cataclysm-DDA#55088

Co-authored-by: Brett Dong <[email protected]>

* ci: order title first

for readability

* ci: limit resource usages on repeated pushes

see:
- CleverRaven/Cataclysm-DDA#54306
- CleverRaven/Cataclysm-DDA#54617

Co-authored-by: Kevin Granade <[email protected]>

* ci: add `ccache_limit`  and `ccache_key`

* ci: correctly cache ccache

see:
- CleverRaven/Cataclysm-DDA#54078
- CleverRaven/Cataclysm-DDA#64553

Co-authored-by: Kevin Granade <[email protected]>
Co-authored-by: Brett Dong <[email protected]>

* ci: ccache stats and cleanup

see: CleverRaven/Cataclysm-DDA#64553

Co-authored-by: Brett Dong <[email protected]>

* ci: write ccache from upload only

see: CleverRaven/Cataclysm-DDA#54558

Co-authored-by: Andrew Krieger <[email protected]>

* ci: better performance ccache options

see: CleverRaven/Cataclysm-DDA@aaffb31

Co-authored-by: Andrew Krieger <[email protected]>

* ci: build only and test only script

see: CleverRaven/Cataclysm-DDA#54435

Co-authored-by: Brett Dong <[email protected]>
Co-authored-by: Alexey <[email protected]>

* ci: add some line breaks

my eyes hurt

* ci: skip via either code or data

see:
- CleverRaven/Cataclysm-DDA@5707742
- CleverRaven/Cataclysm-DDA@f6ec25f

Co-authored-by: Kevin Granade <[email protected]>
Co-authored-by: Andrew Krieger <[email protected]>

* ci: separate build and test step

see: CleverRaven/Cataclysm-DDA#54435

Co-authored-by: Brett Dong <[email protected]>

* ci: bump actions/upload-artifact to v3

see: CleverRaven/Cataclysm-DDA#64962

Co-authored-by: casswedson <[email protected]>

* ci: ignore upload fail after build fails

see: CleverRaven/Cataclysm-DDA@b7dc997

Co-authored-by: John Bytheway <[email protected]>

* ci: emit success artifact

* fix: use num_jobs correctly

* Update .github/workflows/matrix.yml

Co-authored-by: Olanti <[email protected]>

* Update .github/workflows/matrix.yml

Co-authored-by: Olanti <[email protected]>

* ci: remove object-creator

* ci: add run-on-draft flag

* ci: match languages in translation test

* ci: use `upload-artifact`

* Update build-scripts/gha_compile_only.sh

Co-authored-by: Olanti <[email protected]>

* test: fix catch2 tag format

* test: mark overmap lab generation test as flaky

* test: mark ranged aiming test as flaky

* test: use anonymous namespace

* test: adjust grenade lethality to match BN value

* ci: remove gh push trigger

it didn't work

* ci: reduce cache limit

github cache limit is 10GB

* Revert "test: mark overmap lab generation test as flaky"

This reverts commit 71f6840.

* ci: does it work?

* ci: remove `FRAMEWORK` option

* ci: fix restore key

---------

Co-authored-by: Brett Dong <[email protected]>
Co-authored-by: Kevin Granade <[email protected]>
Co-authored-by: Andrew Krieger <[email protected]>
Co-authored-by: Alexey <[email protected]>
Co-authored-by: casswedson <[email protected]>
Co-authored-by: John Bytheway <[email protected]>
Co-authored-by: Olanti <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
astyled astyled PR, label is assigned by github actions BasicBuildPassed This PR builds correctly, label assigned by github actions Code: Build Issues regarding different builds and build environments json-styled JSON lint passed, label assigned by github actions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants