
configuration option for maximum dune cache size #8274

Open

brycenichols opened this issue Jul 25, 2023 · 16 comments

@brycenichols

Desired Behavior

We would like dune to have a config option for a maximum cache size. The limit would be enforced as builds proceed, so that the user need not worry about how quickly the cache grows or maintain out-of-band processes to trim it periodically.
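
For concreteness, this is the kind of thing we have in mind in ~/.config/dune/config. The cache-max-size field does not exist today; the name and syntax are purely illustrative:

```
(lang dune 3.0)
(cache enabled)
;; hypothetical field, for illustration only:
(cache-max-size 50GB)
```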

Motivating use case

At Ahrefs, we have a pool of persistent build hosts that handle many build jobs in parallel, taking advantage of a persistent git monorepo clone and of build files cached across job invocations. We run a buildkite agent on every build host. Besides re-using the state of the environment after some setup steps, we benefit from the dune cache. The problem is that the cache grows without bound unless it is trimmed periodically. While we could add a build step before or after each job to run the trim operation, that would eliminate much of the speed benefit of using the cache: trimming to 50GB can take a couple of minutes, which adds directly to the total time to complete the build job.

To deal with an ever-expanding cache without adding time-consuming steps to our pipeline, we currently schedule a dedicated trim pipeline to run on all known agents. The issue is that there is no way to know the appropriate schedule for these out-of-band processes, and they end up periodically blocking availability of the agents, particularly when load is high and they are most needed. Furthermore, we have found the step of querying and/or maintaining the list of agents to be brittle. Buildkite, like similar build-scheduling tools, is designed to hand out work to any one available agent, not to script something to run on all of them. Agents may be added, disconnected, disabled, or re-enabled at any time, so it doesn't make sense to run any given process across the whole pool.

This brings us to this request. The ideal (and in the spirit of the original design of the dune cache) would be for trimming to happen as needed, in real time, so that the user is guaranteed an upper bound on the cache size without having to run the trim process and incur its cost each time.

@Alizter
Collaborator

Alizter commented Aug 2, 2023

Unfortunately, a lot of the work that goes into dune cache trim is actually just calculating the size; you can see this by running dune cache size. So checking against an upper limit would have to do this slow work anyway.
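
To illustrate the cost: computing the size is essentially a full recursive walk that stats every file and deduplicates hard links. A minimal OCaml sketch of such a walk (illustrative only, not dune's actual code):

```ocaml
(* Every file must be stat'ed, and hard links deduplicated by
   (device, inode), so the cost is a full traversal of the cache dir. *)
module Seen = Set.Make (struct
  type t = int * int (* (st_dev, st_ino) *)
  let compare = compare
end)

let rec size_of ~seen path =
  let st = Unix.lstat path in
  match st.Unix.st_kind with
  | Unix.S_DIR ->
    Array.fold_left
      (fun (seen, acc) entry ->
        let seen, sz = size_of ~seen (Filename.concat path entry) in
        (seen, acc + sz))
      (seen, 0)
      (Sys.readdir path)
  | Unix.S_REG ->
    let key = (st.Unix.st_dev, st.Unix.st_ino) in
    if Seen.mem key seen then (seen, 0) (* hard link already counted *)
    else (Seen.add key seen, st.Unix.st_size)
  | _ -> (seen, 0)

let cache_size root = snd (size_of ~seen:Seen.empty root)
```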

Rather than adding slow size checks to dune build itself, this might be better served by a dedicated shared cache server. Once we have a server handling a distributed cache, it could manage the size better than we currently do.

cc @rgrinberg

@brycenichols
Author

I was curious and took a look at what ccache does. From what I read (https://ccache.dev/manual/4.3.html#_cache_size_management), it maintains counters for the size and number of cached files in each of the 16 subdirectories of the cache. The stated reason for using multiple counter files is performance and concurrency.
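
As a rough OCaml sketch of that scheme (layout and names are illustrative, not ccache's or dune's; locking and atomic writes are omitted):

```ocaml
(* Shards are keyed by the first hex digit of the content hash; each shard
   directory holds a small stats file with its total size in bytes, so
   reading the overall size never requires walking the tree. *)
let stats_file cache_root hash =
  Filename.concat cache_root (Printf.sprintf "%c/stats" hash.[0])

let read_counter file =
  if Sys.file_exists file then (
    let ic = open_in file in
    let n = int_of_string (input_line ic) in
    close_in ic;
    n)
  else 0

(* Called with a positive delta when an entry is added, negative when one
   is evicted. A real implementation would lock and write atomically. *)
let bump_counter file delta =
  let n = read_counter file + delta in
  let oc = open_out file in
  output_string oc (string_of_int n ^ "\n");
  close_out oc

(* Total size: 16 small file reads instead of a full directory walk. *)
let total_size cache_root =
  List.fold_left
    (fun acc shard -> acc + read_counter (stats_file cache_root shard))
    0
    (List.init 16 (Printf.sprintf "%x"))
```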

@Alizter
Collaborator

Alizter commented Aug 3, 2023

cc @snowleopard any opinion on this?

@snowleopard
Collaborator

We've recently implemented some eager cache trimming in Jenga. It's not exactly what is being asked for here, but it is along the lines of making trimming happen during builds, as opposed to on a schedule.

Personally, I'd welcome some work in this direction, though ideally it would happen after our internal migration from Jenga to Dune is complete (maybe in a few months), as I expect any changes around caching to be pretty disruptive right now.

@emillon
Collaborator

emillon commented Oct 2, 2023

I started having a look at this.
There are definitely design decisions to make, because the naive solution of calling dune cache trim at the end of the build isn't going to cut it.

  • Having an estimate of the cache size is very useful. We could store it in a special file, updated whenever a full iteration over the FS happens (manual trims, or when an automatic one actually triggers). Because of hard links, it might be difficult to keep this up to date on every build.
  • I wonder if there are platform- or FS-specific APIs we could use for size estimation.
  • We need to think about what happens when the cache is full. We probably don't need to try very hard to reclaim space file by file to keep the cache as full as possible; rather, some form of hysteresis would be useful: when the cache hits a certain threshold (say it grows to 10.1GB where the limit is 10GB), instead of removing the bare minimum to get below the limit, do a larger cleanup down to a lower watermark (say 8GB). See the sketch after this list.
  • The idea of having sub-caches is neat because it makes estimation cheaper.
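
A tiny OCaml sketch of the hysteresis point above (the thresholds, estimated_size, and trim_to are placeholders, not an actual dune API):

```ocaml
(* Act only when the estimate crosses the limit, then trim well below it,
   so that trims stay infrequent. Placeholders throughout. *)
let maybe_trim ~limit_bytes ~estimated_size ~trim_to =
  let low_watermark = limit_bytes * 8 / 10 in (* e.g. 8GB for a 10GB limit *)
  if estimated_size () > limit_bytes then trim_to low_watermark
```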

@Alizter
Collaborator

Alizter commented Oct 2, 2023

@emillon On Unix we have du -sh which should do the job nicely. There is a Windows equivalent I can elaborate on if you would like. They would both give a good estimate of the size of the cache.

If a cache limit is set, we could try running these commands when dune exits (not sure about watch mode). Dune would then do a rough comparison and tell the user to run dune cache trim (no options).

We would then make dune cache trim do something clever, like trimming the cache to 75% of the max limit.
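
Roughly, the exit-time check could look like this OCaml sketch (it assumes GNU du -sb for a byte count; parsing and error handling are simplified, and the 75% figure follows the suggestion above):

```ocaml
(* Estimate the cache size with du, compare against the configured limit,
   and suggest trimming to 75% of it. *)
let cache_size_bytes cache_dir =
  let ic = Unix.open_process_in ("du -sb " ^ Filename.quote cache_dir) in
  let line = input_line ic in
  ignore (Unix.close_process_in ic);
  (* du prints "<bytes>\t<path>" *)
  Int64.of_string (List.hd (String.split_on_char '\t' line))

let check_on_exit ~cache_dir ~limit_bytes =
  let size = cache_size_bytes cache_dir in
  if size > limit_bytes then
    Printf.eprintf
      "Cache is %Ld bytes, over the %Ld limit; run 'dune cache trim --size=%Ld'\n"
      size limit_bytes
      Int64.(div (mul limit_bytes 3L) 4L)
```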

@emillon
Collaborator

emillon commented Oct 2, 2023

Sorry, for that part I meant that there might be FS-specific operations that give estimates without calling stat on the whole hierarchy (which is what du -sh does): think pg_total_relation_size vs. select count(*).

@Alizter
Collaborator

Alizter commented Oct 2, 2023

We could cache the stat calls that du does, which would give a good speedup between runs. Windows also has search indexing, which we could use there.

@rgrinberg
Member

The right fix is probably to keep track of the size of the cache as we're writing to it. That's an invasive change that we shouldn't undertake at the moment.

In the meantime, I would suggest that we implement eager cache trimming and see how far that gets us.

@pmwhite do you think you could import eager cache trimming?

@pmwhite
Collaborator

pmwhite commented Oct 3, 2023

Yeah, once it is implemented, I can try importing it.

@rgrinberg
Member

Unless I'm misremembering, I think it's already implemented.

@emillon
Collaborator

emillon commented Oct 3, 2023

Yes, I was about to ask for clarification about what is meant by "eager cache trimming".

@snowleopard
Collaborator

snowleopard commented Oct 5, 2023

We implemented the "eager cache trimming" feature in Jenga internally. When Jenga runs an action, it deletes the previous versions of the action's targets from the cache, if they are unused in other workspaces. It works pretty well in practice, especially if you keep tweaking a test over and over (in which case you often end up with dozens of old versions of the test-runner binary in the cache).
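
As a rough illustration of the mechanism (not Jenga's actual code): with a hard-link-based cache, "unused in other workspaces" can be checked via the link count, since a count of 1 means only the cache itself still references the file.

```ocaml
(* Illustrative sketch only: delete a superseded cache entry if no build
   tree still hard-links it. st_nlink = 1 means the cache holds the only
   remaining link, so removing it actually reclaims the space. *)
let remove_if_unused cache_entry_path =
  match Unix.lstat cache_entry_path with
  | { Unix.st_kind = Unix.S_REG; st_nlink = 1; _ } ->
    Unix.unlink cache_entry_path
  | _ -> ()
  | exception Unix.Unix_error (Unix.ENOENT, _, _) -> () (* already gone *)
```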

We plan to implement "eager cache trimming" in Dune too and upstream it in the next month or so.

@ElectreAAS
Collaborator

We plan to implement "eager cache trimming" in Dune too and upstream it in the next month or so.

Any news on this? Or on the more general question of bounding cache disk usage?

@snowleopard
Collaborator

snowleopard commented Dec 7, 2024

We've implemented eager cache trimming in our internal version of Dune, but have been struggling with upstreaming any of our changes (we planned to organise an upstreaming workshop with Tarides but it had to be postponed until 2025).

Regarding "bounding cache disk usage": our approach is to run the cache-trimming job regularly (once per hour). I'm not sure this approach will work externally, since there is no universal way to set up regular jobs; we might want a more portable, bespoke solution for external users.

@edwintorok
Contributor

edwintorok commented Dec 9, 2024

We might want a more portable and bespoke solution externally.

Dune could try to treat the symptoms:

  • when dune gets an out-of-space error, it can immediately trim the cache to the configured size (probably taking a file lock in the cache dir first, to prevent other dune processes from doing the same simultaneously)
  • trim at the beginning and end of each dune build / runtest command when free space on the cache's filesystem is below a certain threshold and sufficient time has elapsed since the previous trim (e.g. >1m)
  • otherwise, trim to the quota periodically (e.g. using user-level systemd timer units or cron jobs, but this would be OS-specific). A simpler approach would be for dune itself to check when the cache was last trimmed and, if more than an hour ago, trim it at the end of the current dune invocation
  • dune build --watch should also check this timer and trigger quota trimming

(Of course, all of the above values should be configurable; the sketch below illustrates the lock-and-trim and stamp-file ideas.)
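
A rough OCaml sketch of the first and third points (all names and thresholds are placeholders; trim_to_configured_size stands in for whatever trimming dune would perform):

```ocaml
(* Take a lock file in the cache dir so that concurrent dune processes
   don't all start trimming at once; returns false if another process
   already holds the lock. The lock is released when the fd is closed. *)
let with_trim_lock cache_dir f =
  let fd =
    Unix.openfile (Filename.concat cache_dir ".trim-lock")
      [ Unix.O_CREAT; Unix.O_WRONLY ] 0o600
  in
  Fun.protect ~finally:(fun () -> Unix.close fd) (fun () ->
    match Unix.lockf fd Unix.F_TLOCK 0 with
    | () -> f (); true
    | exception Unix.Unix_error (_, _, _) -> false)

(* Trim at the end of an invocation if the stamp file says the last trim
   was more than an hour ago. *)
let trim_if_stale ~cache_dir ~trim_to_configured_size =
  let stamp = Filename.concat cache_dir ".last-trim" in
  let stale =
    try Unix.time () -. (Unix.stat stamp).Unix.st_mtime > 3600.
    with Unix.Unix_error (Unix.ENOENT, _, _) -> true
  in
  if stale then
    ignore
      (with_trim_lock cache_dir (fun () ->
           trim_to_configured_size ();
           close_out (open_out stamp) (* touch the stamp *)))
```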

This would avoid wasting too much time when dune build is called repeatedly during development, while still avoiding the most common problems (running out of, or low on, disk space).

Other options are OS/FS specific:

The most portable OS-level solution would be a separate userid or groupid for dune, for which FS-level quotas could then be enforced (although this would still likely require root privileges to set up, which means it won't be possible in shared environments).
