
[kbn/optimizer] stable/predictable cache key #74918

Closed · spalger opened this issue Aug 13, 2020 · 7 comments
Labels: discuss · Team:Operations (Team label for Operations Team)

Comments

@spalger (Contributor) commented Aug 13, 2020

We'd like to explore uploading kbn-optimizer caches to a central location so that we could download them from there rather than rebuilding them on every developer's machine.

The simple solution would be to take the cache keys that we're already generating, hash them, and then store the caches in zips or something behind the hash of the cache key (example). The problem with this approach is that it requires knowing all of the files in the bundle before the cache key can be generated for the first time. Right now the cache keys are only used to determine whether a cache can be reused or needs to be destroyed and rebuilt, so it's not a problem that we can't generate a cache key without running the build.
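For illustration, a minimal sketch (in TypeScript) of the "hash the cache key and store the archive behind it" idea; the key shape, bucket name, and paths here are hypothetical, not the actual @kbn/optimizer types:

```ts
import { createHash } from 'crypto';

// Hypothetical shape: the optimizer's cache key is already a
// JSON-serializable record of the bundle's inputs.
interface CacheKey {
  optimizerConfig: Record<string, unknown>;
  files: Array<{ path: string; mtime: number }>;
}

// Hash the serialized key so it can address an uploaded archive.
function cacheKeyHash(key: CacheKey): string {
  return createHash('sha256').update(JSON.stringify(key)).digest('hex');
}

// The bundle cache could then live at a bucket path like this
// (bucket name made up for the example):
const archivePath = (key: CacheKey) => `kbn-optimizer-caches/${cacheKeyHash(key)}.zip`;
```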

In order to use a distributed cache we would need to be able to determine a cache key from the source without running the build process, but that's basically impossible for front-end code in our repo: any bundle can include code from basically any node_module and any file in the plugin's source directory. The cache key needs to factor in changes to all of those data points (and only those data points) so that the cache stays valid for as long as possible and is only rebuilt when necessary.

I'm not 100% sure how Bazel does this, but I think I might want to take some spacetime and see if I can't find a way to calculate a predictable cache key from the content of files, one that traverses from source to packages and understands the dependency tree but can still determine the cache key for a bundle in 100ms or so.

@spalger added the discuss and Team:Operations labels Aug 13, 2020
@elasticmachine (Contributor) commented:
Pinging @elastic/kibana-operations (Team:Operations)

@tylersmalley (Contributor) commented:
With Bazel, we have to explicitly define the dependencies, and the builds are then run in a sandbox to ensure those dependencies are correct. Since we are unable to do that here, I expect we'll need some sort of source crawling to infer the dependencies. If that actually ends up being a requirement it would obviously cut down massively on the opportunity here (time to run Webpack vs. time to parse and resolve the tree, then download and extract the cache).

What is the cost of the TypeScript conversion for a plugin? If it is high enough, it would be a lot easier to separate the TypeScript transform from the Webpack bundling, and creating the cache key would be much more straightforward. Additionally, running TypeScript directly per plugin would have the added benefit of outputting the type definitions. This also aligns with what we would be doing in Bazel.

@joshdover (Contributor) commented Aug 13, 2020

Can someone help me understand why the current caching mechanism (which only works locally) doesn't require reading all of the source code but a distributed cache would?

From my understanding, the only inputs that are relevant for an individual plugin's bundle are:

  1. The optimizer configuration
  2. The last modified time of any packages this plugin imports
    • Or something more portable across machines, like the last git commit that affected these files
  3. The last modified time of the plugin's source files
    • Or the last git commit + any local changes, like above
  4. The version number of any node_modules that the plugin imports

I don't believe these things need to be considered:

  • The last modified time of any other plugins that this plugin imports
    • From my understanding of how shared bundles work, we no longer copy any modules from one plugin into another; they are always linked to at runtime
  • The last modified time of any core code
    • Similar to above, this bundle is shared

If my assumptions here are correct, then the only hard part seems to be (4). It seems we could get this by proxy by calculating a hash of the yarn.lock file. We wouldn't even need to do this on a per-plugin basis; we could just cache by which version of the yarn.lock file you are using.
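A minimal sketch of that yarn.lock proxy, using only Node built-ins; the function name is made up:

```ts
import { createHash } from 'crypto';
import { readFileSync } from 'fs';
import { join } from 'path';

// Proxy for "which node_modules versions are installed": hash the lockfile.
// Any dependency upgrade changes yarn.lock, which changes this digest.
function yarnLockHash(repoRoot: string): string {
  const lock = readFileSync(join(repoRoot, 'yarn.lock'));
  return createHash('sha256').update(lock).digest('hex');
}
```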

@spalger (Contributor, Author) commented Aug 14, 2020

> help me understand why the current caching mechanism (which only works locally) doesn't require reading all of the source code but a distributed cache would?

The current caching mechanism does read all of the source code when creating the cache key. The key includes a list of all the files that are included in the bundle and each file's modified time at the point it was included, plus the package.json files of the modules imported by the bundle. To invalidate the cache, one of the files in that list needs to have a new modified timestamp; otherwise, passing those files back into webpack would create exactly the same output, so we can skip recreating the bundle. I personally think this accuracy is important.
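A simplified model of that validity check (field and function names are illustrative, not the actual @kbn/optimizer implementation):

```ts
import { statSync } from 'fs';

// Every file that ended up in the bundle, with the mtime it had
// when the bundle was built.
interface BundleCacheKey {
  mtimes: Record<string, number>; // absolute path -> mtime in ms
}

// The cache is reusable only if no recorded file has a different mtime;
// identical inputs mean webpack would produce identical output.
function isCacheValid(key: BundleCacheKey): boolean {
  return Object.entries(key.mtimes).every(([path, mtime]) => {
    try {
      return statSync(path).mtimeMs === mtime;
    } catch {
      return false; // file deleted or unreadable -> rebuild
    }
  });
}
```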

We now limit the files that can be found in a bundle to prevent files from other plugins finding their way in, but until we have an even more serious limit (like the sandboxing Tyler describes) I don't think we can try to guess which files are in the bundle and expect accurate results.

And I really don't think we want inaccurate cache key generation.

All that said, I think we can do something similar to haste-map and discover the entire module tree without taking too much time, ultimately producing a very accurate cache key. I'm still planning to do this in my space time, and if it works we could use that key as the distributed cache key.
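For a sense of what that discovery could look like, a very rough sketch: scan for import specifiers with a regex instead of a full parse, then walk the graph from a bundle's entry. The resolution logic is deliberately naive (relative specifiers only) and all names are made up:

```ts
import { readFileSync, statSync } from 'fs';
import { dirname, resolve } from 'path';

// Matches `from '...'` and `require('...')` specifiers without parsing.
const IMPORT_RE = /\bfrom\s+['"]([^'"]+)['"]|\brequire\(\s*['"]([^'"]+)['"]\s*\)/g;

// Naive resolution for relative specifiers; a real implementation also
// needs node_modules and @kbn/* package resolution.
function resolveRelative(fromFile: string, specifier: string): string | undefined {
  const base = resolve(dirname(fromFile), specifier);
  for (const candidate of [base, `${base}.ts`, `${base}.tsx`, `${base}.js`, `${base}/index.ts`]) {
    try {
      if (statSync(candidate).isFile()) return candidate;
    } catch {
      // try the next candidate
    }
  }
  return undefined;
}

function discoverModuleTree(entry: string, seen = new Set<string>()): Set<string> {
  if (seen.has(entry)) return seen;
  seen.add(entry);
  for (const match of readFileSync(entry, 'utf8').matchAll(IMPORT_RE)) {
    const specifier = match[1] ?? match[2];
    if (specifier && specifier.startsWith('.')) {
      const resolved = resolveRelative(entry, specifier);
      if (resolved) discoverModuleTree(resolved, seen);
    }
    // bare specifiers (node_modules, @kbn/* packages) are skipped here
  }
  return seen;
}
```

Hashing the contents of the returned file set would then yield a machine-independent cache key.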

@spalger (Contributor, Author) commented Aug 18, 2020

I might have an idea:

  • Convert the cache key to be machine-agnostic by converting absolute paths to relative paths with normalized path separators.
  • On every commit to tracked branches, build the repo as quickly as possible and upload the cache to a bucket along with a manifest for that commit.
  • When initializing the optimizer (see the sketch below):
    • Determine the merge-base with the branch referenced in the root package.json file (assumed to be a tracked branch).
    • Attempt to download the manifest for the merge-base commit; if that fails, try a few commits traversing back in time.
    • Determine the changed paths since the merge-base commit (including uncommitted changes); the manifest includes the metadata necessary to determine which bundles can be downloaded given those changed paths.
    • Download any bundle that only references files which haven't changed since that commit; the manifest includes the URLs for every bundle's files.
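A rough sketch of that lookup flow; the manifest shape, helper names, and branch handling are all assumptions for illustration:

```ts
import { execSync } from 'child_process';

const sh = (cmd: string) => execSync(cmd, { encoding: 'utf8' }).trim();

// Assumed manifest shape: per bundle, the repo-relative input paths it was
// built from and the URL of its uploaded cache archive.
interface Manifest {
  bundles: Array<{ id: string; files: string[]; url: string }>;
}

function downloadableBundles(manifest: Manifest, trackedBranch: string) {
  const mergeBase = sh(`git merge-base HEAD ${trackedBranch}`);
  // Committed and uncommitted changes since the merge-base (untracked
  // files would need `git status` as well).
  const changed = new Set(
    sh(`git diff --name-only ${mergeBase}`).split('\n').filter(Boolean)
  );
  // A bundle's cache can be downloaded only if none of its inputs changed.
  return manifest.bundles.filter((b) => b.files.every((f) => !changed.has(f)));
}
```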

Eventually:

@tylersmalley (Contributor) commented:
I wonder if trying to optimize which plugins we download would actually be more performant.

In doing so, we would need to walk the commits until we can fetch a manifest, analyze it to identify which plugins need to be fetched, and then fetch each individual plugin (until we have a way to mark these as being served from the CDN).

If instead the entire cache, manifest and all, were populated on bootstrap, then when building the platform plugins the optimizer would only build what has changed relative to the cache, which it already handles.

@tylersmalley added and removed labels Oct 11, 2021
@tylersmalley (Contributor) commented Feb 16, 2022

I am going to close this; we're moving forward with migrating the optimizer to Bazel rather than improving the kbn/optimizer caching strategy.
