
[kbn/optimizer] stable/predictable cache key #74918

Closed · spalger opened this issue Aug 13, 2020 · 7 comments
Labels: discuss · Team:Operations (Team label for Operations Team)

Comments

@spalger (Contributor) commented Aug 13, 2020

We'd like to explore uploading kbn-optimizer caches to a central location so that we could download them from there rather than rebuilding them on every developer's machine.

The simple solution would be to take the cache keys that we're already generating, hash them, and then store the caches in zips or something behind the hash of the cache key (example). The problem with this approach is that it requires knowing all of the files in the bundle before the cache key can be generated for the first time. Right now the cache keys are only used to determine whether a cache can be reused or needs to be destroyed and rebuilt, so it's not a problem that we can't generate a cache key without running the build.
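For illustration, a minimal sketch (in TypeScript) of the "hash the cache key and store the archive behind it" idea; the key shape, bucket name, and paths here are hypothetical, not the actual @kbn/optimizer types:

```ts
import { createHash } from 'crypto';

// Hypothetical shape: the optimizer's cache key is already a
// JSON-serializable record of the bundle's inputs.
interface CacheKey {
  optimizerConfig: Record<string, unknown>;
  files: Array<{ path: string; mtime: number }>;
}

// Hash the serialized key so it can address an uploaded archive.
function cacheKeyHash(key: CacheKey): string {
  return createHash('sha256').update(JSON.stringify(key)).digest('hex');
}

// The bundle cache could then live at a bucket path like this
// (bucket name made up for the example):
const archivePath = (key: CacheKey) => `kbn-optimizer-caches/${cacheKeyHash(key)}.zip`;
```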

In order to use a distributed cache we would need to be able to determine a cache key from the source without running the build process, but that's basically impossible for front-end code in our repo: any bundle can include code from basically any node_module and any file in the plugin's source directory. The cache key needs to factor in changes to all of those data points (and only those data points) so that the cache stays valid for as long as possible and is only rebuilt when necessary.

I'm not 100% sure how Bazel does this, but I think I might want to take some spacetime and see if I can't find a way to calculate a predictable cache key from the content of files, one that traverses from source to packages and understands the dependency tree but can still determine the cache key for a bundle in 100ms or so.

@spalger added the discuss and Team:Operations labels Aug 13, 2020
@elasticmachine (Contributor) commented:
Pinging @elastic/kibana-operations (Team:Operations)

@tylersmalley (Contributor) commented:
With Bazel, we have to explicitly define the dependencies, and the builds are then run in a sandbox to ensure those dependencies are correct. Since we are unable to do that here, I expect we'll need some sort of source crawling to infer the dependencies. If that actually ends up being a requirement it would obviously cut down massively on the opportunity here (time to run Webpack vs. time to parse and resolve the tree, then download and extract the cache).

What is the cost of the TypeScript conversion for a plugin? If it is high enough, it would be a lot easier to separate the TypeScript transform from the Webpack bundling, and creating the cache key would be much more straightforward. Additionally, running TypeScript directly per plugin would have the added benefit of outputting the type definitions. This also aligns with what we would be doing in Bazel.

@joshdover (Contributor) commented Aug 13, 2020

Can someone help me understand why the current caching mechanism (which only works locally) doesn't require reading all of the source code but a distributed cache would?

From my understanding, the only inputs that are relevant for an individual plugin's bundle are:

  1. The optimizer configuration
  2. The last modified time of any packages this plugin imports
    • Or something more portable across machines, like the last git commit that affected these files
  3. The last modified time of the plugin's source files
    • Or the last git commit + any local changes, like above
  4. The version number of any node_modules that the plugin imports

I don't believe these things need to be considered:

  • The last modified time of any other plugins that this plugin imports
    • From my understanding of how shared bundles work, we no longer copy any modules from one plugin into another; they are always linked to at runtime
  • The last modified time of any core code
    • Similar to above, this bundle is shared

If my assumptions here are correct, then the only hard part seems to be (4). It seems we could get this by proxy by calculating a hash of the yarn.lock file. We wouldn't even need to do this on a per-plugin basis; we could just cache by which version of the yarn.lock file you are using.
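A minimal sketch of that yarn.lock proxy, using only Node built-ins; the function name is made up:

```ts
import { createHash } from 'crypto';
import { readFileSync } from 'fs';
import { join } from 'path';

// Proxy for "which node_modules versions are installed": hash the lockfile.
// Any dependency upgrade changes yarn.lock, which changes this digest.
function yarnLockHash(repoRoot: string): string {
  const lock = readFileSync(join(repoRoot, 'yarn.lock'));
  return createHash('sha256').update(lock).digest('hex');
}
```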

@spalger (Contributor, Author) commented Aug 14, 2020

> help me understand why the current caching mechanism (which only works locally) doesn't require reading all of the source code but a distributed cache would?

The current caching mechanism does read all of the source code when creating the cache key. The key includes a list of all the files that are included in the bundle and each file's modified time at the point it was included, plus the package.json files of the modules imported by the bundle. To invalidate the cache, one of the files in that list needs to have a new modified timestamp; otherwise, passing those files back into webpack would create exactly the same output, so we can skip recreating the bundle. I personally think this accuracy is important.
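A simplified model of that validity check (field and function names are illustrative, not the actual @kbn/optimizer implementation):

```ts
import { statSync } from 'fs';

// Every file that ended up in the bundle, with the mtime it had
// when the bundle was built.
interface BundleCacheKey {
  mtimes: Record<string, number>; // absolute path -> mtime in ms
}

// The cache is reusable only if no recorded file has a different mtime;
// identical inputs mean webpack would produce identical output.
function isCacheValid(key: BundleCacheKey): boolean {
  return Object.entries(key.mtimes).every(([path, mtime]) => {
    try {
      return statSync(path).mtimeMs === mtime;
    } catch {
      return false; // file deleted or unreadable -> rebuild
    }
  });
}
```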

We now limit the files that can be found in a bundle to prevent files from other plugins finding their way in, but until we have an even more serious limit (like the sandboxing Tyler describes) I don't think we can try to guess which files are in the bundle and expect accurate results.

And I really don't think we want inaccurate cache key generation.

All that said, I think we can do something similar to haste-map and discover the entire module tree without taking too much time, ultimately producing a very accurate cache key. I'm still planning to do this in my space time, and if it works we could use that key as the distributed cache key.
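For a sense of what that discovery could look like, a very rough sketch: scan for import specifiers with a regex instead of a full parse, then walk the graph from a bundle's entry. The resolution logic is deliberately naive (relative specifiers only) and all names are made up:

```ts
import { readFileSync, statSync } from 'fs';
import { dirname, resolve } from 'path';

// Matches `from '...'` and `require('...')` specifiers without parsing.
const IMPORT_RE = /\bfrom\s+['"]([^'"]+)['"]|\brequire\(\s*['"]([^'"]+)['"]\s*\)/g;

// Naive resolution for relative specifiers; a real implementation also
// needs node_modules and @kbn/* package resolution.
function resolveRelative(fromFile: string, specifier: string): string | undefined {
  const base = resolve(dirname(fromFile), specifier);
  for (const candidate of [base, `${base}.ts`, `${base}.tsx`, `${base}.js`, `${base}/index.ts`]) {
    try {
      if (statSync(candidate).isFile()) return candidate;
    } catch {
      // try the next candidate
    }
  }
  return undefined;
}

function discoverModuleTree(entry: string, seen = new Set<string>()): Set<string> {
  if (seen.has(entry)) return seen;
  seen.add(entry);
  for (const match of readFileSync(entry, 'utf8').matchAll(IMPORT_RE)) {
    const specifier = match[1] ?? match[2];
    if (specifier && specifier.startsWith('.')) {
      const resolved = resolveRelative(entry, specifier);
      if (resolved) discoverModuleTree(resolved, seen);
    }
    // bare specifiers (node_modules, @kbn/* packages) are skipped here
  }
  return seen;
}
```

Hashing the contents of the returned file set would then yield a machine-independent cache key.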

@spalger (Contributor, Author) commented Aug 18, 2020

I might have an idea:

  • Convert the cache key to be machine-agnostic by converting absolute paths to relative paths with normalized path separators.
  • On every commit to tracked branches, build the repo as quickly as possible and upload the cache to a bucket along with a manifest for that commit.
  • When initializing the optimizer (see the sketch below):
    • Determine the merge-base with the branch referenced in the root package.json file (assumed to be a tracked branch).
    • Attempt to download the manifest for the merge-base commit; if that fails, try a few commits traversing back in time.
    • Determine the changed paths since the merge-base commit (including uncommitted changes); the manifest includes the metadata necessary to determine which bundles can be downloaded given those changed paths.
    • Download any bundle that only references files which haven't changed since that commit; the manifest includes the URLs for every bundle's files.
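A rough sketch of that lookup flow; the manifest shape, helper names, and branch handling are all assumptions for illustration:

```ts
import { execSync } from 'child_process';

const sh = (cmd: string) => execSync(cmd, { encoding: 'utf8' }).trim();

// Assumed manifest shape: per bundle, the repo-relative input paths it was
// built from and the URL of its uploaded cache archive.
interface Manifest {
  bundles: Array<{ id: string; files: string[]; url: string }>;
}

function downloadableBundles(manifest: Manifest, trackedBranch: string) {
  const mergeBase = sh(`git merge-base HEAD ${trackedBranch}`);
  // Committed and uncommitted changes since the merge-base (untracked
  // files would need `git status` as well).
  const changed = new Set(
    sh(`git diff --name-only ${mergeBase}`).split('\n').filter(Boolean)
  );
  // A bundle's cache can be downloaded only if none of its inputs changed.
  return manifest.bundles.filter((b) => b.files.every((f) => !changed.has(f)));
}
```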

Eventually:

@tylersmalley (Contributor) commented:
I wonder if trying to optimize which plugins we download would actually be more performant.

In doing so, we would need to walk the commits until we can fetch a manifest, analyze it to identify which plugins need to be fetched, and then fetch each individual plugin (until we have a way to mark these as being served from the CDN).

If instead the entire cache, manifest and all, were populated on bootstrap, then when building the platform plugins the optimizer would only build what has changed relative to the cache, which it already handles.

@tylersmalley added and removed labels Oct 11, 2021
@tylersmalley (Contributor) commented Feb 16, 2022

I am going to close this; we're moving forward with migrating the optimizer to Bazel rather than improving the kbn/optimizer caching strategy.
