[kbn/optimizer] stable/predictable cache key #74918
Comments
Pinging @elastic/kibana-operations (Team:Operations)
With Bazel, we have to explicitly define the dependencies, and the builds are then run in a sandbox to ensure those dependency declarations are correct. Since we are unable to do that here, I expect we will need some sort of source crawling to infer the dependencies. If that actually ends up being a requirement, it would obviously massively cut down on the opportunity here (time to run Webpack vs. time to parse and resolve the tree, then download and extract the cache).

What is the cost of the TypeScript conversion for a plugin? If it is high enough, it would be a lot easier to separate the TypeScript transform from the Webpack bundling, and creating the cache key would be much more straightforward. Additionally, using TypeScript directly per plugin would have the added benefit of outputting the type definitions. This also aligns with what we would be doing in Bazel.
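If the TypeScript transform were split out, each plugin could be compiled with the TypeScript compiler API directly. A minimal sketch of what that per-plugin compile might look like, using the public `ts.createProgram` API; the function name, options, and output layout are illustrative, not the real kbn/optimizer configuration:

```ts
import * as ts from 'typescript';

/**
 * Hypothetical sketch: compile one plugin's TS sources on their own,
 * emitting JS for the bundler plus the .d.ts files mentioned above.
 */
function compilePlugin(rootNames: string[], outDir: string): boolean {
  const program = ts.createProgram(rootNames, {
    outDir,
    declaration: true, // emit type definitions alongside the JS
    module: ts.ModuleKind.CommonJS,
    target: ts.ScriptTarget.ES2018,
  });

  const result = program.emit();
  const diagnostics = ts.getPreEmitDiagnostics(program).concat(result.diagnostics);
  for (const d of diagnostics) {
    console.error(ts.flattenDiagnosticMessageText(d.messageText, '\n'));
  }
  return !result.emitSkipped;
}
```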
Can someone help me understand why the current caching mechanism (which only works locally) doesn't require reading all of the source code but a distributed cache would? From my understanding, the only inputs that are relevant for an individual plugin's bundle are:
I don't believe these things need to be considered:
If my assumptions here are correct, then the only hard part seems to be (4). It seems we could get this by proxy just by calculating a hash of the yarn.lock file. We don't even need to do this on a per-plugin basis; we can just cache by which version of the yarn.lock file you are using.
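A sketch of that proxy key, assuming a `yarn.lock` at the repo root and only Node's built-in `crypto` and `fs` modules (the function name is hypothetical):

```ts
import { createHash } from 'crypto';
import { readFileSync } from 'fs';

// Proxy for "state of node_modules": hash the lockfile once per repo
// instead of tracking imported package.json files per plugin.
function yarnLockKey(repoRoot: string): string {
  return createHash('sha256')
    .update(readFileSync(`${repoRoot}/yarn.lock`))
    .digest('hex');
}
```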
The current caching mechanism does read all of the source code when creating the cache key. It includes a list of all the files that are included in the bundle and the modified time of each file when it was included, including the package.json files of the modules imported by the bundle. To invalidate the cache, one of the files in that list needs to have a new modified timestamp; otherwise, passing those files back into webpack would create the exact same output, so we can skip recreating the bundle.

I personally think this accuracy is important. We now limit the files that can be found in a bundle to prevent files from other plugins finding their way in, but until we have an even more serious limit (like the sandboxing that Tyler describes) I don't think we can try to guess which files are in the bundle and expect accurate results. And I really don't think we want inaccurate cache key generation.

All that said, I think we can do something similar to haste-map and discover the entire module tree without taking too much time, ultimately producing a very accurate cache key. I'm still planning to do this in my space time, and if it works we could use that key as the distributed cache key.
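In other words, validating the current cache is a pure mtime comparison over the recorded file list. A minimal sketch of that check; the shape of the recorded entries is assumed here, not the actual kbn/optimizer cache format:

```ts
import { statSync } from 'fs';

// files: the list recorded when the bundle was last built, pairing each
// path with the mtime observed at build time.
function cacheStillValid(files: Array<{ path: string; mtimeMs: number }>): boolean {
  return files.every((f) => {
    try {
      return statSync(f.path).mtimeMs === f.mtimeMs;
    } catch {
      return false; // file was deleted or moved, so invalidate
    }
  });
}
```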
I might have an idea:
Eventually:
I wonder whether trying to optimize which plugins we download would actually be more performant. To do so, we would need to walk the commits until we find a manifest, analyze it to identify which plugins need to be fetched, and then fetch each individual plugin (until we have a way to mark these as being served from the CDN). If instead the entire cache, manifest and all, were populated on bootstrap, then building the platform plugins would only rebuild what has changed relative to the cache, which the optimizer already handles.
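A rough sketch of the commit-walking lookup described above, assuming a hypothetical remote bucket that stores one manifest per commit SHA, and Node 18+ for the global `fetch`:

```ts
import { execSync } from 'child_process';

// Walk recent commits until one of them has a manifest in the
// (hypothetical) remote cache bucket; returns the parsed manifest.
async function findNearestManifest(bucketUrl: string): Promise<unknown | undefined> {
  const shas = execSync('git rev-list --max-count=50 HEAD', { encoding: 'utf8' })
    .trim()
    .split('\n');

  for (const sha of shas) {
    const res = await fetch(`${bucketUrl}/manifests/${sha}.json`);
    if (res.ok) {
      return res.json();
    }
  }
  return undefined; // no cached manifest within the window
}
```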
I am going to close this; we're moving forward with migrating the optimizer to Bazel rather than improving the kbn/optimizer caching strategy.
We'd like to explore uploading kbn-optimizer caches to a central location so that we could download them rather than rebuilding them on every developer's machine.
The simple solution would be to take the cache keys that we're already generating, hash them, and then store the caches in zips or something behind the hash of the cache key (example). The problem with this approach is that it requires knowing all the files that are in the bundle to generate the cache key for the first time. The only thing we currently use the cache keys for is determining whether the cache can be reused or needs to be destroyed and rebuilt, so it hasn't been a problem that we can't generate a cache key without running the build.
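For illustration, the "simple solution" amounts to something like the following, where the URL layout and function name are hypothetical:

```ts
import { createHash } from 'crypto';

// Hash the cache key we already generate and use the digest as the
// name of an archive in some central store.
function archiveUrlForCacheKey(cacheKey: string, baseUrl: string): string {
  const digest = createHash('sha256').update(cacheKey).digest('hex');
  return `${baseUrl}/kbn-optimizer-caches/${digest}.zip`;
}
```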
In order to use a distributed cache, we would need to be able to determine a cache key from the source without running the build process. This is basically impossible for front-end code in our repo, as any bundle can include code from basically any node_module and any file in the plugin's source directory. The cache key needs to factor in changes to all of those data points (and only those data points) so that the cache stays valid for as long as possible and is only rebuilt when necessary.
I'm not 100% sure how Bazel does this, but I think I might want to take some space time and see if I can't find a way to calculate a predictable cache key from the content of the files, one that traverses from source to packages, understands the dependency tree, and can still determine the cache key for a bundle in 100ms or so.
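As a very rough sketch of that idea, here is a content-based key that follows relative imports from an entry file and hashes every file reached. Real module resolution (node_modules, index files, extension probing, tsconfig paths) is far more involved, so treat this as an outline only:

```ts
import { createHash } from 'crypto';
import { readFileSync } from 'fs';
import * as path from 'path';

// Naive traversal: only handles `from './relative'` imports and assumes
// a .ts extension; a real implementation needs full module resolution.
function contentCacheKey(entry: string, seen = new Set<string>()): string {
  const hash = createHash('sha256');

  const walk = (file: string) => {
    if (seen.has(file)) return;
    seen.add(file);

    const source = readFileSync(file, 'utf8');
    hash.update(file).update(source); // path + content both feed the key

    for (const m of source.matchAll(/from\s+['"](\.[^'"]+)['"]/g)) {
      walk(path.resolve(path.dirname(file), m[1] + '.ts'));
    }
  };

  walk(entry);
  return hash.digest('hex');
}
```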