-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REv3: Reduce asymmetry between O(n) output files and O(1) output directories #281
Comments
By reducing the Combining with the discussion in #258, its screaming out to me that we need a separation between the actual cache content (ActionResult blob that is kept on the server side) vs the presentation layer that the client would download. So some clients could opt to get an output tree digest that would make the result relatively smaller, vs having all the paths laid out in the response. Or perhaps some clients would only be interested in directories.
Isn’t this what compression solve out of the box? |
Hi @sluongng,
Note that this issue is written in a very open ended manner. I am not suggesting that we should literally replace all outputs by a single Tree message, or fully decomposed into separate Directory messages. All I'm stating is that we should investigate whether we can eliminate the asymmetry between output files and directories. One could consider encoding the hierarchy of output files using some kind of adaptive data structure, such as B-trees/B+-trees. Or its Merkle tree variant: Prolly trees. If those are sufficiently small, they can be embedded into ActionResult directly. But if they get too large, they can be carved up into additional pieces. Such an approach would actually end up reducing round trips for actions yielding small/shallow output directories compared to what we have now.
To me it's screaming out that as part of REv3 we should do a better job at designing a more efficient representation, so that there's no need for clients to choose between one or the other. As part of REv2 I think #258 strikes a decent balance between impact and ease of shoehorning it into the existing protocol, but I don't consider it to be ideal.
Compression doesn't solve this if you ask me. The working group agreed on letting compression essentially be an overlay. There is no such thing as a canonical representation of a compressed object. This means that if we start to rely on enforcing maximum limits based on the compressed size of objects, it becomes non-deterministic whether a build is allowed to take place or not. Furthermore, Protobuf libraries don't really support parsing Protobuf messages in a streaming manner. This means that even if you were to compress a massive Protobuf message containing many redundant strings into something small, you need to pay the cost at some point decompressing it and storing it in the worker's memory. With regards to expressing output paths, instead of using the simple message OutputPathsSelection {
// If empty, indicate that the current file or directory needs to be captured.
// If non-empty, don't capture the current file or directory. Instead, capture
// files or directories at or below the provided filenames.
map<string, OutputPathsSelection> children = 1;
}
message Command {
....
OutputPathsSelection output_paths = 123;
} So for example, if you wanted to capture these two paths:
You would provide this command: {
"outputPaths": { "children": {
"src": { "children": {
"foo.d": {},
"foo.o": {}
} }
} }
} This might look convoluted in JSON, but at the Protobuf wire format level it should be fairly compact. Such a representation has a couple of advantaged over the existing list of repeated strings:
I tested this representation for some random Bazel test that gets built on my cluster. It looks like this test currently has 2201 bytes of output paths (newline delimited, so larger when stored in a |
I've noticed that people sometimes craft rules (e.g., packaging rules) that have many, many, many output files. This is problematic, for the reason that it makes ActionResult very large. So large that it can't be sent back to the client. ActionResults are not stored in the CAS. This means that clients can't download them in a streaming manner. They are returned as part of the gRPC response, which can generally only be up to 4 MB in size. To prevent Buildbarn from generating ActionResults that are this big, I generally tell my users to configure their clusters to only allow processing of Command messages up to 1-2 MB. As the size of ActionResult is generally proportional to that of Command (1-2x as big), that tends to work.
A solution I often give to my users when they hit these limits is that they should use output directories instead of plain output files. In that case ActionResult remains small. It will only contain a small number of
output_directories
containing references to Tree objects. These can be streamed from the CAS. There are also a couple of advantages on top of that:Though I fully understand where the asymmetry comes from, I do think it's hard to sell to our users. Why is there a difference? From their perspective it's 'tomato tomato'.
I think that as part of REv3 we should investigate whether we can reduce the noticeable differences between using O(n) output files and O(1) output directories. For example, what if every action returns exactly 1 directory hierarchy of outputs, and Command's
output_paths
merely acts as a filter for what needs to be captured as part of that directory hierarchy?(Relatedly, is there anything we can do to reduce the size of Command's
output_paths
by preventing repetition of leading pathnames?)The text was updated successfully, but these errors were encountered: