Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote_exec.proto: support ZSTD with dictionary #276

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
53 changes: 52 additions & 1 deletion build/bazel/remote/execution/v2/remote_execution.proto
Original file line number Diff line number Diff line change
Expand Up @@ -430,6 +430,13 @@ service ContentAddressableStorage {
rpc GetTree(GetTreeRequest) returns (stream GetTreeResponse) {
option (google.api.http) = { get: "/v2/{instance_name=**}/blobs/{root_digest.hash}/{root_digest.size_bytes}:getTree" };
}

// Fetch the list of dictionaries use in Zstd compression.
//
// Servers MUST implement this RPC method iff the `ZSTD_DICT` compressor is supported.
rpc GetZstdDictionaries(GetZstdDictionariesRequest) returns (GetZstdDictionariesResponse) {
option (google.api.http) = { get: "/v2/{instance_name=**}/blobs:zstdDictionaries" };
}
}

// The Capabilities service may be used by remote execution clients to query
Expand Down Expand Up @@ -1814,6 +1821,40 @@ message GetTreeResponse {
string next_page_token = 2;
}

message GetZstdDictionariesRequest {
// The instance of the execution system to operate against. A server may
// support multiple instances of the execution system (with their own workers,
// storage, caches, etc.). The server MAY require use of this field to select
// between them in an implementation-defined fashion, otherwise it can be
// omitted.
string instance_name = 1;

// The digest function in use by the client, which determines the
// digest function for all Digests in the response.
// Must be one of the server's supported digest functions listed in
// [CacheCapabilities.digest_functions][build.bazel.remote.execution.v2.CacheCapabilities.digest_functions].
DigestFunction.Value digest_function = 2;
}

message GetZstdDictionariesResponse {
message Dictionary {
uint32 id = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we could make dictionaries reasonably generic by replacing this uint32 with an Any. That wouldn't account for session-specific dictionaries, but it should allow for ~any static dictionary. The unpacking of the Any, and understanding what blobs it applies to, is very much compressor-specific.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an "id" specific field, I would say we should use string instead of Any at the minimum. Any delegates the responsibility down to implementations to validate the data format which could be a pain to work with.

If generic is a requirement here, I think something like this could be a good alternative

message GetDictionariesResponse {
  message DefaultDictionaryId {
    oneof id {
      uint32 zstd_dictionary_id = 1;
    }
  }
  // The dictionary ID that clients SHOULD use when uploading
  // objects into the Content Addressable Storage. An entry for
  // this dictionary ID MUST be present in `dictionaries`.
  DefaultDictionaryId default_dictionary_id = 1;

  message ZstdDictionary {
    uint32 id = 1;
    Digest digest = 2;
  }
  message ZstdDictionaries {
    repeated ZstdDictionaries zstd_dictionaries = 1;
  }
  message Dictionaries {
    oneof dictionaries {
      ZstdDictionaries zstd_dictionaries = 1;
    }
  }

  // Dictionaries that may be used by the server when returning
  // objects stored in the Content Addressable Storage. The key
  // corresponds to a dictionary ID, as described in RFC 8878,
  // section 5, while the value refers to a dictionary
  // as described in RFC 8878, chapter 6.
  Dictionaries dictionaries = 2;
}

This would enable future Dictionary types (if there were any) could be added incrementally. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also be fine with a string, but given the conversation yesterday regarding people wanting more structured types to avoid the need to shove json into string fields, thought that Any might be appropriate. Using a string for an ID field would be more consistent with the rest of the API. In any case, I don't think that using string vs Any absolves the client from the need to validate the input. In this case of zstd, for example, the client will need to parse the string into an int ID to match with the ID encoded in the compressed blob.

At some level what I'm trying to avoid is needing to modify this API with each dictionary-compression function that we add, which means trying to remove anything zstd-specific. It's not realistic to design an API that will work in all cases without investing significantly in understanding, but there are only so many realistic ways to get dictionary-based compression to work, and I have to believe that "encode a dictionary ID in the compressed blob" is reasonably common.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an "id" specific field, I would say we should use string instead of Any at the minimum. Any delegates the responsibility down to implementations to validate the data format which could be a pain to work with.

But doesn't a string also require that an implementation validates the data format? There are also strings that are not valid integers.

Digest digest = 2;
}

// The dictionary ID that clients SHOULD use when uploading
// objects into the Content Addressable Storage. An entry for
// this dictionary ID MUST be present in `dictionaries`.
uint32 default_dictionary_id = 1;

// Dictionaries that may be used by the server when returning
// objects stored in the Content Addressable Storage. The key
// corresponds to a dictionary ID, as described in RFC 8878,
// section 5, while the value refers to a dictionary
// as described in RFC 8878, chapter 6.
repeated Dictionary dictionaries = 2;
}

// A request message for
// [Capabilities.GetCapabilities][build.bazel.remote.execution.v2.Capabilities.GetCapabilities].
message GetCapabilitiesRequest {
Expand Down Expand Up @@ -1984,7 +2025,7 @@ message Compressor {
// not need to advertise it.
IDENTITY = 0;

// Zstandard compression.
// Zstandard compression without dictionary.
ZSTD = 1;

// RFC 1951 Deflate. This format is identical to what is used by ZIP
Expand All @@ -1997,6 +2038,16 @@ message Compressor {

// Brotli compression.
BROTLI = 3;

// Zstandard compression with dictionary.
sluongng marked this conversation as resolved.
Show resolved Hide resolved
//
sluongng marked this conversation as resolved.
Show resolved Hide resolved
// Servers which support this compressor MUST implement
// [ContentAddressableStorage.GetZstdDictionaries][build.bazel.remote.execution.v2.ContentAddressableStorage.GetZstdDictionaries].
//
// Clients which support this compressor SHOULD call `GetZstdDictionaries`
// rpc to obtain the dictionaries from the server to be used for
// ZSTD compression/decompression.
ZSTD_DICT = 4;
}
}

Expand Down