Require that early-return compressed-blobs bytestream uploads set committed_size -1 #213
…mitted_size -1

We require that uncompressed bytestream uploads specify committed_size set to the size of the blob when returning early (if the blob already exists on the server).

We also require that for compressed bytestream uploads committed_size refers to the initial write offset plus the number of compressed bytes uploaded. But if the server wants to return early in this case it doesn't know how many compressed bytes would have been uploaded (the client might not know this ahead of time either). So let's require that the server set committed_size to -1 in this case.

For early return to work, we also need to ensure that the server does not return an error code.

Resolves bazelbuild#212.
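The rule described above can be sketched from the server's side. This is only an illustration under stated assumptions, not code from any REAPI server: `Digest` and `early_return_committed_size` are hypothetical names, and the sketch assumes the server has already determined that the blob exists and wants to stop reading the upload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Digest:
    """Hypothetical stand-in for a REAPI digest (hash plus uncompressed size)."""
    hash: str
    size_bytes: int


def early_return_committed_size(digest: Digest, compressed: bool) -> int:
    """committed_size to report when returning early because the blob exists.

    Uncompressed uploads: the size of the blob, as already required.
    Compressed uploads: -1, because the server cannot know how many
    compressed bytes the client would have sent.
    """
    if compressed:
        return -1
    return digest.size_bytes
```

In both cases the response is assumed to carry a success status, since the description requires that the server not return an error code for early return.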
IIUC, based on the reasoning in #212 (comment) we don't want to return But it seems like this approach of returning |
Just spitballing here, but another option might be to make use of the new |
To clarify, I think we should maintain backwards compatibility for uncompressed uploads (i.e. clients can check for committed_size == digest size), but we need to make a breaking change for compressed uploads, because some existing clients are always going to be broken in the current setup unless we forbid servers from doing an early return for compressed uploads (which doesn't sound great to me). This feature is still experimental in bazel, so IMO we can decide on a reasonable behaviour that supports early return for compressed uploads without changing the requirements for uncompressed uploads, and then submit a fix for bazel. I think specifying -1 for early-returned compressed uploads is simple, but I would also be OK with returning AlreadyExists (though it would be a slightly larger bazel-side fix). Re using the bytestream resource metadata field, it's generally disliked (and anyway it would require a client-side fix, in which case we can pick another client-side fix). |
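A client-side check along the lines described in this comment might look like the sketch below. It is an assumption-laden illustration, not bazel's actual code: `write_completed` and its parameters are hypothetical names, and the compressed branch combines the -1 proposal with the existing "initial write offset plus compressed bytes uploaded" definition.

```python
def write_completed(
    committed_size: int,
    digest_size: int,
    compressed: bool,
    initial_offset: int = 0,
    compressed_bytes_sent: int = 0,
) -> bool:
    """Decide whether a successful write response means the blob is stored."""
    if not compressed:
        # Unchanged behaviour: full uploads and early returns both
        # report the size of the blob.
        return committed_size == digest_size
    if committed_size == -1:
        # Proposed behaviour: the server returned early because the
        # blob already exists.
        return True
    # Full compressed upload: initial write offset plus the number of
    # compressed bytes uploaded.
    return committed_size == initial_offset + compressed_bytes_sent
```

For example, `write_completed(-1, 1000, compressed=True, compressed_bytes_sent=300)` is accepted as an early return, while the same -1 on an uncompressed upload is rejected.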
CC @AlessandroPatti, who implemented this functionality in bazel: it looks like you tried to account for this scenario. Maybe there's a corner case where the uploader sends all the data but the server "returns early" in the last chunk? What do you think about the server returning -1 in this case (or alternatively returning an AlreadyExists error)? |
My first impulse here is to make compressed and uncompressed uploads match. Both should return the total number of uncompressed bytes captured in the blob (effectively, the size field in the blob's digest). This means that the server must uncompress the uploaded data to calculate the size, but servers must already do this to validate the digest. Of course, then we get back to the issue of returning compressed vs. uncompressed bytes that plagues our implementation of Bytestream.Write. I think that if we're going to make a breaking change--which it does indeed seem like this situation requires--we should aim for the "correct" semantic of returning AlreadyExists. @EricBurnett for any relevant context on the history of why we didn't do this in the first place. @coeuvre for review on the Bazel side. |
+1 for setting Committed Size to the *uncompressed* size of the blob for
early returns. That also has the advantage of matching
QueryWriteStatusResponse, which should be returning the uncompressed size
for a complete blob irrespective of encoding.
The problem we've had with ByteStream.Write is rooted in
https://github.com/googleapis/googleapis/blob/master/google/bytestream/bytestream.proto#L138-L144
- our use of ByteStream comes with its own field semantics, which don't
match well to the way we use bytestreams for compressed payloads of
logically-uncompressed data. We definitely want to fork away from the
ByteStream proto when we do a breaking change of the API, at which point we
can split the fields to have a clean separation between "logical sizes" and
"transported bytes".
…On Mon, Feb 7, 2022 at 5:45 PM Steven Bergsieker ***@***.***> wrote:
My first impulse here is to make compressed and uncompressed uploads
match. Both should return the total number of uncompressed bytes captured
the blob (effectively, the size field in the blob's digest). This means
that the server must uncompress the uploaded data to calculate the size,
but servers must already do this to validate the digest. Of course, then we
get back to the issue of returning compressed vs. uncompressed bytes that
plagues our implementation of Bytestream.Write.
I think that if we're going to make a breaking change--which it does
indeed seem like this situation requires--we should aim for the "correct"
semantic of returning AlreadyExists.
@EricBurnett <https://github.com/EricBurnett> for any relevant context on
the history of why we didn't do this in the first place.
@coeuvre <https://github.com/coeuvre> for review on the Bazel side.
|
I don't think there is necessarily a mapping of "N compressed bytes transferred" to "M uncompressed bytes transferred" for arbitrary compression formats, unless you allow for possibly confusing scenarios like "The client wrote a bunch of compressed data successfully, but not enough for the server to decode another frame, so the uncompressed bytes written did not increase". This is why we chose a kind of fake definition of committed_size for compressed blobs (initial uncompressed offset + compressed bytes written).
The downside of that approach is that committed_size might decrease during a successful upload, which feels super confusing (e.g. it increases monotonically as expected for zstd-encoded data which turns out to be larger than the uncompressed data, and then the server does an early return near the end of the transfer). |
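The decreasing-value scenario can be reproduced with a small sketch. It makes several assumptions: `zlib` from the Python standard library stands in for zstd, the payload is pseudo-random (and therefore incompressible) data, and the "server" is simulated by plain arithmetic rather than a real ByteStream implementation.

```python
import random
import zlib

# Incompressible payload: deterministic pseudo-random bytes.
raw = random.Random(0).getrandbits(8 * 1000).to_bytes(1000, "big")
compressed = zlib.compress(raw)  # zlib used here in place of zstd

initial_offset = 0

# committed_size as defined for compressed uploads: initial write offset
# plus compressed bytes written. Simulate the server having received
# everything except the final byte.
seen_before_early_return = initial_offset + len(compressed) - 1

# Under the "report the uncompressed size on early return" proposal, the
# server would then answer with the digest size.
early_return_value = len(raw)

print(len(raw), len(compressed))
print(seen_before_early_return > early_return_value)
```

Because the compressed stream is longer than the original data, the value observed just before the early return exceeds the value reported by it, which is the confusing decrease described above.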
Hmmm, looking at this again, I think that returning uncompressed size contradicts the general guidance that we've given that committed_size should be (uncompressed offset) + (sum of compressed data bundles). So in that case successful, full compressed uploads should return the compressed data size, while successful, early-terminated compressed uploads would return the uncompressed data size. That makes no sense. Side note: I thought we'd settled on returning the uncompressed data size for compressed uploads, for the reason that Eric mentioned--it's the only value that makes sense to provide as an offset to future calls. Am I misremembering? I continue to believe that returning ALREADY_EXISTS for early termination of compressed uploads is the right approach. I think the only reason we didn't do that for uncompressed uploads is that it would have been a breaking change for Bazel, but if Bazel support for compressed uploads is experimental and we're OK with breaking it, we might as well break it in the right way. |
We decided on this mostly useless value in @nodirg's PR #193 but we missed the early-return corner case. As I see it we probably have two kinds of clients that support compressed-blobs bytestream uploads:
|
Per comments in the meeting: we believe that taking the less-disruptive course of returning -1 is a good idea here. The semantics of compressed vs. uncompressed bytes are already a mess, but we can't meaningfully clean them up until we move away from Bytestream. Note that returning -1 still breaks Bazel's (experimental) support for compressed uploads. |
Adding Sven for approval from Bazel before merging. He can stand in for Chi, who is having trouble with his Github account. |
I see there has been a consensus towards using -1, so I might be late to the party. FWIW, I'd also lean towards using AlreadyExists, which seems the most semantically correct solution here.
@mostynb Could you elaborate on why that is? It seems like it could allow using AlreadyExists while keeping backward compatibility with existing clients. |
Using the bytestream metadata came up early in the compressed-blobs feature planning discussion for other purposes, but IIRC it was considered too free-form. We don't know what, if anything, existing clients use it for, so we can't avoid causing potential conflicts. Another point against using the AlreadyExists error code came up in the meeting: apparently it was used in an early (pre v1?) version of the spec, and was removed for some reason that nobody can quite remember, but it was significant enough to make people accept a breaking change. |
This is an implementation of this REAPI spec update: bazelbuild/remote-apis#213 Which is part of the solution to this issue: bazelbuild/bazel#14654
This is an implementation of this REAPI spec update: bazelbuild/remote-apis#213 Here's a bazel-remote build that can be used to test this change: buchgr/bazel-remote#527 Fixes bazelbuild#14654
Here's a bazel-remote PR that can be used to test this change: buchgr/bazel-remote#527 |
Sorry for the delay. My GitHub account was suspended for no reason, so I lost all the notifications during that period.
LGTM from Bazel side.
This is an implementation of this REAPI spec update: bazelbuild/remote-apis#213 Here's a bazel-remote build that can be used to test this change: buchgr/bazel-remote#527 Fixes #14654 Closes #14870. PiperOrigin-RevId: 430167812
This is an implementation of this REAPI spec update: bazelbuild/remote-apis#213 Here's a bazel-remote build that can be used to test this change: buchgr/bazel-remote#527 Fixes bazelbuild#14654 Closes bazelbuild#14870. PiperOrigin-RevId: 430167812 (cherry picked from commit d184e48)
This is an implementation of this REAPI spec update: bazelbuild/remote-apis#213 Here's a bazel-remote build that can be used to test this change: buchgr/bazel-remote#527 Fixes #14654 Closes #14870. PiperOrigin-RevId: 430167812 (cherry picked from commit d184e48) Co-authored-by: Mostyn Bramley-Moore <[email protected]>
The fix has landed in bazel now, any objections to merging this so I can make a new bazel-remote release? |