Storage: google-resumable-media==0.4.0 Breaks Gzipped Downloads #9188

william-silversmith · 2019-09-06T23:02:37Z

Environment details


Google Storage blob.py

OS type and version

Ubuntu 14.04

Python version and virtual environment information: python --version

Python 3.6.8

google-cloud-<service> version: pip show google-<service> or pip freeze

google-cloud-storage==1.19.0

Steps to reproduce

pip install google-resumable-media==0.4.0
blob.download_as_string()

Per the latest release of google-resumable-media, no decompression of content-encoding gzip is performed and raw bytes are returned.

See https://github.com/googleapis/google-resumable-media-python/releases

blob.download_as_string() formerly returned decompressed bytes, and now returns compressed bytes. We are using .blob instead of .get_blob for an HPC application and thus have no way of knowing what the content encoding is as the information is erased.

We actually LIKE the new functionality as we can now decide when to decompress, but we need to know the content encoding to avoid various kinds of problems that would be introduced by speculative decompression.
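To illustrate why the encoding matters, here is a minimal sketch of decompressing only when the reported encoding calls for it, rather than guessing from the bytes. The helper name `decode_payload` is hypothetical and not part of google-cloud-storage; it only assumes the caller has somehow obtained the content encoding.

```python
import gzip


def decode_payload(data, content_encoding):
    """Return decompressed bytes when the reported encoding is gzip.

    `content_encoding` is whatever the caller learned from the server
    (for example a Content-Encoding header); None means "stored as-is".
    """
    if content_encoding is None:
        return data
    encoding = content_encoding.strip().lower()
    if encoding in ("", "identity"):
        return data
    if encoding == "gzip":
        return gzip.decompress(data)
    # Refuse to guess for encodings this sketch does not handle.
    raise ValueError("unsupported content encoding: %r" % content_encoding)


raw = gzip.compress(b"voxel data")
decoded = decode_payload(raw, "gzip")    # decompressed on request
untouched = decode_payload(raw, None)    # raw bytes passed through
```

Without the encoding, the only alternative is to inspect the payload itself, which is the speculative decompression described above.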

Code example

Here is our desired functionality.

    blob = bucket.blob( key )
    try:
      # blob handles the decompression so the encoding is None
      resp = blob.download_as_string(start=start, end=end)
      return resp, blob.content_encoding
    except google.cloud.exceptions.NotFound as err:
      return None, None

Adding this patch to google.cloud.storage.blob.py would solve this problem for us:

    def _do_download(
        self, transport, file_obj, download_url, headers, start=None, end=None
    ):
        """Perform a download without any error handling.

        This is intended to be called by :meth:`download_to_file` so it can
        be wrapped with error handling / remapping.

        :type transport:
            :class:`~google.auth.transport.requests.AuthorizedSession`
        :param transport: The transport (with credentials) that will
                          make authenticated requests.

        :type file_obj: file
        :param file_obj: A file handle to which to write the blob's data.

        :type download_url: str
        :param download_url: The URL where the media can be accessed.

        :type headers: dict
        :param headers: Optional headers to be sent with the request(s).

        :type start: int
        :param start: Optional, the first byte in a range to be downloaded.

        :type end: int
        :param end: Optional, The last byte in a range to be downloaded.
        """
        if self.chunk_size is None:
            download = Download(
                download_url, stream=file_obj, headers=headers, start=start, end=end
            )
            # Proposed change: keep the response and record its Content-Encoding.
            response = download.consume(transport)
            if 'Content-Encoding' in response.headers:
                self.content_encoding = response.headers['Content-Encoding']
        else:
            download = ChunkedDownload(
                download_url,
                self.chunk_size,
                file_obj,
                headers=headers,
                start=start if start else 0,
                end=end,
            )

            while not download.finished:
                download.consume_next_chunk(transport)


tseaver · 2019-09-09T16:30:06Z

@william-silversmith Can you achieve what you need by reloading the blob's metadata? E.g.:

blob.reload()

At that point, the content_encoding property will be populated from the server.
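As a rough sketch of this suggestion, the reload can be combined with the download in one helper. The function name `download_with_encoding` and the `FakeBlob` class are hypothetical; `FakeBlob` is only a stand-in so the example runs without network access, and a real `google.cloud.storage.Blob` is assumed to expose the same `reload()`, `download_as_string()` and `content_encoding` members.

```python
def download_with_encoding(blob):
    """Refresh metadata, then fetch the payload.

    Returns (data, content_encoding). `blob` is expected to behave like
    google.cloud.storage.Blob; reload() costs one extra request.
    """
    blob.reload()
    data = blob.download_as_string()
    return data, blob.content_encoding


class FakeBlob:
    """Hypothetical stand-in for a Blob; counts the requests it would make."""

    def __init__(self):
        self.content_encoding = None
        self.requests_made = 0

    def reload(self):
        self.requests_made += 1
        self.content_encoding = "gzip"

    def download_as_string(self):
        self.requests_made += 1
        return b"payload"


fake = FakeBlob()
data, encoding = download_with_encoding(fake)
```

Note that this approach issues two requests per object: one for the metadata and one for the data.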

william-silversmith · 2019-09-09T16:52:36Z

Unfortunately, this strategy would significantly reduce IO performance. Google support worked with us a few months ago to figure out how to reduce the number of requests (hundreds of millions of files). Originally we were using get_blob, which generated an additional request.

We're handling a petabyte of 3D image data with a random access requirement. We do this by chunking the image into a regular grid of files. In order to make this more affordable, some lossless compression is desirable. The reason it would be desirable to control when decompression occurs is that our method for transferring datasets currently requires decompressing and recompressing, which seems a waste.

I do think it's worth being a bit more explicit though:

1. The updated resumable media is breaking existing functionality for all users that use gzip (probably a lot!), as raw bytes are now returned.
2. It was broken in the spirit of letting users decide what to do with the data (a new feature), but blob is stripping that info away.

I would be happy with the old functionality being restored, but if there's some option to decide when decompression occurs without additional network overhead, that would be even better.


tseaver · 2019-09-09T20:42:45Z

@william-silversmith Thanks for clarifying. One issue here is that we would want to have the header-driven content_encoding value also available for chunked downloads, so we would need to rework your patch a bit.
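One possible shape for the chunked case, as a sketch only: capture the header from the first chunk response inside the existing loop. This assumes that `consume_next_chunk` returns the HTTP response and that the server reports `Content-Encoding` on ranged responses; neither is confirmed in this thread. `consume_chunks`, `FakeResponse` and `FakeChunkedDownload` are hypothetical names used for illustration.

```python
def consume_chunks(download, transport):
    """Drain a chunked download; return the first Content-Encoding seen."""
    content_encoding = None
    while not download.finished:
        # Assumption: the call returns a response object with .headers.
        response = download.consume_next_chunk(transport)
        if content_encoding is None:
            content_encoding = response.headers.get("Content-Encoding")
    return content_encoding


class FakeResponse:
    """Hypothetical response carrying only the headers the sketch reads."""

    def __init__(self, headers):
        self.headers = headers


class FakeChunkedDownload:
    """Hypothetical stand-in that yields a fixed number of chunks."""

    def __init__(self, chunks):
        self.remaining = chunks

    @property
    def finished(self):
        return self.remaining == 0

    def consume_next_chunk(self, transport):
        self.remaining -= 1
        return FakeResponse({"Content-Encoding": "gzip"})


fake_download = FakeChunkedDownload(chunks=3)
seen_encoding = consume_chunks(fake_download, transport=None)
```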

@crwilcox, @frankyn Please chime in.

frankyn · 2019-09-17T14:20:17Z

@crwilcox reverted the changes made in googleapis/google-resumable-media-python#103 and is releasing a new version to unblock this issue: googleapis/google-resumable-media-python#104

Reassigning to him.

crwilcox · 2019-09-17T16:49:24Z

Hi @william-silversmith, I have released v0.4.1, which backs out this change.

Due to the way we have pinned this package within google-cloud-storage, we are rethinking the way we make this change to avoid disrupting folks using existing libraries.

william-silversmith · 2019-09-17T17:11:12Z

Thank you very much!

yoshi-automation added the triage me I really want to be triaged. label Sep 7, 2019

tseaver changed the title ~~storage.blob: google-resumable-media==0.4.0 Breaks Gzipped Downloads~~ Storage: google-resumable-media==0.4.0 Breaks Gzipped Downloads Sep 10, 2019

tseaver assigned frankyn Sep 11, 2019

frankyn added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Sep 17, 2019

frankyn assigned crwilcox and unassigned frankyn Sep 17, 2019

yoshi-automation added 🚨 This issue needs some love. triage me I really want to be triaged. labels Sep 17, 2019

crwilcox mentioned this issue Sep 17, 2019

Reimplement changes undone by #103 (always use raw response data) googleapis/google-resumable-media-python#106

Closed

crwilcox closed this as completed Sep 17, 2019

william-silversmith mentioned this issue Oct 27, 2019

Storage: Cannot download brotli compressed/encoded files from cloud storage #9003

Closed
