BlobReader not buffering properly. #462
@Megabytemb Thanks for the report! As a first pass, I tried reproducing the issue without using […]. FWIW, my test never needs to call […].
Sorry for the silence on this issue, but I've been doing some more troubleshooting on exactly when the buffer is invalidated. I think the issue comes down to a combination of seeking to the beginning and end of the file, and which `whence` value is used. As a play-through example: […]
I haven't cracked exactly how the seeking relationship should work yet, but I'm finding more reasons why it's failing.
OK, getting closer. It still looks like the built-in tell() is involved. I've reimplemented tell() by simply doing the following: […]
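(The reporter's actual snippet did not survive in this copy of the thread. As a hedged illustration of the idea only: a reader can answer tell() from tracked state instead of issuing a seek, so the call can never disturb an internal buffer. The class and attribute names below are hypothetical stand-ins, not the google-cloud-storage implementation.)

```python
import io


class TellWithoutSeek:
    """Toy reader: tell() reports the position from tracked state
    instead of issuing seek(0, 1), so it can never touch a buffer."""

    def __init__(self, raw):
        self._raw = io.BytesIO(raw)
        self._position = 0  # updated on every read

    def read(self, size=-1):
        data = self._raw.read(size)
        self._position += len(data)
        return data

    def tell(self):
        return self._position  # no seek() involved
```

A reimplementation along these lines sidesteps the question of whether seek(0, 1) invalidates anything, because tell() never calls seek() at all.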
The above has stopped the GCS BlobReader cache from constantly being thrashed and rebuilt on every chunk request from the GoogleApiClient. It's still downloading the file from Cloud Storage about 4 times, rather than once, but I'm getting closer. Any input would be really appreciated.
Hi, sorry for the significant delay. It looks like GoogleAPIClient, as you've noted, is doing a seek to the end of the file during its init() to find out how long the file is. This does erase the buffer, because once you seek to the end of the file we assume the buffer for an earlier spot is no longer needed. I don't yet see why that causes a problem here, though, since your download should not have started at all at the time the MediaIoBaseUpload is created.

What MediaIoBaseUpload's constructor will do is force a reload() on the blob, because the blob's metadata has not yet been downloaded, so the blob's length is unknown. You can skip this reload() by using bucket.get_blob() to fetch the blob's metadata ahead of time.

The built-in tell() should be using a relative seek() that changes the position by 0 bytes to find out where it is. This is our primary suspect. I don't think a seek() that doesn't change the position should ever invalidate our buffer, but if reimplementing tell() fixed it for you, maybe that is what is happening in this case.

Given the delay in response I'll certainly understand if you have totally moved on, but if you are still interested, please let me know and we can resume digging.
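As an aside, the invariant described above (a relative seek of 0 bytes must not drop the buffer) can be sketched with a toy stdlib reader. The names, chunk logic, and download counter here are illustrative assumptions for the sketch, not the library's actual code:

```python
import io


class ChunkedReader:
    """Toy chunked reader: a seek() that lands inside the already
    buffered range reuses the buffer instead of discarding it."""

    def __init__(self, data, chunk_size=8):
        self._data = data
        self._chunk_size = chunk_size
        self._buffer_start = 0       # absolute offset of buffer start
        self._buffer = io.BytesIO()  # currently "downloaded" chunk
        self.downloads = 0           # count of simulated chunk downloads

    def _buffer_end(self):
        return self._buffer_start + len(self._buffer.getvalue())

    def read(self, size):
        out = self._buffer.read(size)
        while len(out) < size and self._buffer_end() < len(self._data):
            # Buffer exhausted: simulate downloading the next chunk.
            self._buffer_start = self._buffer_end()
            chunk = self._data[self._buffer_start:self._buffer_start + self._chunk_size]
            self._buffer = io.BytesIO(chunk)
            self.downloads += 1
            out += self._buffer.read(size - len(out))
        return out

    def tell(self):
        return self._buffer_start + self._buffer.tell()

    def seek(self, offset, whence=0):
        if whence == 1:
            offset += self.tell()
        elif whence == 2:
            offset += len(self._data)
        if self._buffer_start <= offset <= self._buffer_end():
            # Target is inside the buffered range: keep the buffer.
            self._buffer.seek(offset - self._buffer_start)
        else:
            # Only a seek outside the buffered range drops the buffer.
            self._buffer_start = offset
            self._buffer = io.BytesIO()
        return offset
```

With this guard in place, a seek(0, 1) issued by a tell() implementation stays inside the buffered range and costs nothing.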
@andrewsg I also ran into similar problems. The code below is an easy reproduction that causes an invalidation once per loop: […]
The goal of the code is straightforward: read a remote tar stored on GCS. The code hits the issue when tarfile internally calls tell() to get tar offsets. You can replace the bucket and tar with any you like. The above code was run on the following: […]
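To confirm that tarfile really does call tell() on the underlying file object while it walks an archive, here is a small self-contained stdlib check (the counting wrapper is my own, purely for illustration):

```python
import io
import tarfile

# Build a tiny tar archive in memory.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    payload = b"hello"
    info = tarfile.TarInfo(name="a.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


class TellCounter(io.BytesIO):
    """BytesIO wrapper that counts how often tarfile asks for tell()."""

    def __init__(self, data):
        super().__init__(data)
        self.tell_calls = 0

    def tell(self):
        self.tell_calls += 1
        return super().tell()


reader = TellCounter(archive.getvalue())
with tarfile.open(fileobj=reader, mode="r") as tar:
    names = tar.getnames()

print(names, reader.tell_calls)  # tell() is used while locating members
```

Each of those tell() calls lands on the BlobReader when the fileobj is a blob opened from GCS, which is why a tell() that disturbs the buffer forces a re-download per member.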
Hey @andrewsg, the sample code I provided is exactly what I implemented in my production code, and it's worked beautifully for me.
@andrewsg I have provided a PR that fixes the issue, as well as a test that breaks under the current implementation.
@Megabytemb Thanks again for your original report. I am looking at accepting @allenc97's PR, which should resolve the issue. However, I suspect that accepting this PR will break your solution that accesses […].
* tests (fileio): add tarfile-based test case for reader seek. Background info: #462
* fix: Patch blob reader to return correct seek/tell values (#462)

Co-authored-by: Andrew Gorcester <[email protected]>
Environment details
Python 3.9.5
pip 21.1.1
google-cloud-storage 1.38.0
Steps to reproduce
When trying to stream a file from Google Cloud Storage to Google Drive, the BlobReader doesn't appear to be buffering properly.

Reading through the BlobReader code, it should buffer the file according to chunk_size, then download new chunks as that buffer is exhausted. However, in my experience, every time the BlobReader is read a second time, it invalidates the buffer and downloads a new chunk.

The Google API MediaIoBaseUpload appears to be requesting the file in 8192-byte chunks, and every time the next chunk is requested from the GCS BlobReader, it downloads the next 40 MB chunk rather than reading from the buffer. My debugging has found that the buffer is actually being invalidated when the Python HTTP class seeks to the next chunk, and the math is failing here; however, I'm unsure what should be happening.
The code example below demonstrates the problem; just provide your own client credentials for Drive, upload a CSV file to Google Cloud Storage, and note the blob and bucket names.
Code example
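(The original example did not survive in this copy of the thread. For orientation only, a stream-from-GCS-to-Drive setup along the lines described above might look like the sketch below. The bucket, object, and file names are hypothetical, credentials handling is omitted, and this is not the reporter's actual code.)

```python
from google.cloud import storage
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseUpload

# Hypothetical bucket/object names; substitute your own.
gcs = storage.Client()
blob = gcs.bucket("my-bucket").blob("data.csv")

# Open a BlobReader with a 40 MB buffer; ideally each next_chunk()
# below should be served from this buffer, not a fresh GCS download.
reader = blob.open("rb", chunk_size=40 * 1024 * 1024)

drive = build("drive", "v3")  # assumes application default credentials
media = MediaIoBaseUpload(reader, mimetype="text/csv", resumable=True)
request = drive.files().create(body={"name": "data.csv"}, media_body=media)

response = None
while response is None:
    status, response = request.next_chunk()
```

Running a sketch like this against a large CSV and watching the GCS request logs is what surfaces the repeated 40 MB downloads described above.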
Example Logs