VERY slow large blob downloads #10572
@argonaut76 thanks for reaching out to us. I am tagging the right team to take a look at this: @xgithubtriage |
Thanks, @kaerm. After some digging in the docs I switched over to readinto:
That seems to have helped a bit with the speed and resolved the memory error. However, speeds are still slow and the downloads now hang periodically. For instance, I've been stuck at 1.6 GB on a 44 GB file for over half an hour.
Update: after an hour and ten minutes the download of this file failed with the following error:
Update 2: I tried again on a local machine (vs. a cloud VM) and had the same problems: slow and hung downloads. Additional error information:
So it's a timeout problem, but what could be causing that? |
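For context, a minimal sketch of the readinto-based download being described above; the connection string, container, and blob names are placeholders, not the poster's actual code:

```python
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    "<connection-string>", container_name="mycontainer", blob_name="large.bin"
)

# readinto() streams the payload into an open file object instead of
# buffering the whole blob in memory the way readall() does.
with open("large.bin", "wb") as f:
    blob.download_blob().readinto(f)
```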
Hi @argonaut76 Thanks for reaching out! |
Thanks, @xiafu-msft. A few questions:
Edit: Sorry, I meant: where is max_concurrency set? |
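To answer that question with a hedged sketch (reusing the `blob` client from the earlier example): max_concurrency is passed per call rather than set on the client.

```python
# max_concurrency is a keyword argument to download_blob(); it controls how
# many parallel ranged GET requests are used for the chunked transfer.
with open("large.bin", "wb") as f:
    blob.download_blob(max_concurrency=8).readinto(f)
```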
I experienced timeouts on larger downloads as well when using .readall(): >100 GB commonly failed and >200 GB would always fail (more on that below). Of note, max_concurrency did NOT resolve this for me. In my case it seems the Auth header timestamp got older than the accepted 25-minute age limit, so the client isn't refreshing the header automatically. I was able to work around it, in an ugly manner.
Rinse and repeat until the download completes. Note that I build a checksum as I download; since I know the checksum of the original file, I have high confidence in file integrity and can validate it at the end. |
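A rough sketch of the kind of loop being described: ranged downloads, re-creating the client so requests carry a fresh Authorization header, and a running checksum. The chunk size, helper name, and connection details are illustrative, not @mockodin's actual code.

```python
import hashlib

from azure.storage.blob import BlobClient

CHUNK = 64 * 1024 * 1024  # 64 MiB per ranged request (illustrative)


def fresh_client():
    # A newly constructed client signs requests with a fresh Authorization
    # header, sidestepping the ~25-minute age limit mentioned above.
    return BlobClient.from_connection_string(
        "<connection-string>", container_name="mycontainer", blob_name="large.bin"
    )


client = fresh_client()
total = client.get_blob_properties().size
digest = hashlib.md5()

with open("large.bin", "wb") as out:
    pos = 0
    while pos < total:
        length = min(CHUNK, total - pos)
        try:
            data = client.download_blob(offset=pos, length=length).readall()
        except Exception:
            client = fresh_client()  # rinse and repeat after a failure
            continue
        out.write(data)
        digest.update(data)
        pos += length

# Compare digest.hexdigest() against the known checksum of the original file.
```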
Hi @mockodin |
Hi @argonaut76
|
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage. |
It looks like this issue is resolved, so we will close it for now. Feel free to reopen it if you still have the problem or have any questions! |
I'm still definitely experiencing this issue. I haven't tried the workarounds suggested above just yet, but the issue is clearly not resolved, and it makes the SDK rather unusable in a production setting. I would suggest reopening the issue. |
Hi @oersted, thanks for reaching out. We've reopened the issue so that I can try to reproduce this. |
I've also seen |
This morning I had trouble pulling 0.8 GB files. The code appeared to hang with no response, leaving a 0-byte file on disk. I was using:
After reading the comments in the thread from April I switched this to:
Files are now taking ~1 minute each to download, which is fine for our purposes. |
@xiafu-msft Apologies for the late response. Actually, it looks like the symptom is not exactly the same: the download hangs right at the end, but checking the file size reveals that the full file was downloaded. I am running some experiments to verify that the issue is not in my code and to provide you with file sizes where it happens / doesn't happen. |
I have verified that it is indeed not my code, I have simplified the execution as much as possible to isolate the issue.
In terms of file sizes, here are some examples. The issue seems to be deterministic, as far as I know, the same files got stuck on every run. Stuck:
Not Stuck:
Again, the download gets stuck right at the end, even though the full file is downloaded. I might try reducing the concurrency later: I am using 160 download threads (I have 40 cores, so it's not unreasonable) and download speed is great, but based on the symptoms it feels like there is an issue joining all the threads at the end. |
Where did you get the 60 MB/s limit for the cool tier? I don't see such numbers in the docs. Blob download speed seems to be limited by block size, number of threads, and network. |
This was from older documentation (which did not specifically call out the cool tier, as I recall) and appears to have been removed with the newer tier options. A segment of the old doc from ~2016: https://github.com/uglide/azure-content/blob/master/includes/azure-storage-limits.md. An updated performance doc doesn't appear to be available at present; at least I didn't find one in a few minutes of searching. Presumably one exists somewhere, if for no other reason than that the backend cannot be infinite for a single blob: the blob lives somewhere on a shared array of disks, and not throttling would be unmanageable at scale (caches and load balancers only take you so far). Note too that most of the high-speed figures out there relate specifically to chunked parallel writes, or, for very high speed in either direction, to a file spread across multiple blobs to reach the multi-tens-of-Gbps numbers that get referenced; that is impressive and certainly achievable, but not practical for many (most?) use cases. Looking at Azure Files gives a fairly good guide to performance, since that is blob storage on the same backend. |
Hi @xiafu-msft, I think I have a related issue. I'm running into the same problem on a roughly 41 GB file. This is using:
This works on < 500MB files fine, but with a 41GB file it fails after about 15-20 min saying:
I thought setting
|
I also tried the following things, which did not help:
It seems to happen on large files (10 GB+), and from my fiddling it feels guaranteed to happen on anything greater than 20 GB. |
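For what it's worth, a hedged sketch of the client-side knobs that affect large downloads in the v12 SDK: transfer sizes are set on the client constructor, concurrency on the call itself. The names and values below are placeholders, not @u-ashish's configuration.

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    "<connection-string>",
    max_single_get_size=32 * 1024 * 1024,  # size of the initial GET
    max_chunk_get_size=4 * 1024 * 1024,    # size of each subsequent ranged GET
)
blob = service.get_blob_client("mycontainer", "41gb-file.bin")

with open("41gb-file.bin", "wb") as f:
    blob.download_blob(max_concurrency=8).readinto(f)
```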
Hi @u-ashish |
@xiafu-msft thanks so much for getting back to me -- that makes perfect sense. I'll keep an eye out for any related PRs. |
The fix has been merged. |
Building on @mockodin's fine remarks, I implemented a file-like object on top of a blob client, and it was very successful (it does not do the re-auth trick he mentioned because I did not need that). Download speed improved maybe ten times using this iterator versus the one included in the SDK. Many thanks to you, @mockodin!

```python
import uuid
from io import BytesIO

from azure.storage.blob import BlobBlock

CHUNK_SIZE = 64 * 1024 * 1024  # bytes fetched per ranged download


def forbid(name, mode):
    # `forbid` was referenced but not defined in the original snippet;
    # this is a minimal stand-in that rejects calls invalid for the mode.
    def _forbidden(*args, **kwargs):
        raise IOError("%s() is not allowed in mode %r" % (name, mode))
    return _forbidden


class ObjectFile:
    """An ObjectFile in object storage that can be opened and closed.
    See Objects.open()."""

    def __init__(self, name, client, mode, size):
        """Initialize the object with a name and a BlobClient.
        mode is 'w' or 'r'; size is the blob size in bytes."""
        self.name = name
        self.client = client
        self.block_list = []
        self.mode = mode
        self.__open__ = True
        if mode == 'r':
            self.write = forbid('write', 'r')
        elif mode == 'w':
            self.__iter__ = forbid('__iter__', 'w')
            self.read = forbid('read', 'w')
        self.pos = 0
        self.size = size

    def write(self, chunk):
        """Write a chunk of data (a part of the blob) as a staged block."""
        block_id = str(uuid.uuid4())
        self.client.stage_block(block_id=block_id, data=chunk)
        self.block_list.append(BlobBlock(block_id=block_id))

    def close(self):
        """Finalize the object; in write mode, commit the staged blocks."""
        if self.mode == 'w':
            self.client.commit_block_list(self.block_list)
        self.__open__ = False

    def __del__(self):
        if self.__open__:
            self.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.__open__:
            self.close()

    def __iter__(self):
        self.pos = 0
        return self

    def __next__(self):
        """Download and return the next CHUNK_SIZE bytes of the blob."""
        if self.pos >= self.size:
            raise StopIteration()
        elif self.pos + CHUNK_SIZE > self.size:
            size = self.size - self.pos
        else:
            size = CHUNK_SIZE
        data = BytesIO()
        self.client.download_blob(offset=self.pos, length=size
                                  ).download_to_stream(data, max_concurrency=12)
        self.pos += size
        return data.getvalue()

    def read(self, size=None):
        """Read `size` bytes from the current position, or the whole blob."""
        if size is None:
            return self.client.download_blob().readall()
        if self.pos >= self.size:
            return b''
        if self.pos + size > self.size:
            size = self.size - self.pos
        data = BytesIO()
        self.client.download_blob(offset=self.pos, length=size
                                  ).download_to_stream(data, max_concurrency=12)
        self.pos += size
        return data.getvalue()
```
|
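A quick usage sketch for the class above (my own addition; connection details and file names are placeholders): in read mode, the iterator streams the blob to disk chunk by chunk.

```python
from azure.storage.blob import BlobClient

client = BlobClient.from_connection_string(
    "<connection-string>", container_name="mycontainer", blob_name="large.bin"
)
blob_size = client.get_blob_properties().size

# 'r' mode: iterate CHUNK_SIZE-sized pieces and write them to a local file.
with ObjectFile("large.bin", client, mode="r", size=blob_size) as obj:
    with open("large.bin", "wb") as out:
        for piece in obj:
            out.write(piece)
```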
I am confused about how to optimize BlobClient for downloading large blobs (up to 100 GB).
For example, on a ~480 MB blob the following code takes around 4 minutes to execute:
In the previous version of the SDK I was able to specify a max_connections parameter that sped up downloads significantly. This appears to have been removed (along with progress callbacks, which is annoying). I have files upwards of 99 GB that will take almost 13 hours to download at this rate, whereas I used to be able to download similar files in under two hours.
How can I optimize the download of large blobs?
Thank you!
Edit: I meant that it took 4 minutes to download a 480 megabyte file. Also, I am getting memory errors when trying to download larger files (~40 GB).
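Pulling the thread's answers together into one hedged sketch (placeholder names, not the poster's original code): in the v12 SDK the old max_connections parameter roughly corresponds to max_concurrency on download_blob(), and streaming with readinto() avoids the MemoryError seen with readall() on very large blobs.

```python
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    "<connection-string>", container_name="mycontainer", blob_name="99gb-file.bin"
)

# Parallel chunked download, streamed straight to disk.
with open("99gb-file.bin", "wb") as f:
    blob.download_blob(max_concurrency=8).readinto(f)
```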