
VERY slow large blob downloads #10572

Closed
argonaut76 opened this issue Mar 30, 2020 · 23 comments
Labels
  • bug: This issue requires a change to an existing behavior in the product in order to be resolved.
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • Service Attention: Workflow: This issue is responsible by Azure service team.
  • Storage: Storage Service (Queues, Blobs, Files)

Comments

@argonaut76

argonaut76 commented Mar 30, 2020

I am confused about how to optimize BlobClient for downloading large blobs (up to 100 GB).

For example, on a ~480 MB blob the following code takes around 4 minutes to execute:

full_path_to_file = '{}/{}'.format(staging_path,blob_name)
blob = BlobClient.from_connection_string(conn_str=connection_string, container_name=container_name, blob_name=blob_name)
with open(full_path_to_file, "wb") as my_blob:
    download_stream = blob.download_blob()
    result = my_blob.write(download_stream.readall())

In the previous version of the SDK I was able to specify a max_connections parameter that sped up downloads significantly. This appears to have been removed (along with progress callbacks, which is annoying). I have files upwards of 99 GB, which will take almost 13 hours to download at this rate, whereas I used to be able to download similar files in under two hours.

How can I optimize the download of large blobs?

Thank you!

Edit: I meant that it took 4 minutes to download a 480 megabyte file. Also, I am getting memory errors when trying to download larger files (~40 GB).

@kaerm added the Client, customer-reported, and Storage labels Mar 31, 2020
@kaerm
Contributor

kaerm commented Mar 31, 2020

@argonaut76 thanks for reaching out to us, I am tagging the right team to take a look at this @xgithubtriage

@argonaut76
Author

argonaut76 commented Mar 31, 2020

Thanks, @kaerm. After some digging in the docs I switched over to readinto:

for blob_name in blob_names:
    full_path_to_file = '{}/{}'.format(staging_path,blob_name)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    with open(full_path_to_file, "wb") as fp:
        blob_client.download_blob().readinto(fp)

Which seems to have helped a bit with the speed and resolved the memory error. However, speeds are still slow and the downloads now hang periodically. For instance, I've been stuck at 1.6 GB on a 44 GB file for over half an hour.

Update: After an hour and ten minutes the download on this file failed with the following error:
ERROR: <class 'azure.core.exceptions.HttpResponseError'>

Update 2: Tried again on a local machine (vs. a cloud VM) and had the same problems - slow and hung downloads. Additional error information:

Traceback (most recent call last):
  File "/home/jason/.local/lib/python3.6/site-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/home/jason/.local/lib/python3.6/site-packages/urllib3/response.py", line 507, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/lib/python3.6/http/client.py", line 459, in read
    n = self.readinto(b)
  File "/usr/lib/python3.6/http/client.py", line 503, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jason/.local/lib/python3.6/site-packages/requests/models.py", line 751, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/home/jason/.local/lib/python3.6/site-packages/urllib3/response.py", line 564, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/home/jason/.local/lib/python3.6/site-packages/urllib3/response.py", line 529, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/jason/.local/lib/python3.6/site-packages/urllib3/response.py", line 430, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='magasfuturescout.blob.core.windows.net', port=443): Read timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jason/.local/lib/python3.6/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 123, in __next__
    chunk = next(self.iter_content_func)
  File "/home/jason/.local/lib/python3.6/site-packages/requests/models.py", line 758, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='magasfuturescout.blob.core.windows.net', port=443): Read timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jason/.local/lib/python3.6/site-packages/azure/storage/blob/_download.py", line 47, in process_content
    content = b"".join(list(data))
  File "/home/jason/.local/lib/python3.6/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 140, in __next__
    if resp.status_code == 416:
AttributeError: 'PipelineResponse' object has no attribute 'status_code'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "/home/jason/.local/lib/python3.6/site-packages/azure/storage/blob/_download.py", line 560, in readinto
    downloader.process_chunk(chunk)
  File "/home/jason/.local/lib/python3.6/site-packages/azure/storage/blob/_download.py", line 125, in process_chunk
    chunk_data = self._download_chunk(chunk_start, chunk_end - 1)
  File "/home/jason/.local/lib/python3.6/site-packages/azure/storage/blob/_download.py", line 203, in _download_chunk
    chunk_data = process_content(response, offset[0], offset[1], self.encryption_options)
  File "/home/jason/.local/lib/python3.6/site-packages/azure/storage/blob/_download.py", line 49, in process_content
    raise HttpResponseError(message="Download stream interrupted.", response=data.response, error=error)
azure.core.exceptions.HttpResponseError: Download stream interrupted.

So it's a timeout problem, but what could be causing that?

@xiafu-msft
Contributor

Hi @argonaut76

Thanks for reaching out!
The parameter max_connections has been renamed to max_concurrency. Please try again with that parameter!
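
For example, a minimal sketch reusing the variables from the original post (connection_string, container_name, blob_name, and full_path_to_file are assumed to be defined; the value 8 is only illustrative):

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(conn_str=connection_string, container_name=container_name, blob_name=blob_name)
with open(full_path_to_file, "wb") as my_blob:
    blob.download_blob(max_concurrency=8).readinto(my_blob)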

@argonaut76
Author

argonaut76 commented Mar 31, 2020

Thanks, @xiafu-msft. A few questions:

  1. Where is max_concurrency set?
  2. How do I go about setting a good value? Is it based on available threads?
  3. Will setting this parameter fix the hanging issues?

Edit: Sorry, meant where is max_concurrency set?

@mockodin

mockodin commented Apr 5, 2020

I experienced timeouts on larger downloads as well: >100GB commonly, and >200GB would always fail when using .readall() (more on that below). Of note, max_concurrency did NOT resolve this for me. For me it seems that the Auth header timestamp got older than the accepted 25-minute age limit, so the client isn't updating the header automatically. I was able to work around it, in an ugly manner:

  1. Download in 1GB Range-Based Chunking
    download_blob(offset=start, length=end).download_to_stream(MemBlob, max_concurrency=12)
  2. Override the retry settings on BlobServiceClient.from_connection_string() so it fails immediately (the default retries might be the cause of the timeout to begin with)
  3. Validate the segment size is the size received
  4. If an exception is thrown or the segment is not the expected size (the last segment will almost always be smaller, of course), then reauth and retry that segment

Rinse and repeat till the download completes. Note that I build a checksum as I download, since I know the checksum of the original file, so I have high confidence in file integrity and can validate at the end (a rough sketch of the loop follows below).
Performance-wise, on a 1 Gbps link for a single blob out of cool storage I get ~430 Mbps / 53.75 MB/s. The Azure-side cool tier limit is 60 MB/s or thereabouts, so it seems to work pretty well.
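
Roughly, the loop described above might look like this (a sketch only: conn_str, container_name, blob_name, and dest_path are assumed to be defined, the 1 GB segment size follows the description, retry_total=0 disables SDK retries, and "reauth" is simplified to recreating the client):

import hashlib

from azure.storage.blob import BlobServiceClient

SEGMENT = 1024 * 1024 * 1024  # 1 GB ranges, as described above

def fresh_client():
    # retry_total=0 makes the SDK fail fast instead of sitting in retries
    service = BlobServiceClient.from_connection_string(conn_str, retry_total=0)
    return service.get_blob_client(container=container_name, blob=blob_name)

client = fresh_client()
size = client.get_blob_properties().size
checksum = hashlib.md5()

with open(dest_path, "wb") as out:
    pos = 0
    while pos < size:
        length = min(SEGMENT, size - pos)
        try:
            data = client.download_blob(offset=pos, length=length,
                                        max_concurrency=12).readall()
        except Exception:
            client = fresh_client()  # "reauth" and retry this segment
            continue
        if len(data) != length:      # validate the segment size before accepting it
            client = fresh_client()
            continue
        out.write(data)
        checksum.update(data)
        pos += length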

@xiafu-msft
Contributor

Hi @mockodin
Thanks for your workaround!
I guess setting read_timeout to a large number of seconds when you initiate BlobServiceClient or BlobClient would help! Currently read_timeout defaults to 2000. Setting max_concurrency would probably also help increase the speed. Meanwhile, we have found some places we want to optimize for download. Sorry about the inconvenience.

@xiafu-msft
Contributor

xiafu-msft commented Apr 6, 2020

Hi @argonaut76

  1. blob_client.download_blob(max_concurrency=3).readinto()
  2. Currently we don't have a recommended value; you could try 5 first and see if it helps.
  3. I think setting read_timeout to a large number of seconds when you initiate BlobServiceClient or BlobClient will fix the hanging, e.g. BlobClient(account_url, container_name, blob_name, credential, read_timeout=8000).
    I guess it was hanging because read_timeout currently defaults to 2000, which is roughly half an hour; after that it retries, so it sits there for an hour. (A short sketch putting these together is below.)
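
Put together, a short sketch of those three points (account_url, container_name, blob_name, credential, and a local dest_path are assumed to be defined; the concrete numbers are only examples):

from azure.storage.blob import BlobClient

blob_client = BlobClient(account_url, container_name, blob_name,
                         credential=credential, read_timeout=8000)

with open(dest_path, "wb") as fp:
    blob_client.download_blob(max_concurrency=5).readinto(fp)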

@Petermarcu added the Service Attention label Apr 17, 2020
@ghost

ghost commented Apr 17, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

@lmazuel added the question label May 4, 2020
@xiafu-msft
Contributor

It looks like this issue is resolved, so we will close it for now. Feel free to reopen it if you still have the problem or have any questions!

@oersted

oersted commented Dec 8, 2020

I'm still definitely experiencing this issue with azure-storage-blob==12.6.0. The symptoms are exactly the same as described above: most large downloads get stuck and eventually crash with an exception.

I haven't tried the workarounds suggested above just yet, but the issue is clearly not resolved, and it makes the library rather unusable in a production setting. I would suggest reopening the issue.

@xiafu-msft reopened this Dec 8, 2020
@xiafu-msft
Contributor

Hi @oersted

Thanks for reaching out. We've reopened the issue.
May I know:

  1. the size of the blob you are trying to download
  2. when the crash happens (e.g. 15 min after you started the download)
  3. your network speed

so I can try to reproduce this?

@ivanst0

ivanst0 commented Dec 9, 2020

I've also seen the socket.timeout: The read operation timed out error happen intermittently over the last few months. The last failure happened yesterday while downloading a relatively small block blob (7 MB). This was part of a DevOps pipeline executing on a Microsoft-hosted build agent. I don't have exact timing, as the pipeline step downloads multiple blobs. It seems that even blobs which were retrieved successfully were downloading much slower than usual (downloading all the blobs usually takes about 3 min; this time the exception occurred 50 min after the download started).

@robertdavidrowland

robertdavidrowland commented Dec 9, 2020

This morning I had trouble pulling 0.8 GB files. The code appeared to hang with no response, leaving a 0-byte file on disk. I was using:

download_file.write(blob_client.download_blob().readall())

After reading the comments in the thread from April I switched this to:

blob_client.download_blob(max_concurrency=10).readinto(download_file)

For good measure I set read_timeout=7200 on the BlobServiceClient (from which I get the BlobClient). I found that after a few files my script would hang again; removing this made that problem go away 🤷

Files are now taking ~1 minute each to download, which is fine for our purposes.

@oersted

oersted commented Dec 15, 2020

@xiafu-msft Apologies for the late response. Actually, it looks like the symptom is not exactly the same. The download hangs right at the end, but checking the file size reveals that the full file was downloaded; the readinto call just doesn't return for larger files.

I am running some experiments to verify that the issue is not in my code and to provide you with file sizes where it happens / doesn't happen.

@oersted

oersted commented Dec 15, 2020

I have verified that it is indeed not my code, I have simplified the execution as much as possible to isolate the issue.

    container, blob = source_path[len('az://'):].split('/', 1)
    blob = BlobClient.from_connection_string(conn_str=conn_str, container_name=container, blob_name=blob)
    download_stream = blob.download_blob(max_concurrency=DOWNLOAD_THREADS)

    try:
        os.makedirs(os.path.dirname(target_path))
    except FileExistsError:
        pass

    with open(target_path, 'wb') as target_file:
        download_stream.readinto(target_file)

In terms of file sizes, here are some examples. The issue seems to be deterministic; as far as I know, the same files got stuck on every run.

Stuck:

  • 108GB
  • 67GB
  • 48GB
  • 43GB
  • 30GB
  • 17GB

Not Stuck:

  • 2.7GB
  • 291MB
  • 61MB
  • 21MB
  • 16MB
  • 8.4MB
  • 6MB
  • 5.1MB

Again, the download gets stuck right at the end: the full file is downloaded, but readinto does not return.

I might try reducing the concurrency later. I am using 160 download threads (I have 40 cores, so it's not unreasonable) and download speed is great, but based on the symptoms it feels like there is an issue joining all the threads at the end.

@Vasi-Shche

@mockodin

Performance wise on a 1Gbps link for a single blob out of cool storage I get ~430Mbps / 53.75MB/s. Azure side cool tier is 60MB/s limit or there about so it seems to work pretty well.

Where did you get the 60 MB/s limit for the cool tier? I don't see such numbers in the docs:
https://docs.microsoft.com/en-us/azure/storage/common/scalability-targets-standard-account
https://docs.microsoft.com/en-us/azure/storage/blobs/scalability-targets

Blob download speed seems to be limited by block size, the number of threads, and the network.
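
For illustration, here is a hedged sketch of how those three knobs map onto the v12 client settings (the values are only examples; conn_str, container_name, blob_name, and dest_path are assumed to be defined):

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    conn_str,
    max_single_get_size=32 * 1024 * 1024,  # size of the first GET before chunking starts
    max_chunk_get_size=8 * 1024 * 1024,    # size of each subsequent ranged GET
)
blob_client = service.get_blob_client(container=container_name, blob=blob_name)

with open(dest_path, "wb") as fp:
    # number of parallel range downloads ("threads")
    blob_client.download_blob(max_concurrency=8).readinto(fp)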

@mockodin

This was from older documentation (which did not specifically mention the cool tier, as I recall) and appears to have been removed with the newer tier options.

a segment of the old doc from ~2016: https://github.com/uglide/azure-content/blob/master/includes/azure-storage-limits.md
a reference to it being removed and why: MicrosoftDocs/azure-docs#27901

An updated doc on performance doesn't appear to be available at present; at least I didn't find one in a few minutes of searching. Presumably one exists somewhere, if for no other reason than that the backend cannot be infinite for a single blob: it lives on a shared array of disks somewhere, and not throttling would be unmanageable at scale (caches and load balancers only take you so far).

Note, too, that most of the high-speed numbers out there relate specifically to chunked parallel writes, or, for super high speed in either direction, to a file being spread across multiple blobs to achieve the multi-tens-of-Gbps figures that get referenced. That is impressive and certainly achievable, but not practical for many (most?) use cases.

Looking at Azure Files gives a fairly good guide to performance since that is blob storage on that backend.

@amishra-dev added the bug label and removed the question label Jan 29, 2021
@u-ashish

u-ashish commented Apr 7, 2021

Hi @xiafu-msft

I think I have a related issue... I'm running into the same issue on a roughly 41GB file.

This is using azure-storage-blob = "^12.8.0"

azure_storage = BlobServiceClient(
    account_name=AZURE_ACCOUNT_NAME,
    account_key=AZURE_ACCOUNT_KEY,
    account_url=ACCOUNT_URL,
    credential=AZURE_ACCOUNT_KEY,
    max_chunk_get_size=TRANSFER_CHUNK_SIZE,
    max_single_get_size=TRANSFER_CHUNK_SIZE,
    read_timeout=36000
)

container_client = azure_storage.get_container_client(
        container_name
)

for blob in container_client.list_blobs():
    blob_client = azure_storage.get_blob_client(
        container=container_name,
        blob=blob.name
    )
    data = blob_client.download_blob(max_concurrency=6)
    for chunk in data.chunks():
        # do stuff with each chunk

This works on < 500MB files fine, but with a 41GB file it fails after about 15-20 min saying:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/models.py", line 751, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 575, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 540, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 454, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/azure/storage/blob/_download.py", line 47, in process_content
    content = b"".join(list(data))
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 118, in __next__
    chunk = next(self.iter_content_func)
  File "/usr/local/lib/python3.8/site-packages/requests/models.py", line 754, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

I thought setting read_timeout in the BlobServiceClient would help, but this exception happened after 15-20 minutes of the process running.

  • Is the timeout not being respected? Or is the connection reset issue separate from the read timeout? I know that this can in general be a network issue for long requests on large files, but it happens pretty consistently at the 15-20 min mark, so I thought it might be related.
  • Should I be using something other than .chunks()? I noticed it isn't documented in the docs, but reading the code it seemed like what I needed to use (I am streaming the data and writing it elsewhere).
  • Is there something in the Azure networking middleware that's terminating the connection after this window?
  • Does it matter if the request is originating from AWS?

@u-ashish

u-ashish commented Apr 8, 2021

I also tried the following things, which did not help:

  • Tried to set a custom timeout during download_blob i.e.
data = blob_client.download_blob(
    max_concurrency=20,
    timeout=72000
)
  • Tried to play with lower and higher chunk sizes, from 4MB all the way to 100MB
  • Tried to tweak the max concurrency

It seems to happen on large (10GB+) files, but from my fiddling it feels guaranteed to happen on anything greater than 20GB.

@xiafu-msft
Contributor

xiafu-msft commented Apr 9, 2021

Hi @u-ashish
Thanks for the feedback! It seems we have found the problem; we are working on a fix and will get back to you soon. Sorry about the inconvenience. This is related to the change https://github.com/Azure/azure-sdk-for-python/pull/17078/files

@u-ashish

@xiafu-msft thanks so much for getting back to me -- that makes perfect sense. I'll keep an eye out for any related PRs.

@amishra-dev

The fix has been merged.

@delahondes

Building on @mockodin's fine remarks, I implemented a file-like object on top of the blob object, and it was very successful (it does not do the reauth trick he mentioned because I did not need that). The download speed was maybe ten times better using this iterator than the one included in the SDK. Many thanks to you, @mockodin!

import uuid
from io import BytesIO

from azure.storage.blob import BlobBlock

# Assumptions not shown in the original snippet: CHUNK_SIZE is the per-range
# download size, and forbid(name, mode) returns a callable that raises when an
# operation is not allowed for the given mode.
CHUNK_SIZE = 64 * 1024 * 1024  # example value


def forbid(name, mode):
    """Assumed helper: disallow `name` on an object opened in `mode`."""
    def _raise(*args, **kwargs):
        raise IOError("{} is not allowed in mode '{}'".format(name, mode))
    return _raise


class ObjectFile:
    """An ObjectFile in object storage that can be opened and closed.
    See Objects.open()"""
    def __init__(self, name, client, mode, size):
        """Initialize the ObjectFile with a name and a blob client.
        mode is 'w' or 'r'; size is the blob size.
        """
        self.name = name
        self.client = client
        self.block_list = []
        self.mode = mode
        self.__open__ = True
        if mode == 'r':
            self.write = forbid('write', 'r')
        elif mode == 'w':
            self.__iter__ = forbid('__iter__', 'w')
            self.read = forbid('read', 'w')
        self.pos = 0
        self.size = size


    def write(self, chunk):
        """Write a chunk of data (a part of the data) into the object"""
        block_id = str(uuid.uuid4())
        self.client.stage_block(block_id=block_id, data=chunk)
        self.block_list.append(BlobBlock(block_id=block_id))
    
    def close(self):
        """Finalise the object"""
        if self.mode=='w':
            self.client.commit_block_list(self.block_list)
        self.__open__=False

    def __del__(self):
        if self.__open__:
            self.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.__open__:
            self.close()

    def __iter__(self):
        self.pos=0

        #stream = self.client.download_blob(max_concurrency=10)
        return self

    def __next__(self):
        data = BytesIO()
        if self.pos>=self.size:
            raise StopIteration()
        elif self.pos+CHUNK_SIZE>self.size:
            size=self.size-self.pos
        else:
            size=CHUNK_SIZE
        self.client.download_blob(offset=self.pos, length=size
            ).download_to_stream(data, max_concurrency=12)
        self.pos += size
        return data.getvalue()
        
    def read(self, size=None):
        if size is None:
            return self.client.download_blob().readall()
        else:
            if self.pos >= self.size:
                return b''  # return bytes, consistent with the other reads
            elif self.pos+size>self.size:
                size=self.size-self.pos
            data = BytesIO()
            self.client.download_blob(offset=self.pos, length=size
                ).download_to_stream(data, max_concurrency=12)
            self.pos += size
            return data.getvalue()

@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023