Large file downloads hang indefinitely #28713

Closed
YuvalItzchakov opened this issue Feb 9, 2023 · 7 comments
Labels: customer-reported, issue-addressed, question, Storage (Queues, Blobs, Files)

Comments


YuvalItzchakov commented Feb 9, 2023

  • Package Name:
    azure-core 1.25.1
    azure-storage-blob 12.8.1
    azure-storage-common 2.1.0
    azure-storage-file 2.1.0

  • Operating System

    • Debian GNU/Linux 11 (bullseye)
    • MacOS Monterey 12.5.1
  • Python Version:
    Python 3.7.13 (Server) / Python 3.8.13 (Local)

Describe the bug
While trying to download large files (>= 17 GB), both the Python SDK and the Azure CLI get stuck before finishing the download, usually at around 99.3%-99.5%. This is due to some of the downloaded chunks hanging and never timing out / retrying.

While trying to download a file using BlobClient.download_blob with the following parameters:

  • max_concurrency = 20
  • timeout = 1800

The chunks that are stuck yield the following stack trace (this is from the live, stuck process):

Locally:

Thread 0x2A704F000 (idle): "ThreadPoolExecutor-0_6"
    read (ssl.py:1130)
        Arguments:
            self: <SSLSocket at 0x1057e5c00>
            len: 8192
            buffer: <memoryview at 0x1043ea740>
    recv_into (ssl.py:1274)
        Arguments:
            self: <SSLSocket at 0x1057e5c00>
            buffer: <memoryview at 0x1043ea740>
            nbytes: 8192
            flags: 0
    readinto (socket.py:705)
        Arguments:
            self: <SocketIO at 0x105003c70>
            b: <memoryview at 0x1043ea740>
    read (http/client.py:465)
        Arguments:
            self: <HTTPResponse at 0x10501e080>
            amt: 4096
    read (urllib3/response.py:519)
        Arguments:
            self: <HTTPResponse at 0x1050001c0>
            amt: 4096
            decode_content: True
            cache_content: False
        Locals:
            flush_decoder: False
            fp_closed: False
    stream (urllib3/response.py:576)
        Arguments:
            self: <HTTPResponse at 0x1050001c0>
            amt: 4096
            decode_content: True
        Locals:
            data: <bytes at 0x106225600>
    generate (requests/models.py:758)
        Locals:
            chunk: <bytes at 0x106225600>
    next (azure/core/pipeline/transport/_requests_basic.py:169)
        Arguments:
            self: <StreamDownloadGenerator at 0x105002c20>
        Locals:
            internal_response: <Response at 0x105003e20>
    process_content (azure/multiapi/storagev2/blob/v2021_06_08/_download.py:52)
        Arguments:
            data: <StreamDownloadGenerator at 0x105002c20>
            start_offset: 0
            end_offset: 0
            encryption: {"required": False, "key": None, "resolver": None}
    _download_chunk (azure/multiapi/storagev2/blob/v2021_06_08/_download.py:217)
        Arguments:
            self: <_ChunkDownloader at 0x104406470>
            chunk_start: 2835349504
            chunk_end: 2839543807
        Locals:
            download_range: (2835349504, 2839543807)
            offset: (0, 0)
            range_header: "bytes=2835349504-2839543807"
            range_validation: None
            retry_active: True
            retry_total: 3
            _: "primary"
            response: <StreamDownloadGenerator at 0x105002c20>

Remote (server):

Thread 21134 (idle): "ThreadPoolExecutor-0_3"
    poll (/lib/x86_64-linux-gnu/libc-2.31.so)
    0x7f109531b01d (/usr/local/lib/python3.7/lib-dynload/_ssl.cpython-37m-x86_64-linux-gnu.so)
    0x7f109531f81f (/usr/local/lib/python3.7/lib-dynload/_ssl.cpython-37m-x86_64-linux-gnu.so)
    read (/usr/local/lib/python3.7/ssl.py:929)
        Arguments:
            self: <SSLSocket at 0x7f105c194f30>
            len: 8192
            buffer: <memoryview at 0x7f1063464ef0>
    recv_into (/usr/local/lib/python3.7/ssl.py:1071)
        Arguments:
            self: <SSLSocket at 0x7f105c194f30>
            buffer: <memoryview at 0x7f1063464ef0>
            nbytes: 8192
            flags: 0
    readinto (/usr/local/lib/python3.7/socket.py:589)
        Arguments:
            self: <SocketIO at 0x7f104c5c4490>
            b: <memoryview at 0x7f1063464ef0>
    readinto (/usr/local/lib/python3.7/http/client.py:509)
        Arguments:
            self: <HTTPResponse at 0x7f104c5c4250>
            b: <bytearray at 0x7f104c7b6df0>
    read (/usr/local/lib/python3.7/http/client.py:465)
        Arguments:
            self: <HTTPResponse at 0x7f104c5c4250>
            amt: 4096
        Locals:
            b: <bytearray at 0x7f104c7b6df0>
    read (/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py:518)
        Arguments:
            self: <HTTPResponse at 0x7f105c1ac310>
            amt: 4096
            decode_content: True
            cache_content: False
        Locals:
            flush_decoder: False
            fp_closed: False
    stream (/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py:575)
        Arguments:
            self: <HTTPResponse at 0x7f105c1ac310>
            amt: 4096
            decode_content: True
        Locals:
            data: <bytes at 0x7f1050130ae0>
    generate (/home/airflow/.local/lib/python3.7/site-packages/requests/models.py:816)
    __next__ (/home/airflow/.local/lib/python3.7/site-packages/azure/core/pipeline/transport/_requests_basic.py:169)
        Arguments:
            self: <StreamDownloadGenerator at 0x7f104c7b6190>
        Locals:
            internal_response: <Response at 0x7f104c6be810>
    process_content (/home/airflow/.local/lib/python3.7/site-packages/azure/storage/blob/_download.py:52)
        Arguments:
            data: <StreamDownloadGenerator at 0x7f104c7b6190>
            start_offset: 0
            end_offset: 0
            encryption: {"required": False, "key": None, "resolver": None}
    _download_chunk (/home/airflow/.local/lib/python3.7/site-packages/azure/storage/blob/_download.py:211)
        Arguments:
            self: <_ChunkDownloader at 0x7f106937b5d0>
            chunk_start: 10670309376
            chunk_end: 10674503679
        Locals:
            download_range: (10670309376, 10674503679)
            offset: (0, 0)
            range_header: "bytes=10670309376-10674503679"
            range_validation: None
            retry_active: True
            retry_total: 3
            _: "primary"
            response: <StreamDownloadGenerator at 0x7f104c7b6190>
    process_chunk (/home/airflow/.local/lib/python3.7/site-packages/azure/storage/blob/_download.py:129)
        Arguments:
            self: <_ChunkDownloader at 0x7f106937b5d0>
            chunk_start: 10670309376
        Locals:
            chunk_end: 10674503680

And the following error message is printed out (this is from the Azure CLI):

Alive[############################################################### ] 99.8149%
urllib3.connectionpool: Connection pool is full, discarding connection: x.blob.core.windows.net
Alive[############################################################### ] 99.8380%
urllib3.connectionpool: Connection pool is full, discarding connection: x.blob.core.windows.net
Alive[############################################################### ] 99.8612%
urllib3.connectionpool: Connection pool is full, discarding connection: x.blob.core.windows.net
Alive[############################################################### ] 99.8843%
urllib3.connectionpool: Connection pool is full, discarding connection: x.blob.core.windows.net
Alive[############################################################### ] 99.9074%
urllib3.connectionpool: Connection pool is full, discarding connection: x.blob.core.windows.net
Alive[############################################################### ] 99.9306%

Using the Azure CLI, this reproduces every time when trying to fetch a large file. When using the Python SDK directly, it sometimes fails and sometimes succeeds.

To Reproduce
Steps to reproduce the behavior (Azure CLI):

  1. Create a large file, >= 17 GB
  2. Store it in blob storage
  3. Install Azure CLI
  4. Attempt to download the file using az storage blob download, passing the above parameters to the command (an example invocation is sketched below)
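
An example invocation (a sketch only; the account, container, blob, and output path are placeholders, and the flag names are assumed from the current az storage blob download options):

    az storage blob download \
        --account-name <account_name> \
        --container-name <container_name> \
        --name <blob> \
        --file ./large_blob.bin \
        --max-connections 20 \
        --timeout 1800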

Steps to reproduce the behavior (Python SDK directly):

  1. Create a large file, >= 17 GB
  2. Store it in blob storage
  3. Create a new Python project, install Azure Python SDK
  4. Run the following code:
    from azure.storage.blob import BlobServiceClient
    from time import time
    import logging

    conn_id = '<your_con_id>'  # connection string placeholder
    client = BlobServiceClient.from_connection_string(conn_id)

    blob_client = client.get_blob_client(container='<container_name>',
                                         blob='<blob>')

    print(f'Fetching blob from storage, {time()}')
    res = blob_client.download_blob(max_concurrency=20, timeout=1800)
    print(f'Starting to read file, {time()}')
    data = res.readall()

Expected behavior

Either the chunk that is hanging should fail on timeout and retry, or, if it has exhausted all retries, it should fail the download. Otherwise, the download should complete.

Additional Context

Related bugs:

@ghost added the customer-reported and question labels on Feb 9, 2023
@github-actions bot added the needs-triage label on Feb 9, 2023
@kashifkhan
Member

Thank you for the feedback @YuvalItzchakov . We will investigate and get back to you asap.

@kashifkhan added the Storage label on Feb 10, 2023
@ghost removed the needs-triage label on Feb 10, 2023
@jalauzon-msft
Member

Hi @YuvalItzchakov Yuval, I believe you are running into a couple of issues related to timeouts which are causing this behavior. Firstly, you are setting the timeout parameter of download_blob (and I assume the CLI command is as well), which sets the server-side operation timeout, not any client-side network timeouts. This timeout will have no effect if you are trying to set a client-side network timeout.

To configure client-side timeouts, see this section of our README. For reads, you would likely be interested in read_timeout, which configures how long the client will wait for a read operation to return. Any timeouts here will automatically be retried 3 times (by default). The option can be set on the individual operation call (download_blob) or on client construction. I'm not immediately aware of how to set this value in the CLI.
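
For illustration, a minimal sketch of configuring read_timeout (the connection string, container, and blob names below are placeholders, not values from this issue):

    from azure.storage.blob import BlobServiceClient

    # read_timeout (in seconds) bounds each client-side socket read and is retried
    # on expiry; it can be set on the client or passed per operation.
    client = BlobServiceClient.from_connection_string(
        '<your_connection_string>',
        read_timeout=60,
    )
    blob_client = client.get_blob_client(container='<container_name>', blob='<blob>')

    # timeout (server-side operation timeout) and read_timeout (client-side network
    # timeout) are independent settings.
    stream = blob_client.download_blob(max_concurrency=20, read_timeout=60)
    data = stream.readall()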

The other issue here is that before version 12.14.0, the default read_timeout was set extremely high, so downloads would essentially hang. In version 12.14.0 we set this default to a much more reasonable 60 seconds. So, updating to the latest version or manually configuring this timeout value will likely help in your scenario.

NOTE: In main we have updated the documentation for timeout to more clearly indicate it's a server-side timeout. We also are still evaluating further improvements to documentation around timeouts.

@YuvalItzchakov
Author

Thank you for the elaborate explanation @jalauzon-msft. Will try it and report back.

@xiangyan99 added the needs-author-feedback label on Feb 17, 2023
@YuvalItzchakov
Author

YuvalItzchakov commented Feb 21, 2023

Hi @jalauzon-msft. Updating the SDK version and using read_timeout has indeed mitigated the issue, and we now see the download fail if a particular chunk gets stuck! (Specifically, we are using apache-airflow-providers-microsoft-azure 5.2.0, since this is invoked via Airflow.)

A follow-up question: any reason why we're seeing these chunk downloads fail particularly on large files? Could there be something problematic at the blob storage API level?

@ghost added the needs-team-attention label and removed the needs-author-feedback label on Feb 21, 2023
@jalauzon-msft
Member

Hi @YuvalItzchakov Yuval, because the adjustments to read_timeout have helped, this means you are seeing client-side network timeouts when downloading the blob. There is a chance this could be caused by something on the Storage backend, but it is more likely something network-related on the client side (on your end).

If you have not changed the default settings for chunk sizes, the SDK will download the blob in 4 MiB chunks, and the new default read_timeout is 60 seconds. This means you were not able to download 4 MiB in the 60 seconds allotted. Additionally, these timeouts should automatically be retried a few times, so for one chunk to fail the download, it would need to time out a few times in a row.

Given that information, and the error you linked in the original post (urllib3.connectionpool: Connection pool is full, discarding connection: x.blob.core.windows.net), my hunch is that your client machine is experiencing some network congestion, causing a slowdown and possibly even some connections to be killed. It seems towards the end of the download you are running out of connections in your pool.

There are a number of things that could cause this, but one thing you could try is lowering the max_concurrency value you pass to download_blob. This value sets the number of threads the SDK will use, with each simultaneously downloading a chunk. Perhaps 20 is too many for your environment. The other thing you can try is tweaking the max_chunk_get_size keyword on client construction to change the chunk size. Making it larger reduces the overall number of chunks to download, which could reduce your number of connections; making it smaller means each individual chunk download is smaller, but there are more chunks to download. I would recommend trying max_concurrency first.
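
For illustration, a minimal sketch of the two knobs described above (the connection string and names are placeholders; max_chunk_get_size is in bytes):

    from azure.storage.blob import BlobServiceClient

    client = BlobServiceClient.from_connection_string(
        '<your_connection_string>',
        max_chunk_get_size=8 * 1024 * 1024,  # larger chunks -> fewer ranged GETs overall
        read_timeout=60,
    )
    blob_client = client.get_blob_client(container='<container_name>', blob='<blob>')

    # Fewer threads -> fewer simultaneous connections competing for the connection pool.
    stream = blob_client.download_blob(max_concurrency=4)
    data = stream.readall()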

If you still think the network on your end is okay, you can try opening a Support ticket to have the backend team take a look. They will have more resources to determine the cause but could come to the same conclusion I have.

@jalauzon-msft added the issue-addressed label on Feb 23, 2023
@ghost

ghost commented Feb 23, 2023

Hi @YuvalItzchakov. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

@ghost removed the needs-team-attention label on Feb 23, 2023
@ghost

ghost commented Mar 3, 2023

Hi @YuvalItzchakov, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.

@ghost closed this as completed on Mar 3, 2023
@github-actions bot locked the conversation and limited it to collaborators on Jun 1, 2023