Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage] Blob download: short reads #16723

Closed
jochen-ott-by opened this issue Feb 12, 2021 · 16 comments
Closed

[Storage] Blob download: short reads #16723

jochen-ott-by opened this issue Feb 12, 2021 · 16 comments
Assignees
Labels
Azure.Core bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team
Milestone

Comments

@jochen-ott-by
Copy link

  • Package Name: azure-storage-blob
  • Package Version: 12.5.0
  • Operating System: Debian 9
  • Python Version: python3.6

Other relevant packages: azure-core==1.10.0

Describe the bug

Reading the blob content can return too few bytes, leading to data corruption.

To Reproduce
Steps to reproduce the behavior:

  1. create some blobs with fake data (not too small: best beyond the maximum get blob size, which has a default of 32MB)
  2. download these blobs completely via BlobClient.download_blob().readall(). If the download returned without error, compare the content with the expected one from the upload in step 1.

Note that you have to repeat step 2. about 1 million times (depending on network, may be more or less; I estimate this is the order of probability we observe); alternatively, take other measures to increase the likelihood of a connection error while downloading the blob content.

Expected behavior

When the blob download completed without exception, the download content must match exactly with what was uploaded before, including the length.

Additional context

We observe this mainly when downloading parts of blobs, i.e. when passing non-zero offset and length parameters to BlobClient.download_blob()

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Feb 12, 2021
@jochen-ott-by
Copy link
Author

jochen-ott-by commented Feb 12, 2021

I wrote a test for the bug I believe is responsible for it, see
5dee530

I believe the bug is in the retry behavior of azure.core.pipeline.transport._request_basic.StreamDownloadGenerator.__next__:

while retry_active:
try:
chunk = next(self.iter_content_func)
if not chunk:
raise StopIteration()
self.downloaded += self.block_size
return chunk
except StopIteration:
self.response.internal_response.close()
raise StopIteration()
except (requests.exceptions.ChunkedEncodingError,
requests.exceptions.ConnectionError):
retry_total -= 1
if retry_total <= 0:
retry_active = False
else:
time.sleep(retry_interval)
headers = {'range': 'bytes=' + str(self.downloaded) + '-'}
resp = self.pipeline.run(self.request, stream=True, headers=headers)
if resp.http_response.status_code == 416:
raise
chunk = next(self.iter_content_func)
if not chunk:
raise StopIteration()
self.downloaded += len(chunk)
return chunk
continue

(self.iter_content_func is response.iter_content(self.block_size), where response is the requests.Response object)

What can happen is that L124 fetches the next chunk from the network. This usually works fine. However, assume there is a ConnectionError. This invokes the retry behavior starting at L134. If nothing else goes wrong, the next interesting thing that happens is L143: the code actually calls next on a generator that already raised an exception! This will always raise a StopIteration. This exception propagates and the caller of StreamDownloadGenerator.__next__ assumes that the download is complete (although it's not).

Some other observations:

  • It's unclear what L139-L142 are intended for. They seem to be intended to check if the range has become invalid. However, merely setting the "range" header is wrong for a number of reasons, including that blob storage requests usually set x-ms-range, which takes priority at the server. Also, the original request often has a range already set. Just starting to request from the current chunks .downloaded makes no sense.
  • L146 is inconsistent from L127 in how self.downloaded is updated. (L146 is wrong)
  • In general, re-trying to re-use the same connection after there was a connection error is extremely fragile, as it's usually not clear at which position the low-level stream actually is and how much bytes (if any) have been read (e.g. into some intermediate buffer). Also, it's unclear what kind of connection errors would actually benefit from a retry that uses the same (broken!) connection. So I think the whole retry code should be removed: Instead, this should just raise and give the layers above a chance to initiate a re-try.

jochen-ott-by added a commit to jochen-ott-by/azure-sdk-for-python that referenced this issue Feb 12, 2021
@jochen-ott-by
Copy link
Author

A possible fix for the requests backend would simply be to remove the retry code and instead let any connection-related exception propagate to the caller rather than returning too few bytes, see
a198cf7

However, other backends should also be reviewed and adapted if necessary.

@xiangyan99 xiangyan99 self-assigned this Feb 13, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Feb 13, 2021
@xiangyan99 xiangyan99 added bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library. and removed question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Feb 13, 2021
@xiangyan99
Copy link
Member

Thanks for the feedback, we’ll investigate asap.

@xiangyan99 xiangyan99 added this to the [2021] March milestone Feb 16, 2021
@xiangyan99
Copy link
Member

@jochen-ott-by Thanks for your feedback. I am afraid we cannot just simply remove the retry logic because

  1. It has values for most of the customers
  2. Removing is a potential breaking change

I will do some investigation and try to repro it. (Note that you have to repeat step 2. about 1 million times sounds a little tricky)

@jochen-ott-by
Copy link
Author

jochen-ott-by commented Feb 17, 2021

@jochen-ott-by Thanks for your feedback. I am afraid we cannot just simply remove the retry logic because

1. It has values for most of the customers
2. Removing is a potential breaking change

Let me be clear here: the retry logic like it's implemented right now (for the requests backend) is completely buggy and it's never effective in actually retrying anything. It will instead always return corrupted data when these exception happen. It has negative value for all customers, a very large one for some.

Removing the retry is a fix. Nothing less, nothing more. One can remove the re-try without changing the interface, so this does not have to be "breaking" (if one would only apply the changes in __next__, this would fix the issue already and be completely API compatible).

(Note that for other backends, the story might be a bit different, as they are read-based, and re-trying a read might make sense).

So here is a fix for the requests backend that is API compatible and explains in the comments in some depth why a re-try at that point in the code cannot be correct:
2d31192

jochen-ott-by added a commit to jochen-ott-by/azure-sdk-for-python that referenced this issue Feb 17, 2021
@peter-hoffmann-by
Copy link

Hi @xiangyan99,

5dee530 has a reproducing test for the wrong behaviour, so better start with this one.

@xiangyan99
Copy link
Member

@jochen-ott-by Thanks for your feedback. I am afraid we cannot just simply remove the retry logic because

1. It has values for most of the customers
2. Removing is a potential breaking change

Let me be clear here: the retry logic like it's implemented right now (for the requests backend) is completely buggy and it's never effective in actually retrying anything. It will instead always return corrupted data when these exception happen. It has negative value for all customers, a very large one for some.

Removing the retry is a fix. Nothing less, nothing more. One can remove the re-try without changing the interface, so this does not have to be "breaking" (if one would only apply the changes in __next__, this would fix the issue already and be completely API compatible).

(Note that for other backends, the story might be a bit different, as they are read-based, and re-trying a read might make sense).

So here is a fix for the requests backend that is API compatible and explains in the comments in some depth why a re-try at that point in the code cannot be correct:
2d31192

I agree with you. Buggy behavior has negative values.

So let's fix the buggy behavior and keep the retry logic. :)

I will keep you updated.

@xiangyan99
Copy link
Member

I see there is something buggy in L140 that if ConnectionError pops up, we don't handle it correctly. I proposed a fix for it. I will let you know when it is released.

@lr4d
Copy link

lr4d commented Feb 19, 2021

@xiangyan99 I think the point that @jochen-ott-by tries to make here is that the current retry implementation is broken, not that retrying per se is the culprit.

The retry logic could be re-implemented in a way which does not re-use the same connection so as to address this point by @jochen-ott-by

In general, re-trying to re-use the same connection after there was a connection error is extremely fragile, as it's usually not clear at which position the low-level stream actually is and how much bytes (if any) have been read (e.g. into some intermediate buffer).

As some background, we've been seeing very weird IO errors because of this bug, which have taken months to debug, given the nondeterministic and low-level nature of the issue.

@xiangyan99
Copy link
Member

@xiangyan99 I think the point that @jochen-ott-by tries to make here is that the current retry implementation is broken, not that retrying per se is the culprit.

The retry logic could be re-implemented in a way which does not re-use the same connection so as to address this point by @jochen-ott-by

In general, re-trying to re-use the same connection after there was a connection error is extremely fragile, as it's usually not clear at which position the low-level stream actually is and how much bytes (if any) have been read (e.g. into some intermediate buffer).

As some background, we've been seeing very weird IO errors because of this bug, which have taken months to debug, given the nondeterministic and low-level nature of the issue.

Thank you.

Yes, I think we are on same page.

Existing retry functions incorrectly.

Let's fix it.

@yangchengs
Copy link

The retry block:
Should we keep retry for ChunkedEncodingError and remove L133.
Add another block to handle ConnectionError.

@jochen-ott-by
Copy link
Author

jochen-ott-by commented Feb 22, 2021

I'm very concerned about the retry at this point of the library stack. Let me take some time to explain why and how complex a correct solution would be. I hope this convinces you to simply remove the retries altogether, which in my opinion is the proper way to deal with the problem at this point of the code.

First, as some context, consider what this code is used for: it is the basic building block for a lot of azure python sdk libraries. As such, it should be extremely careful about making assumptions about:

  • how it is used by other azure python sdk libraries
  • how servers behave
  • how underlying libraries (requests, urllib3 in this case) behave.

So this core library should avoid any assumptions beyond what the http standard allows.

When sticking to what the http standard says, a correct retry implementation is pretty involved. I listed some of the things to consider below. I make no claim this list is complete. Some of the points are not blockers in that they merely limit the usefulness of the re-try to certain situations (such as the signature problem), or they make the re-try much less efficient (when re-submitting the original request for a retry after all).

  • not all requests should be re-tried, re-try should be limited to safe http methods. As this is about streaming the body, the only http method this applies to is GET.
  • making the request again with a certain "range" header will require choosing proper parameters for the request range. I discussed in some detail in this comment: Raise exception rather than swallowing it if there is something wrong… #16783 (comment) why this is not possible in general with the requests/urllib3 stack. So one either has to restrict this "re-try with subrange" to certain cases that are known to work (such as: content-encoding and transfer encoding are non-chunked identify encodings). Alternatively, the re-try could re-submit the original request, without the attempt to set a new "range" parameter and discard any data already returned to the caller (note that this also side-steps the signature and range vs. x-ms-range problems discussed below).
  • setting a "range" header in the request that re-tries could be ignored by the service. This will be true e.g. for the GET blob operation, which will give precedence to x-ms-range, see https://docs.microsoft.com/en-us/rest/api/storageservices/specifying-the-range-header-for-blob-service-operations
  • setting a "range" header can invalidate authentication headers, see https://docs.microsoft.com/en-us/rest/api/storageservices/authorize-with-shared-key#blob-queue-and-file-services-shared-key-authorization
  • The server might ignore the "range" header and respond with 200, rather than 206. This must be checked in the client.
  • The response of the new request in the re-try will attempt to combine data from different requests. Such combination of partial data is prone to data corruption e.g. in case the server creates the response dynamically and can can have byte-level differences (say, the order of json keys), even if the semantics are the same. This is considered in the http standard, see section 3.3 of RFC 7234 ("Combining Partial Content") and section 4.3. of RFC 7233 ("Combining Ranges"). In short, ranges must not be combined, unless the responses have the same (non-weak) ETag.
  • The client MUST check the headers of the response, not only for status code 206 but also the specific range is what was expected, as other range fields (such as x-ms-range) can take precedence over the "range" header field.

Overall, the implementation of re-tries at this level of the library will be pretty complex, and still only apply to a certain subset of cases (for example, the re-try that sets ranges will never be effective for GET blob operations for the reasons discussed above). I therefore think the re-try at this point should be removed. It's better to let the exception propagate and to re-try at a higher level, where new requests for ranges can be constructed correctly (e.g. with the correct signature). Also, at this point there is usually a lot more context information available, which means that re-tries at higher levels do not need to consider all of the special cases listed above. For example, they might already be in the code path of a GET operation, so there would be no branching concerning the first point.

One final thought: even if you implement re-tries at some point in time, addressing all the subtleties I listed above, I think you should give some priority to fixing this bug quickly first, on a shorter time frame. After all, the current code is a bug resulting in data corruption. This should get more priority than implementing a re-try (which so far was, effectively, not there).

@xiangyan99
Copy link
Member

@jochen-ott-by you are right. It is more complex than we expected. :(

We checked in #17078 to disable the retry and will release the work around next week.

Thanks for the feedback and contribution.

@xiangyan99
Copy link
Member

Hi @jochen-ott-by, we disabled the retry in latest release. Please let us know if you are still blocked. Thank you.

@xiangyan99 xiangyan99 added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Mar 10, 2021
@jochen-ott-by
Copy link
Author

Hi @jochen-ott-by, we disabled the retry in latest release. Please let us know if you are still blocked. Thank you.

The fix looks good, thanks! I consider this fixed via #17078.

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Mar 11, 2021
@xiangyan99
Copy link
Member

Yes. That was the PR. Thanks for the confirmation.

openapi-sdkautomation bot pushed a commit to AzureSDKAutomation/azure-sdk-for-python that referenced this issue Jan 19, 2022
Compute 2021-11-01 Release (Azure#17120)

* compute folder with pre-population

* fix examples folder

* fix an example file

* update compute to match last version

* run prettier fix examples

* change capitalization

* Compute Swagger changes to include cross-region Restore Points scenarios (Azure#16570)

* RRP changes in swagger

* Fix conflicts

* fix validations

* Fix instance view

* Revert "Compute Swagger changes to include cross-region Restore Points scenarios (Azure#16570)" (Azure#16663)

This reverts commit 70a8729e86b30440cdd3c239272e31dff7f9627b.

* sync with last version

* add missing examples

* suppress bodyTopLevelProperties

* Add VMSS filter to List VMs (Azure#16813)

* change1 for change file

* change description

* change description

* add suppression for required properties

* Revert "Add VMSS filter to List VMs (Azure#16813)" (Azure#16956)

This reverts commit 231fd2260cb9e63ba16d8b15d405f3134fe612c1.

* Remove impossible state from example (Azure#16544)

* Downmerging change from Azure:main (Azure#16654)

Co-authored-by: Avinash Akka <[email protected]>

* Added CVM settings for version 2021-11-01 (Azure#16622)

* Added CVM settings

* added examples

* Add CMK examples

* fixed typo

* fixed file name

* prettier check

* removed a br

* changed to VMDiskSecurityProfile

* type as object

* Added vm size properties for vm scale set (Azure#16723)

* Added vm size properties

* Added get examples

* Added example for vm size properties

* Added example for vm size properties

Co-authored-by: Theodore Chang <[email protected]>

* Spec for ProtectedSettingsFromKeyVault and AllowExtensionOperations (Azure#16590)

* KV changes

* adding allowExtensionOperation

* fixing examples

* Add repairAction to auto repairs feature and update grace period to PT10M (Azure#16535)

* Add timeCreated to properties for VM, VMSS, CR, DH resources (Azure#16539)

* add creationTime to properties for VM, VMSS, CR, DH resources + remove
required location for Resources

* rename creationTime to timeCreated

* examples for timeCreated

* add minimum api-version to descriptions for timeCreated

* add GetVirtualMachineScaleSet example

* reformat example

Co-authored-by: Chase Van Buskirk <[email protected]>

* DedicatedHost Reboot Feature (Azure#16737)

* added new feature to compute.json

* added example for my feature

* prettier check on the reboot example

* fixed names of parameters in example file:

* changed name to DedicatedHosts_reboot and moved change to where the other DH APIs are

* added cloud error to the reboot

* fixed default error, it was in the wrong spot

* moved reboot to end of specs

* moved older swagger files to 2021-11-01

* reverted previous commit

* renamed reboot to restart as per sameers comment

* updated description as per sameers comment

* updated api version to 2021-11-01 per sameers comment

* Adding the new paramaters zone/placementGroupId to forceRecoveryServiceFabricPlatformUpdateDomainWalk VMSS API (Azure#17041)

* save (Azure#17091)

Co-authored-by: Theodore Chang <[email protected]>

* Update compute.json (Azure#16482)

When a customer tries to scale VMSS using Terraform, since the current definitions do not have publicIpPrefix property, the new VM will be assigned a random IP address which is outside the range of public Ip prefix. Customer has to resolve the issue by deleting the Vmss and recreating it. This PR tries to resolve the issue by adding the publicIpPrefix property in the JSON definition which is used to generate a request for VMSS update.

More details about the issue can be found here - Azure/azure-rest-api-specs#10190

* add vmss filter to list  (Azure#16957)

* change1 for change file

* change description

* change description

Co-authored-by: LexieXie <[email protected]>

* move Kashif's change to 2021-11-01

* Revert "Spec for ProtectedSettingsFromKeyVault and AllowExtensionOperations (Azure#16590)" (Azure#17121)

This reverts commit 220cfd0638942c04275d69fd485ceb2da02a96d3.

* fix CI failures, and run prettier on added examples

* for credscan. change password example

* Update readme.python.md

* Compute Swagger changes to include cross-region Restore Points scenarios (Azure#16682)

* RRP changes in swagger

* Fix conflicts

* fix validations

* Fix instance view

* change instance view

* Fix example

* Fix prettier

* Fix and modify description

* Review comments

* make new api call long-running-operation

* compute folder with pre-population

* fix examples folder

* fix an example file

* update compute to match last version

* run prettier fix examples

* change capitalization

* Compute Swagger changes to include cross-region Restore Points scenarios (Azure#16570)

* RRP changes in swagger

* Fix conflicts

* fix validations

* Fix instance view

* Revert "Compute Swagger changes to include cross-region Restore Points scenarios (Azure#16570)" (Azure#16663)

This reverts commit 70a8729e86b30440cdd3c239272e31dff7f9627b.

* sync with last version

* add missing examples

* Add VMSS filter to List VMs (Azure#16813)

* change1 for change file

* change description

* change description

* Revert "Add VMSS filter to List VMs (Azure#16813)" (Azure#16956)

This reverts commit 231fd2260cb9e63ba16d8b15d405f3134fe612c1.

* Remove impossible state from example (Azure#16544)

* Added CVM settings for version 2021-11-01 (Azure#16622)

* Added CVM settings

* added examples

* Add CMK examples

* fixed typo

* fixed file name

* prettier check

* removed a br

* changed to VMDiskSecurityProfile

* type as object

* Added vm size properties for vm scale set (Azure#16723)

* Added vm size properties

* Added get examples

* Added example for vm size properties

* Added example for vm size properties

Co-authored-by: Theodore Chang <[email protected]>

* Spec for ProtectedSettingsFromKeyVault and AllowExtensionOperations (Azure#16590)

* KV changes

* adding allowExtensionOperation

* fixing examples

* Add repairAction to auto repairs feature and update grace period to PT10M (Azure#16535)

* Add timeCreated to properties for VM, VMSS, CR, DH resources (Azure#16539)

* add creationTime to properties for VM, VMSS, CR, DH resources + remove
required location for Resources

* rename creationTime to timeCreated

* examples for timeCreated

* add minimum api-version to descriptions for timeCreated

* add GetVirtualMachineScaleSet example

* reformat example

Co-authored-by: Chase Van Buskirk <[email protected]>

* DedicatedHost Reboot Feature (Azure#16737)

* added new feature to compute.json

* added example for my feature

* prettier check on the reboot example

* fixed names of parameters in example file:

* changed name to DedicatedHosts_reboot and moved change to where the other DH APIs are

* added cloud error to the reboot

* fixed default error, it was in the wrong spot

* moved reboot to end of specs

* moved older swagger files to 2021-11-01

* reverted previous commit

* renamed reboot to restart as per sameers comment

* updated description as per sameers comment

* updated api version to 2021-11-01 per sameers comment

* Adding the new paramaters zone/placementGroupId to forceRecoveryServiceFabricPlatformUpdateDomainWalk VMSS API (Azure#17041)

* save (Azure#17091)

Co-authored-by: Theodore Chang <[email protected]>

* Update compute.json (Azure#16482)

When a customer tries to scale VMSS using Terraform, since the current definitions do not have publicIpPrefix property, the new VM will be assigned a random IP address which is outside the range of public Ip prefix. Customer has to resolve the issue by deleting the Vmss and recreating it. This PR tries to resolve the issue by adding the publicIpPrefix property in the JSON definition which is used to generate a request for VMSS update.

More details about the issue can be found here - Azure/azure-rest-api-specs#10190

* add vmss filter to list  (Azure#16957)

* change1 for change file

* change description

* change description

Co-authored-by: LexieXie <[email protected]>

* move Kashif's change to 2021-11-01

* fix CI failures, and run prettier on added examples

* Revert "Spec for ProtectedSettingsFromKeyVault and AllowExtensionOperations (Azure#16590)" (Azure#17121)

This reverts commit 220cfd0638942c04275d69fd485ceb2da02a96d3.

* for credscan. change password example

* make new api call long-running-operation

* Update readme.python.md

* Compute Swagger changes to include cross-region Restore Points scenarios (Azure#16682)

* RRP changes in swagger

* Fix conflicts

* fix validations

* Fix instance view

* change instance view

* Fix example

* Fix prettier

* Fix and modify description

* Review comments

* update Repair action to enum and update readme

* rebase to main since 2021-08-01 merged. update readme

* update x-ms-enum name for RepairType

* add default response to operations

* update

* put back 'required' tag for Resource.Location property
and use a new object for VM_LIST return object

* fix json format

* update examples

* run prettier on updated examples

* update example for credScan

* add VirtualMachineResource for toplevel property suppression.

* lint diff errors

* lint diff fix update

* remove change for VMextensions.location bug

* return readme file and examples before vm.vmextension.location change

Co-authored-by: sukodava <[email protected]>
Co-authored-by: Dapeng Zhang <[email protected]>
Co-authored-by: xielexie <[email protected]>
Co-authored-by: Mike Richmond <[email protected]>
Co-authored-by: Avinash <[email protected]>
Co-authored-by: Avinash Akka <[email protected]>
Co-authored-by: ms-saypaul <[email protected]>
Co-authored-by: Raktima Das <[email protected]>
Co-authored-by: kamusta-msft <[email protected]>
Co-authored-by: frank-pang-msft <[email protected]>
Co-authored-by: Chase VanBuskirk <[email protected]>
Co-authored-by: Chase Van Buskirk <[email protected]>
Co-authored-by: vbhasker-msft <[email protected]>
Co-authored-by: avjai <[email protected]>
Co-authored-by: kangsun-ctrl <[email protected]>
Co-authored-by: karthikka4820 <[email protected]>
Co-authored-by: LexieXie <[email protected]>
Co-authored-by: Yuchao Yan <[email protected]>
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Azure.Core bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team
Projects
None yet
Development

No branches or pull requests

5 participants