Listing blobs names is very slow #19755
Comments
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage. Issue Details
Describe the bug For containers containing many blobs, listing blob names takes a lot of time and uses a lot of CPU, while it was fast for azure-storage-blob 2.1.0 To Reproduce
Expected behavior There is a way of listing blob names in azure-storage-blob 12.X that has similar performance as in azure-storage-blob 2.X. Additional context This might not be relevant for containers with a few thousand blobs. However, we have containers with a few hundred thousand to a million blobs, and bookkeeping operations that rely on listing the blob content that used to consume a little more than 1 minute of CPU time for azure-storage-blob now take almost 10 minutes, which is a significant contribution to the runtime of these tasks. azure-storage-blob 2.X was affected by a similar problem, which has been addressed via Azure/azure-storage-python#545. I would be willing to contribute a corresponding patch. However, it seems azure-storage-blobs 12.X uses a different way to deserialize the xml response that makes it harder to customize deserialization for azure-storage-blob 12.X, compared to 2.X.
|
Thanks for the feedback, we’ll investigate asap. |
To elaborate more on my last point ("it seems azure-storage-blobs 12.X uses a different way to deserialize the xml response that makes it harder to customize deserialization for azure-storage-blob 12.X, compared to 2.X"), I created a hacky patch that implements:

    from contextlib import contextmanager
    from unittest import mock

    import msrest.serialization


    class BlobItemNameOnly(msrest.serialization.Model):
        # Only the Name element of each Blob entry is mapped; everything
        # else in the listing payload is ignored during deserialization.
        _attribute_map = {
            'name': {'key': 'Name', 'type': 'str'},
        }
        _xml_map = {
            'name': 'Blob'
        }

        def __init__(
            self,
            *,
            name: str,
            **kwargs
        ):
            super(BlobItemNameOnly, self).__init__(**kwargs)
            self.name = name


    @contextmanager
    def _patch_blob_deserializer(container_client):
        # Temporarily swap the model class used for blob items; the
        # original entry is restored when the with-block exits.
        with mock.patch.dict(
            container_client._client._deserialize.dependencies,
            {"BlobItemInternal": BlobItemNameOnly}
        ):
            yield

With this patch active, |
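The patch relies on `mock.patch.dict` replacing one entry of a class registry for the duration of a `with` block and restoring it afterwards. Below is a minimal stdlib-only sketch of that mechanism. The names `dependencies`, `FullBlobItem`, `NameOnlyBlobItem` and `use_name_only_model` are hypothetical stand-ins for illustration, not SDK objects.

```python
from contextlib import contextmanager
from unittest import mock


class FullBlobItem:
    """Stand-in for the full model class a deserializer would normally use."""


class NameOnlyBlobItem:
    """Stand-in for a slimmed-down model that keeps only the name."""


# Hypothetical registry mapping model names to classes. It plays the role
# of the deserializer's `dependencies` dict in the patch above.
dependencies = {"BlobItemInternal": FullBlobItem}


@contextmanager
def use_name_only_model(registry):
    # patch.dict swaps the entry on entry and restores the original
    # mapping when the block exits, even if an exception is raised.
    with mock.patch.dict(registry, {"BlobItemInternal": NameOnlyBlobItem}):
        yield


with use_name_only_model(dependencies):
    inside = dependencies["BlobItemInternal"]  # patched class
after = dependencies["BlobItemInternal"]  # original class again
```

Because the swap is scoped to the `with` block, code that lists blobs outside the block still gets the full model.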
Thanks @jochen-ott-by! I found your patch very interesting! It's true that the current implementation deserializes the entire payload with no way to customize it. Additionally, newer service API versions return more data in the listing results than the default API version used by SDK 2.x did, which slows it down further. We have had discussions about replacing the current (de)serialization implementation, but this will be a fairly long-term project. @xiafu-msft - This sounds like it might be worth bringing up across languages, as I would think the deserialization of the full listing payload might be an issue that affects all the Blob SDKs. It could be worth looking at reimplementing the "list only names" feature - I don't think the service supports a |
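To illustrate why a names-only path can be cheaper than deserializing the full payload, here is a rough sketch that streams a listing response with `xml.etree.ElementTree.iterparse` and keeps only each blob's `Name`. The sample XML is hand-written to resemble a List Blobs response and the helper `iter_blob_names` is hypothetical. This is not how the SDK parses responses.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written sample shaped like a List Blobs response; not real service output.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ContainerName="demo">
  <Blobs>
    <Blob>
      <Name>a.txt</Name>
      <Properties><Content-Length>3</Content-Length></Properties>
    </Blob>
    <Blob>
      <Name>dir/b.txt</Name>
      <Properties><Content-Length>5</Content-Length></Properties>
    </Blob>
  </Blobs>
  <NextMarker />
</EnumerationResults>"""


def iter_blob_names(xml_bytes):
    """Yield the Name of each Blob element without building model objects."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "Blob":
            name = elem.findtext("Name")
            if name is not None:
                yield name
            # Drop the blob's children (properties etc.) once the name is read.
            elem.clear()


names = list(iter_blob_names(SAMPLE))
```

The per-blob work here is one lookup and one `clear()`, whereas a full deserializer has to visit and convert every property of every blob.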
After doing a bit more digging, there are a number of strategies we could invest in here.
Adding @tg-msft, @kasobol-msft and @mikeharder for their thoughts. Option 1 would only impact Python, however options 2 and 3 would probably need some cross-language consistency. |
Thanks for your patience @jochen-ott-by! I currently have a working prototype in development here that we have been running perf tests on: The numbers are looking promising, with improvements to listing in general, as well as providing the "names-only" deserialization shortcut. There's still a fair amount of work to be done to get these strategies "production-ready", as they dig quite deep into the HTTP pipeline code, and will need thorough testing - so I cannot give you a concrete timeframe when they will land in a release at this point. We will keep the thread open and updated as we progress. |
@tasherif-msft can you sync with @annatisch about this |
Here's the POC #19814 |
@tasherif-msft what is the latest on this? |
@amishra-dev the change on core is substantial and will take several iterations, as @annatisch has stated. To sidestep this in the meantime, we decided we can implement our own deserialization logic on our layer. @jalauzon-msft have you had a chance to investigate handling this deserialization on our layer? |
@tasherif-msft: Is adding |
Originally reported in #11593 |
Apologies for the long delay here, we've been busy with other higher priority work but hope to be getting to this soon. To update, we are planning to add |
I'm happy to share that we finally have opened #25747 to add Some initial perf testing results show this API is 1.5-14 times faster than the existing We will work to get this merged and released ASAP. Thanks all for your patience and apologies for the long delay on getting this in. |
Hi @jochen-ott-by and others, the Since this is merged and released, I'm going to close this issue. Thanks! |
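For readers who want to try a names-only listing, the following is a small sketch that assumes an azure-storage-blob release in which `ContainerClient.list_blob_names` is available and accepts the `name_starts_with` keyword. The helper `collect_blob_names` and the `FakeContainerClient` class are hypothetical. The fake exists only so the sketch runs without a storage account.

```python
def collect_blob_names(container_client, prefix=None):
    """Return all blob names, optionally restricted to a name prefix."""
    kwargs = {"name_starts_with": prefix} if prefix else {}
    return list(container_client.list_blob_names(**kwargs))


class FakeContainerClient:
    """Stand-in client (assumption) mimicking only list_blob_names."""

    def __init__(self, names):
        self._names = names

    def list_blob_names(self, name_starts_with=None, **kwargs):
        prefix = name_starts_with or ""
        return (n for n in self._names if n.startswith(prefix))


client = FakeContainerClient(["logs/a", "logs/b", "other/c"])
names = collect_blob_names(client, prefix="logs/")
```

With a real `ContainerClient`, the same helper would be called with that client in place of the fake.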
@hholst80: Can you share instructions to reproduce what you are seeing?
|
@hholst80: It looks like you are listing 8934 blobs. In our testing, we can list this many blobs in about 2 seconds using the In our test, our container only contains 8934 blobs. If your container contains a lot more blobs (but only 8934 blobs starting with Our config:
|
Since you are seeing the same performance between There are a number of other factors that can cause slower listing times in the backend:
NOTE: This should probably be tracked as a separate issue. |
@hholst80: Please open a new issue if you'd like us to investigate your scenario further. |
About 10 thousand files in the p10000 prefix. About one million in the container. |
In a standard storage account (with hierarchical namespace disabled), I created a container with 10k blobs named Both |
I reported the performance variation to support. Thank you for checking. Our pre-prod system is much faster than our prod system for some reason (similar amount of blobs in the container). |
Describe the bug
For containers containing many blobs, listing blob names takes a lot of time and uses a lot of CPU, while it was fast for azure-storage-blob 2.1.0
To Reproduce
Steps to reproduce the behavior:
Use list_blob_names to list the blob names for this container and write down the CPU time it takes (for my machine, it's 376ms).
Use list_blobs and access blob.name for the result. Again, write down the CPU time this takes (for my machine, it's 2760ms).
Expected behavior
There is a way of listing blob names in azure-storage-blob 12.X that has similar performance as in azure-storage-blob 2.X.
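The CPU-time comparison from the reproduction steps can be scripted with `time.process_time`, which counts CPU time of the current process rather than wall-clock time. This is a minimal sketch. The helper `cpu_seconds` is hypothetical and the workload is a stand-in generator so the snippet runs without a storage account.

```python
import time


def cpu_seconds(make_iterable):
    """Return (item_count, CPU seconds) spent fully consuming the iterable."""
    start = time.process_time()
    count = sum(1 for _ in make_iterable())
    return count, time.process_time() - start


# Stand-in workload. With a real client one would pass something like
# `lambda: (b.name for b in container_client.list_blobs())` instead.
count, elapsed = cpu_seconds(lambda: (str(i) for i in range(100_000)))
print(f"{count} items in {elapsed:.3f}s CPU")
```

Passing a factory rather than an iterator keeps the cost of creating the listing inside the measured region.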
Additional context
This might not be relevant for containers with a few thousand blobs. However, we have containers with a few hundred thousand to a million blobs, and bookkeeping operations that rely on listing the blob content that used to consume a little more than 1 minute of CPU time for azure-storage-blob now take almost 10 minutes, which is a significant contribution to the runtime of these tasks.
azure-storage-blob 2.X was affected by a similar problem which has been addressed via Azure/azure-storage-python#545
See also additional context there, in particular the use cases listed in Azure/azure-storage-python#545 (comment)
I would be willing to contribute a corresponding patch. However, it seems azure-storage-blobs 12.X uses a different way to deserialize the xml response that makes it harder to customize deserialization for azure-storage-blob 12.X, compared to 2.X.