Listing blobs names is very slow #19755

jochen-ott-by · 2021-07-12T13:31:31Z

Package Name: azure-storage-blob
Package Version: 12.8.1
Operating System: linux (Debian 9)
Python Version: 3.8

Describe the bug

For containers with many blobs, listing blob names takes a long time and uses a lot of CPU, whereas it was fast in azure-storage-blob 2.1.0.

To Reproduce
Steps to reproduce the behavior:

1. Create a blob storage container with at least 5000 blobs (i.e. maxresults for a single page returned by the List Blobs API).
2. Using azure-storage-blob 2.1.0, call list_blob_names to list the blob names in this container and note the CPU time it takes (on my machine, 376 ms).
3. Using azure-storage-blob 12.8.1, which unfortunately has no list_blob_names function, call list_blobs and access blob.name on each result. Again, note the CPU time this takes (on my machine, 2760 ms).
4. Compare the CPU times from steps 2 and 3 and note the large factor (more than 7 in my case).
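
One way to take the CPU-time measurements described in these steps is sketched below. It is only an illustration, not the script the reporter used: the helper name cpu_time_of_listing is hypothetical, and the generator in the demo is a stand-in for a real listing call.

```python
import time
from typing import Callable, Iterable, Tuple


def cpu_time_of_listing(list_names: Callable[[], Iterable[str]]) -> Tuple[int, float]:
    """Consume the iterable returned by list_names and return (count, CPU seconds).

    time.process_time() counts CPU time of the current process only, so time
    spent waiting on the network is not included.
    """
    start = time.process_time()
    count = sum(1 for _ in list_names())
    return count, time.process_time() - start


if __name__ == "__main__":
    # Stand-in workload (assumption): replace the lambda with a real listing call.
    count, seconds = cpu_time_of_listing(lambda: (f"blob-{i}" for i in range(5000)))
    print(f"listed {count} names in {seconds * 1000:.1f} ms CPU time")
```

With 12.x, the callable could be something like `lambda: (b.name for b in container_client.list_blobs())`, where container_client is an already constructed ContainerClient; with 2.1.0 it would wrap the list_blob_names call instead.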

Expected behavior

There is a way of listing blob names in azure-storage-blob 12.X whose performance is similar to that of azure-storage-blob 2.X.

Additional context

This might not be relevant for containers with a few thousand blobs. However, we have containers with a few hundred thousand to a million blobs, and bookkeeping operations that rely on listing the blob content, which used to consume a little more than 1 minute of CPU time with azure-storage-blob 2.X, now take almost 10 minutes. This is a significant contribution to the runtime of these tasks.

azure-storage-blob 2.X was affected by a similar problem which has been addressed via Azure/azure-storage-python#545
See also additional context there, in particular the use cases listed in Azure/azure-storage-python#545 (comment)

I would be willing to contribute a corresponding patch. However, it seems azure-storage-blob 12.X deserializes the XML response in a different way, which makes it harder to customize deserialization than in 2.X.

ghost · 2021-07-12T18:25:49Z

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

Issue Details

Author:	jochen-ott-by
Assignees:	-
Labels:	`Client`, `Service Attention`, `Storage`, `bug`, `customer-reported`, `needs-triage`, `question`
Milestone:	-

xiangyan99 · 2021-07-12T18:25:50Z

Thanks for the feedback, we’ll investigate asap.

ghost · 2021-07-12T18:25:52Z

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

jochen-ott-by · 2021-07-13T05:22:07Z

To elaborate more on my last point ("it seems azure-storage-blobs 12.X uses a different way to deserialize the xml response that makes it harder to customize deserialization for azure-storage-blob 12.X, compared to 2.X"), I created a hacky patch that implements list_blob_names for azure-storage-blob 12.X which works by patching the deserialization code to only extract the name:

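from contextlib import contextmanager
from unittest import mock

import msrest.serialization


class BlobItemNameOnly(msrest.serialization.Model):
    """Stand-in for BlobItemInternal that deserializes only the blob name."""

    _attribute_map = {
        'name': {'key': 'Name', 'type': 'str'},
    }
    _xml_map = {
        'name': 'Blob'
    }

    def __init__(
        self,
        *,
        name: str,
        **kwargs
    ):
        super(BlobItemNameOnly, self).__init__(**kwargs)
        self.name = name


@contextmanager
def _patch_blob_deserializer(container_client):
    # Temporarily replace the generated BlobItemInternal model with the
    # name-only model in the deserializer's model registry. This relies on
    # private attributes of the client, so it may break between releases.
    with mock.patch.dict(
        container_client._client._deserialize.dependencies,
        {"BlobItemInternal": BlobItemNameOnly}
    ):
        yield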

With this patch active, container_client.list_blobs returns instances of BlobItemNameOnly, and parsing a single page of XML results of around 5000 blobs drops from 2.76 s of CPU time to 0.48 s.

annatisch · 2021-07-13T21:31:38Z

Thanks @jochen-ott-by! I found your patch very interesting!

It's true that the current implementation deserializes the entire payload with no ability to customize. Additionally, compared with the default API version used in SDK 2.x, more data is now being returned in the listing results, which slows it down further.

We have had discussions about replacing the current (de)serialization implementation - however this will be a fairly long term project.

@xiafu-msft - This sounds like it might be worth bringing up across languages, as I would think the deserialization of the full listing payload might be an issue that affects all the Blob SDKs. It could be worth looking at reimplementing the "list only names" feature - I don't think the service supports a select feature here? So I guess it would be a client-side implementation of select?

annatisch · 2021-07-14T14:10:33Z

After doing a bit more digging, there are a number of strategies we could invest in here.

1. We start the process of migrating to a new XML deserialization pipeline that is more efficient than the code provided in the msrest lib. This will ultimately happen; however, it's not a quick solution, as it will take some time. Additionally, it doesn't resolve the more immediate issue of eagerly deserializing the entire payload.
2. We look at whether we can 'lazily' deserialize the XML. This would also not be a quick fix, and would need some major internal rewiring to support. It also changes the error behaviour somewhat - an invalid model would now only be detected when its content was accessed rather than when it was received. So I'm somewhat doubtful that this is the right way to go. We'd also want to discuss this one across languages.
3. We implement some kind of client-side 'select' feature, where we only deserialize specific fields in the response. This could be surfaced as an option to the current list_blobs API, or as an entirely new API. This would be the quickest solution to the problem - and while it doesn't improve all-round performance, it should better enable this scenario, and gives us more time to properly design option 1.
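
To make the client-side 'select' idea concrete, here is a rough sketch that pulls only the blob names out of a listing page and ignores every other field. It is not the SDK's implementation: it uses the standard library's xml.etree.ElementTree instead of the msrest pipeline, the function name iter_blob_names is hypothetical, and SAMPLE_PAGE is a hand-written, simplified approximation of a List Blobs response.

```python
import io
import xml.etree.ElementTree as ET
from typing import Iterator

# Simplified, hand-written approximation of one List Blobs page (assumption).
SAMPLE_PAGE = b"""<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults>
  <Blobs>
    <Blob>
      <Name>a.txt</Name>
      <Properties><Content-Length>3</Content-Length><BlobType>BlockBlob</BlobType></Properties>
    </Blob>
    <Blob>
      <Name>dir/b.txt</Name>
      <Properties><Content-Length>7</Content-Length><BlobType>BlockBlob</BlobType></Properties>
    </Blob>
  </Blobs>
  <NextMarker />
</EnumerationResults>"""


def iter_blob_names(xml_bytes: bytes) -> Iterator[str]:
    """Yield only the <Name> of each <Blob>; no model objects are built."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "Blob":
            name = elem.findtext("Name")
            if name is not None:
                yield name
            # Discard the processed subtree so memory use stays flat.
            elem.clear()


print(list(iter_blob_names(SAMPLE_PAGE)))  # ['a.txt', 'dir/b.txt']
```

The XML is still parsed in full; the saving in this sketch comes from not converting each blob's properties into Python objects.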

Adding @tg-msft, @kasobol-msft and @mikeharder for their thoughts. Option 1 would only impact Python, however options 2 and 3 would probably need some cross-language consistency.

annatisch · 2021-07-26T17:46:22Z

Thanks for your patience @jochen-ott-by!
We've been chatting about the best approach, and are considering tackling this by addressing both options 1 & 3 in my post above (where option 3 takes the form of a separate list_names API, similar to the v2 SDK).

I currently have a working prototype in development here that we have been running perf tests on:
#19814

The numbers are looking promising, with improvements to listing in general, as well as providing the "names-only" deserialization shortcut. There's still a fair amount of work to be done to get these strategies "production-ready", as they dig quite deep into the HTTP pipeline code, and will need thorough testing - so I cannot give you a concrete timeframe when they will land in a release at this point.

We will keep the thread open and updated as we progress.
Thanks again for the report!

amishra-dev · 2021-12-01T01:16:14Z

@tasherif-msft can you sync with @annatisch about this

tasherif-msft · 2021-12-02T00:37:15Z

Here's the POC #19814
This will probably take a while so we are looking to ship a workaround to improve the perf in the meantime. I will update you soon.

amishra-dev · 2022-03-01T00:54:17Z

@tasherif-msft what is the latest on this?

tasherif-msft · 2022-03-01T02:02:16Z

@amishra-dev the change on core is substantial and will take several iterations, as @annatisch has stated. To sidestep this in the meantime, we decided we can implement our own deserialization logic on our layer. @jalauzon-msft have you had a chance to investigate handling this deserialization on our layer?

mikeharder · 2022-03-01T02:08:02Z

@tasherif-msft: Is adding list_blob_names back to the SDK still under consideration?

b-c-lucas · 2022-03-10T00:41:47Z

Originally reported in #11593

jalauzon-msft · 2022-06-30T21:46:05Z

Apologies for the long delay here, we've been busy with other higher priority work but hope to be getting to this soon.

To update, we are planning to add get_blob_names as a new API when we can. This will be a version of the list blobs API that only lists blob names; it will be significantly faster than the full list_blobs and will hopefully come close to or surpass the performance of the older version of the library. We will start from Anna's Draft PR and extract just the pieces necessary for get_blob_names. This means we will start with custom XML parsing only for this new API and will introduce a small subset of the full custom XML parsing to support this scenario.

jalauzon-msft · 2022-08-19T19:01:40Z

I'm happy to share that we have finally opened #25747 to add list_blob_names to the Track2 Blob SDK. This API, like the Track1 equivalent, will call the standard List Blobs API but only parse and return the blob names, which results in a significant speedup over the traditional list_blobs API when only names are desired.
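
As a rough illustration of how the new API might be called, the sketch below collects names through list_blob_names, optionally filtered with the name_starts_with keyword that appears later in this thread. The helper collect_blob_names and the FakeContainerClient class are hypothetical and exist only so the example runs without an Azure account; with the real SDK you would pass an actual ContainerClient instead.

```python
from typing import Iterator, List, Optional


def collect_blob_names(container_client, prefix: Optional[str] = None) -> List[str]:
    """Return all blob names, optionally restricted to a name prefix."""
    kwargs = {}
    if prefix is not None:
        kwargs["name_starts_with"] = prefix
    return list(container_client.list_blob_names(**kwargs))


class FakeContainerClient:
    """Hypothetical in-memory stand-in for a container client (assumption)."""

    def __init__(self, names):
        self._names = list(names)

    def list_blob_names(self, *, name_starts_with: Optional[str] = None, **kwargs) -> Iterator[str]:
        for name in self._names:
            if name_starts_with is None or name.startswith(name_starts_with):
                yield name


client = FakeContainerClient(["p10000/a", "p10000/b", "other/c"])
print(collect_blob_names(client, prefix="p10000/"))  # ['p10000/a', 'p10000/b']
```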

Some initial perf testing results show this API is 1.5-14 times faster than the existing list_blobs depending on the number of blobs in the container. (The more blobs the larger the performance increase). See comment in #25747 for details. Specifically for the OP's case of 5000 blobs, we expect somewhere around 8-10x improvement.

We will work to get this merged and released ASAP. Thanks all for your patience and apologies for the long delay on getting this in.

jalauzon-msft · 2022-08-31T18:16:47Z

Hi @jochen-ott-by and others, the list_blob_names API was released yesterday in a beta release, 12.14.0b2! Please feel free to try out the beta and provide any feedback on the API. We are planning to do a full release that will include list_blob_names sometime in September/early October.

Since this is merged and released, I'm going to close this issue. Thanks!

hholst80 · 2022-11-18T20:18:58Z

I have quite a lot of blobs in my container, but I do not see any performance improvement.

Please advise.

Update: from inside Azure networking to a Storage account in the same account. Terrible performance.

mikeharder · 2022-12-06T21:34:19Z

@hholst80: Can you share instructions to reproduce what you are seeing?

How many blobs are you listing?
Standard or Premium Storage Account?
Size and location of the client VM? Same region as storage account?

mikeharder · 2022-12-06T22:03:43Z

@hholst80: It looks like you are listing 8934 blobs. In our testing, we can list this many blobs in about 2 seconds using the list_blobs() API, and in 1 second using the list_blob_names() API.

In our test, our container only contains 8934 blobs. If your container contains a lot more blobs (but only 8934 blobs starting with p10000/), it's possible the name filtering is making it slower. We can test with the exact number of blobs in your container.

Our config:

Storage Account: Premium BlockBlobStorage
Client VM: D4ds_v5 (4 vcpus, 16 GiB memory), same region as storage account
Python: 3.11.0

jalauzon-msft · 2022-12-06T22:49:40Z

Since you are seeing the same performance between list_blobs and list_blob_names, it means this is likely not related to client-side processing. This means most of the time is spent either in the backend processing or in networking.
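
One way to check this reasoning is to compare wall-clock time with CPU time for a listing call: if CPU time is only a small fraction of the elapsed time, most of the time is being spent waiting on the service or the network rather than on client-side parsing. The sketch below is an assumption-laden illustration; time_listing and slow_backend are hypothetical names, and slow_backend merely simulates a slow response with a sleep.

```python
import time
from typing import Callable, Iterable, Iterator, NamedTuple


class ListingTiming(NamedTuple):
    count: int
    wall_seconds: float
    cpu_seconds: float


def time_listing(listing: Callable[[], Iterable[object]]) -> ListingTiming:
    """Consume a listing and report item count, elapsed time and CPU time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    count = sum(1 for _ in listing())
    return ListingTiming(
        count,
        time.perf_counter() - wall_start,
        time.process_time() - cpu_start,
    )


def slow_backend() -> Iterator[str]:
    """Stand-in listing that mostly waits, like a slow service response."""
    time.sleep(0.2)
    yield from (f"blob-{i}" for i in range(100))


timing = time_listing(slow_backend)
print(timing)
```

Running the same measurement against list_blobs and list_blob_names on a real container would show whether the two differ mainly in CPU time or not at all.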

There are a number of other factors that can cause slower listing times in the backend:

How many total blobs in your container? You are listing 8934 but the backend will have to iterate every blob in the container so if there are a large number of blobs, this can affect perf.
Do you have blob soft-delete enabled with many soft-deleted objects or blob versioning enabled and many old versions of the blobs in your container? I believe both versions and soft-deleted blobs are still enumerated in the backend.
Do you have Hierarchical Namespace (HNS) enabled on your Storage account? Sometimes this can affect perf.

NOTE: This should probably be tracked as a separate issue.

mikeharder · 2022-12-06T22:51:28Z

@hholst80: Please open a new issue if you'd like us to investigate your scenario further.

hholst80 · 2022-12-07T07:29:36Z

> How many blobs are you listing?
>
> Standard or Premium Storage Account?
>
> Size and location of the client VM? Same region as storage account?

About 10 thousand files in the p10000 prefix. About one million in the container.
Standard Account tier.
VM in same region as Storage Account.

mikeharder · 2022-12-07T22:59:27Z

> About 10 thousand files in the p10000 prefix. About one million in the container. Standard Account tier. VM in same region as Storage Account.

In a standard storage account (with hierarchical namespace disabled), I created a container with 10k blobs named p10000/{guid}, and 990k blobs named {guid}.

Both list_blobs(name_starts_with="p10000/") and list_blob_names(name_starts_with="p10000/") take about 4 seconds, which is much faster than the 25-30s you reported.

hholst80 · 2022-12-08T18:54:57Z

I reported the performance variation to support. Thank you for checking. Our pre-prod system is much faster than our prod system for some reason (a similar number of blobs in the container).

xiangyan99 assigned xiafu-msft Jul 12, 2021

g2vinay added the pillar-performance label (the issue is related to performance, one of our core engineering pillars) Jul 13, 2021

ghost added the needs-team-attention label (Workflow: this issue needs attention from Azure service team or SDK team) Jul 13, 2021

annatisch mentioned this issue Oct 7, 2021: MQ Performance Improvements 2021 #21136 (Closed, 8 tasks)

lmazuel added this to the [2021] December milestone Oct 7, 2021

lmazuel assigned tasherif-msft Nov 12, 2021

tasherif-msft mentioned this issue Dec 1, 2021: Customer is asserting that V12 Storage SDK has slower performance that V2.1 sdk #9596 (Closed)

lmazuel modified the milestones: [2021] December, [2022] April Feb 18, 2022

jalauzon-msft assigned jalauzon-msft and unassigned xiafu-msft and tasherif-msft Apr 18, 2022

jalauzon-msft assigned vincenttran-msft May 11, 2022

lmazuel modified the milestones: [2022] April, [2022] June May 16, 2022

vincenttran-msft mentioned this issue Jul 15, 2022: How to speed up printing blob names in large blob container? #25243 (Closed)

pathumd mentioned this issue Jul 19, 2022: get_blob_names in Azure Python SDK #25287 (Closed)

jalauzon-msft mentioned this issue Aug 19, 2022: [Storage] Add list_blob_names API to Blob SDK #25747 (Merged)

jalauzon-msft closed this as completed Aug 31, 2022

mikeharder reopened this Dec 6, 2022

mikeharder closed this as completed Dec 6, 2022

github-actions bot locked and limited conversation to collaborators Apr 11, 2023
