Listing blobs names is very slow #19755

Closed
jochen-ott-by opened this issue Jul 12, 2021 · 24 comments
Labels: bug, Client, customer-reported, needs-team-attention, pillar-performance, Service Attention, Storage

@jochen-ott-by

  • Package Name: azure-storage-blob
  • Package Version: 12.8.1
  • Operating System: linux (Debian 9)
  • Python Version: 3.8

Describe the bug

For containers with many blobs, listing blob names is slow and uses a lot of CPU, whereas it was fast in azure-storage-blob 2.1.0.

To Reproduce
Steps to reproduce the behavior:

  1. Create a blob storage container with at least 5000 blobs (i.e. maxresults for a single page returned by the List Blobs API).
  2. Use list_blob_names from azure-storage-blob 2.1.0 to list the blob names in this container and note the CPU time it takes (on my machine, 376ms).
  3. Use azure-storage-blob 12.8.1. Unfortunately, it does not have a list_blob_names function, so I have to use list_blobs and access blob.name on each result. Again, note the CPU time this takes (on my machine, 2760ms).
  4. Compare the CPU times from steps 2 and 3 and note the large factor (more than 7 in my case).
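The measurement in step 3 could be sketched roughly as follows. This is only an illustration, not the reporter's actual script: the helper names time_name_listing and list_with_v12 are made up here, and the connection string and container name are placeholders you would supply yourself.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_name_listing(list_items: Callable[[], Iterable]) -> Tuple[List[str], float]:
    """Return (names, cpu_seconds) for one full pass over a blob listing."""
    # process_time() counts CPU time only, so time spent waiting on the
    # network is excluded from the result.
    start = time.process_time()
    names = [item.name for item in list_items()]
    return names, time.process_time() - start


def list_with_v12(conn_str: str, container: str) -> Tuple[List[str], float]:
    # Hypothetical helper; requires azure-storage-blob 12.x. The import is
    # local so time_name_listing stays usable without the SDK installed.
    from azure.storage.blob import ContainerClient

    client = ContainerClient.from_connection_string(conn_str, container)
    return time_name_listing(client.list_blobs)
```

Calling list_with_v12 with real credentials returns the names together with the CPU seconds spent producing them, which is the figure compared in step 4.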

Expected behavior

There should be a way of listing blob names in azure-storage-blob 12.X with performance similar to azure-storage-blob 2.X.

Additional context

This might not be relevant for containers with a few thousand blobs. However, we have containers with a few hundred thousand to a million blobs, and bookkeeping operations that rely on listing the blob content. These used to consume a little more than 1 minute of CPU time with azure-storage-blob 2.X and now take almost 10 minutes, which contributes significantly to the runtime of these tasks.
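As a back-of-the-envelope check, the per-page timings from the reproduction steps can be extrapolated to a container with one million blobs. The sketch assumes CPU time scales linearly with the number of 5000-blob pages, which is a simplification; the function name is made up for illustration.

```python
import math

PAGE_SIZE = 5000          # maxresults per List Blobs page (from the report)
CPU_PER_PAGE_V2 = 0.376   # seconds per page, azure-storage-blob 2.1.0
CPU_PER_PAGE_V12 = 2.760  # seconds per page, azure-storage-blob 12.8.1


def estimated_cpu_seconds(total_blobs: int, cpu_per_page: float) -> float:
    """Estimate total CPU time, assuming linear scaling with page count."""
    pages = math.ceil(total_blobs / PAGE_SIZE)
    return pages * cpu_per_page


v2 = estimated_cpu_seconds(1_000_000, CPU_PER_PAGE_V2)
v12 = estimated_cpu_seconds(1_000_000, CPU_PER_PAGE_V12)
print(f"2.1.0: {v2:.0f} s, 12.8.1: {v12 / 60:.1f} min, ratio {v12 / v2:.1f}x")
```

For one million blobs (200 pages) this gives roughly 75 seconds versus about 9.2 minutes, which lines up with the "a little more than 1 minute" and "almost 10 minutes" figures above.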

azure-storage-blob 2.X was affected by a similar problem which has been addressed via Azure/azure-storage-python#545
See also additional context there, in particular the use cases listed in Azure/azure-storage-python#545 (comment)

I would be willing to contribute a corresponding patch. However, it seems azure-storage-blob 12.X uses a different way to deserialize the XML response that makes it harder to customize deserialization for azure-storage-blob 12.X, compared to 2.X.

@ghost added the needs-triage, customer-reported, and question labels on Jul 12, 2021
@xiangyan99 added the bug, Client, Service Attention, and Storage labels and removed the needs-triage and question labels on Jul 12, 2021
@ghost

ghost commented Jul 12, 2021

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

@xiangyan99
Member

Thanks for the feedback, we’ll investigate asap.

@jochen-ott-by
Author

To elaborate on my last point ("it seems azure-storage-blob 12.X uses a different way to deserialize the XML response that makes it harder to customize deserialization for azure-storage-blob 12.X, compared to 2.X"), I created a hacky patch that implements list_blob_names for azure-storage-blob 12.X. It works by patching the deserialization code to extract only the name:

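from contextlib import contextmanager
from unittest import mock

import msrest.serialization


class BlobItemNameOnly(msrest.serialization.Model):
    """Stand-in for BlobItemInternal that deserializes only the blob name."""

    _attribute_map = {
        'name': {'key': 'Name', 'type': 'str'},
    }
    _xml_map = {
        'name': 'Blob'
    }

    def __init__(self, *, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name


@contextmanager
def _patch_blob_deserializer(container_client):
    # Temporarily replace the generated BlobItemInternal model with the
    # name-only model; the original entry is restored on exit.
    with mock.patch.dict(
        container_client._client._deserialize.dependencies,
        {"BlobItemInternal": BlobItemNameOnly},
    ):
        yield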

With this patch active, container_client.list_blobs would return instances of BlobItemNameOnly, and parsing a single page of XML results of around 5000 blobs is down from 2.76s of CPU time to 0.48s.
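As a rough illustration of the mechanism this patch relies on, the sketch below applies mock.patch.dict to a plain dictionary. The registry and both model classes are hypothetical stand-ins for illustration only, not SDK objects; the point is just that the replacement is visible inside the with block and undone afterwards.

```python
from contextlib import contextmanager
from unittest import mock


class FullModel:
    """Hypothetical stand-in for the generated BlobItemInternal model."""


class NameOnlyModel:
    """Hypothetical stand-in for a name-only model."""


# Hypothetical registry playing the role of `_deserialize.dependencies`.
dependencies = {"BlobItemInternal": FullModel}


@contextmanager
def patched_model(registry, replacement):
    # patch.dict restores the original mapping when the block exits,
    # even if the body raises.
    with mock.patch.dict(registry, {"BlobItemInternal": replacement}):
        yield


with patched_model(dependencies, NameOnlyModel):
    inside = dependencies["BlobItemInternal"]  # NameOnlyModel while patched
after = dependencies["BlobItemInternal"]       # FullModel again afterwards
```

Because list_blobs returns a pager that fetches and deserializes pages lazily, the iteration over the results presumably has to happen inside the with block for the name-only model to take effect.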

@g2vinay added the pillar-performance label on Jul 13, 2021
@ghost added the needs-team-attention label on Jul 13, 2021
@annatisch
Member

Thanks @jochen-ott-by! I found your patch very interesting!

It's true that the current implementation deserializes the entire payload without the ability to customize it. Additionally, compared with the default API version used in SDK 2.x, newer API versions return more data in the listing results, which slows it down further.

We have had discussions about replacing the current (de)serialization implementation - however this will be a fairly long term project.

@xiafu-msft - This sounds like it might be worth bringing up across languages, as I would think the deserialization of the full listing payload might be an issue that affects all the Blob SDKs. It could be worth looking at reimplementing the "list only names" feature - I don't think the service supports a select feature here? So I guess it would be a client-side implementation of select?

@annatisch
Member

After doing a bit more digging, there are a number of strategies we could invest in here.

  1. We start the process of migrating to a new XML deserialization pipeline that is more efficient than the code provided in the msrest lib. This will ultimately happen, however it's not a quick solution as this will take some time. Additionally, it doesn't resolve the more immediate issue of eagerly deserializing the entire payload.
  2. We look at whether we can 'lazily' deserialize the XML. This would also not be a quick fix, and would need some major internal rewiring to support. It also changes the error behaviour somewhat - now an invalid model would only be detected when its content was accessed rather than when it was received. So I'm somewhat doubtful that this is the right way to go. We'd also want to discuss this one across languages.
  3. We implement some kind of client-side 'select' feature - where we only deserialize specific fields in the response. This could be surfaced as an option to the current list_blobs API, or as an entirely new API. This would be the quickest solution to the problem - and while it doesn't improve all-round performance, it should better enable this scenario, and gives us more time to properly design option 1.
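To make option 3 more concrete, a client-side "names only" pass over a List Blobs response might look roughly like the sketch below. It is not the SDK's implementation: it uses the standard library's xml.etree.ElementTree, the function name is made up, and the sample payload is a hypothetical, heavily trimmed response body.

```python
import io
import xml.etree.ElementTree as ET
from typing import Iterator

# Hypothetical, heavily trimmed List Blobs response body.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ContainerName="demo">
  <Blobs>
    <Blob><Name>a.txt</Name><Properties><Content-Length>1</Content-Length></Properties></Blob>
    <Blob><Name>dir/b.txt</Name><Properties><Content-Length>2</Content-Length></Properties></Blob>
  </Blobs>
  <NextMarker />
</EnumerationResults>"""


def iter_blob_names(payload: bytes) -> Iterator[str]:
    """Yield only the <Name> of each <Blob>, ignoring all other fields."""
    for _event, elem in ET.iterparse(io.BytesIO(payload), events=("end",)):
        if elem.tag == "Blob":
            name = elem.findtext("Name")
            if name is not None:
                yield name
            elem.clear()  # drop the subtree that is no longer needed


names = list(iter_blob_names(SAMPLE))
```

The saving comes from never building model objects for properties, metadata, or tags; only the name text of each blob element is read.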

Adding @tg-msft, @kasobol-msft and @mikeharder for their thoughts. Option 1 would only impact Python, however options 2 and 3 would probably need some cross-language consistency.

@annatisch
Member

Thanks for your patience @jochen-ott-by!
We've been chatting about the best approach, and are considering tackling this by addressing both options 1 & 3 in my post above (where option 3 takes the form of a separate list_names API, similar to the v2 SDK).

I currently have a working prototype in development here that we have been running perf tests on:
#19814

The numbers are looking promising, with improvements to listing in general, as well as providing the "names-only" deserialization shortcut. There's still a fair amount of work to be done to get these strategies "production-ready", as they dig quite deep into the HTTP pipeline code, and will need thorough testing - so I cannot give you a concrete timeframe when they will land in a release at this point.

We will keep the thread open and updated as we progress.
Thanks again for the report!

@amishra-dev

@tasherif-msft can you sync with @annatisch about this

@tasherif-msft
Contributor

Here's the POC #19814
This will probably take a while so we are looking to ship a workaround to improve the perf in the meantime. I will update you soon.

@lmazuel lmazuel modified the milestones: [2021] December, [2022] April Feb 18, 2022
@amishra-dev

@tasherif-msft what is the latest on this?

@tasherif-msft
Contributor

@amishra-dev the change on core is substantial and will take several iterations, as @annatisch has stated. To sidestep this in the meantime, we decided we can implement our own deserializing logic on our layer. @jalauzon-msft have you had a chance to investigate handling this deserialization on our layer?

@mikeharder
Member

@tasherif-msft: Is adding list_blob_names back to the SDK still under consideration?

@b-c-lucas

Originally reported in #11593

@jalauzon-msft
Member

Apologies for the long delay here, we've been busy with other higher priority work but hope to be getting to this soon.

To update, we are planning to add get_blob_names as a new API when we can. This will be a version of the list blobs API that only lists blob names. It will be significantly faster than the full list_blobs and will hopefully come close to or surpass the performance of the older version of the library. We will start from Anna's Draft PR and extract just the pieces necessary for get_blob_names. This means we will start with custom XML parsing only for this new API and will introduce a small subset of the full custom XML parsing to support this scenario.

azure-sdk pushed a commit to azure-sdk/azure-sdk-for-python that referenced this issue Jul 13, 2022
Revert "Adding status code 202 to Private endpoints PUT (Azure#19125)" (Azure#19755)

This reverts commit 5e7603d4591ae39f9c2cedea75c8d97185e0aab2.
@jalauzon-msft
Member

I'm happy to share that we have finally opened #25747 to add list_blob_names to the Track2 Blob SDK. This API, like the Track1 equivalent, will call the standard List Blobs API but only parse and return the blob names, which results in a significant speedup over the traditional list_blobs API when only names are desired.

Some initial perf testing results show this API is 1.5-14 times faster than the existing list_blobs depending on the number of blobs in the container. (The more blobs the larger the performance increase). See comment in #25747 for details. Specifically for the OP's case of 5000 blobs, we expect somewhere around 8-10x improvement.

We will work to get this merged and released ASAP. Thanks all for your patience and apologies for the long delay on getting this in.

@jalauzon-msft
Member

Hi @jochen-ott-by and others, the list_blob_names API was released yesterday in a beta release, 12.14.0b2! Please feel free to try out the beta and provide any feedback on the API. We are planning to do a full release that will include list_blob_names sometime in September/early October.
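For code that has to run against both older and newer 12.x releases, one possible pattern is to prefer list_blob_names when the client offers it and fall back to list_blobs otherwise. This is only a sketch: collect_blob_names is a made-up helper, and it assumes both methods accept the name_starts_with keyword used elsewhere in this thread.

```python
from typing import List, Optional


def collect_blob_names(container_client, prefix: Optional[str] = None) -> List[str]:
    """Return blob names, using list_blob_names when the client provides it."""
    kwargs = {"name_starts_with": prefix} if prefix else {}
    if hasattr(container_client, "list_blob_names"):
        # Available from 12.14.0b2 onward according to this thread.
        return list(container_client.list_blob_names(**kwargs))
    # Older 12.x clients: fall back to full deserialization of each blob.
    return [blob.name for blob in container_client.list_blobs(**kwargs)]
```

Here container_client is expected to be an azure.storage.blob.ContainerClient; a call such as collect_blob_names(client, "p10000/") would return the names under that prefix as a plain list.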

Since this is merged and released, I'm going to close this issue. Thanks!

@hholst80

hholst80 commented Nov 18, 2022

I have quite a few blobs in my container, but I do not see any performance improvement.

Please advise.

image

Update: from inside Azure networking to a Storage account in the same account. Terrible performance.

image

@mikeharder
Member

@hholst80: Can you share instructions to reproduce what you are seeing?

  • How many blobs are you listing?
  • Standard or Premium Storage Account?
  • Size and location of the client VM? Same region as storage account?

@mikeharder mikeharder reopened this Dec 6, 2022
@mikeharder
Member

@hholst80: It looks like you are listing 8934 blobs. In our testing, we can list this many blobs in about 2 seconds using the list_blobs() API, and in 1 second using the list_blob_names() API.

In our test, our container only contains 8934 blobs. If your container contains a lot more blobs (but only 8934 blobs starting with p10000/), it's possible the name filtering is making it slower. We can test with the exact number of blobs in your container.

Our config:

  • Storage Account: Premium BlockBlobStorage
  • Client VM: D4ds_v5 (4 vcpus, 16 GiB memory), same region as storage account
  • Python: 3.11.0

@jalauzon-msft
Member

jalauzon-msft commented Dec 6, 2022

Since you are seeing the same performance between list_blobs and list_blob_names, it means this is likely not related to client-side processing. This means most of the time is spent either in the backend processing or in networking.
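One way to check where the time goes is to compare wall-clock time with CPU time for the same listing: if wall time is much larger than CPU time, the client is mostly waiting on the network or the service. The sketch below is an illustration under that assumption; fake_slow_backend is a hypothetical stand-in for a real listing call.

```python
import time
from typing import Callable, Iterable, Tuple


def wall_and_cpu(list_items: Callable[[], Iterable]) -> Tuple[float, float]:
    """Return (wall_seconds, cpu_seconds) for draining one listing."""
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    for _ in list_items():
        pass
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def fake_slow_backend():
    # Hypothetical stand-in for a listing dominated by waiting on the service.
    time.sleep(0.2)
    yield from ("a", "b")


wall, cpu = wall_and_cpu(fake_slow_backend)
print(f"wall={wall:.2f}s cpu={cpu:.2f}s")
```

With a real client, passing a callable such as lambda: client.list_blob_names(name_starts_with="p10000/") in place of fake_slow_backend would give the same two numbers for the actual listing.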

There are a number of other factors that can cause slower listing times in the backend:

  • How many total blobs are in your container? You are listing 8934, but the backend will have to iterate over every blob in the container, so if there are a large number of blobs, this can affect perf.
  • Do you have blob soft-delete enabled with many soft-deleted objects or blob versioning enabled and many old versions of the blobs in your container? I believe both versions and soft-deleted blobs are still enumerated in the backend.
  • Do you have Hierarchical Namespace (HNS) enabled on your Storage account? Sometimes this can affect perf.

NOTE: This should probably be tracked as a separate issue.

@mikeharder
Member

@hholst80: Please open a new issue if you'd like us to investigate your scenario further.

@hholst80

hholst80 commented Dec 7, 2022

  • How many blobs are you listing?
  • Standard or Premium Storage Account?
  • Size and location of the client VM? Same region as storage account?

About 10 thousand files in the p10000 prefix. About one million in the container.
Standard account tier.
VM in same region as Storage Account.

@mikeharder
Member

About 10 thousand files in the p10000 prefix. About one million in the container. Standard account tier. VM in same region as Storage Account.

In a standard storage account (with hierarchical namespace disabled), I created a container with 10k blobs named p10000/{guid}, and 990k blobs named {guid}.

Both list_blobs(name_starts_with="p10000/") and list_blob_names(name_starts_with="p10000/") take about 4 seconds, which is much faster than the 25-30s you reported.

@hholst80

hholst80 commented Dec 8, 2022

I reported the performance variation to support. Thank you for checking. Our pre-prod system is much faster than our prod system for some reason (similar number of blobs in the container).

@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023