list_blobs is slow when paging through large containers #11593

Closed · mockodin opened this issue May 21, 2020 · 11 comments

Labels: Client, customer-reported, question, Service Attention, Storage

mockodin commented May 21, 2020

list_blobs is slow when paging through a container holding millions of blobs, where prefix filtering is not viable (millions of GUIDs). Scanning a container of ~70M objects takes 8-12 hours.

Iterating a container for blob properties should allow for concurrent page calls. This should be either internally controlled/throttled (similar in concept to threaded azcopy downloads) or achieved by providing continuation_tokens in the initial response, rather than requiring each list to be consumed first.


ghost commented May 22, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.


xiafu-msft (Contributor) commented May 27, 2020

Hi @mockodin

Thanks for reaching out!
Did you know that you can use by_page() this way:

        # bsc is a BlobServiceClient
        # block 1: fetch the first page
        generator1 = bsc.list_containers(results_per_page=2).by_page()
        page1 = next(generator1)
        containers1 = list(page1)

        # block 2: continue from block 1's continuation token
        generator2 = bsc.list_containers(results_per_page=2).by_page(generator1.continuation_token)
        page2 = next(generator2)
        containers2 = list(page2)

You can submit each block as a task; then, every time a task finishes, you can grab the continuation_token and submit a new task (a sketch follows below).
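
A minimal sketch of that task-chaining pattern, assuming azure-storage-blob and a placeholder connection string; note the fetches stay sequential (each continuation token comes from the previous page), but each page's processing can overlap the next fetch:

from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import ContainerClient

# placeholders: fill in your own connection string and container name
container_client = ContainerClient.from_connection_string(
    conn_str="<conn string here>", container_name="<container name here>")

def fetch_page(token):
    # fetch one page starting at `token`; return its items and the next token
    pager = container_client.list_blobs(results_per_page=5000).by_page(token)
    return list(next(pager)), pager.continuation_token

def process(items):
    # placeholder for per-page work (e.g. recording name/size/tier)
    pass

with ThreadPoolExecutor() as pool:
    token = None
    while True:
        fetch = pool.submit(fetch_page, token)  # submit this block as a task
        items, token = fetch.result()           # when it finishes, grab the token
        pool.submit(process, items)             # process off-thread while the loop continues
        if not token:
            break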

Could you explain a bit more what you mean by "requiring the list to be consumed first"? Sorry, I didn't quite follow.

Let me know if this doesn't help!


mockodin commented May 29, 2020

@xiafu-msft
The example you gave is the method I'm using today (sample included below).

Answering and expanding slightly:
continuation_token is None until the paged result has been read (i.e. consumed), which I do via list(next(blobs)). Each iteration takes 5-10 seconds to complete. Requesting a smaller page size does not significantly change the per-page interval; if anything, it lengthens the overall processing time.

Code Example:

from azure.storage.blob import ContainerClient
from azure.core.exceptions import ClientAuthenticationError
from queue import Queue

conn_str = '<conn string here>'
container_name = '<container name here>'
Blobs = Queue()
continuation_token = None
container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
while True:
    try:
        blobs = container_client.list_blobs(results_per_page=5000).by_page(continuation_token=continuation_token)
        Blobs.put(list(next(blobs)))
        continuation_token = blobs.continuation_token
        if not continuation_token:
            break
    except ClientAuthenticationError:
        # credentials expired mid-scan: rebuild the client and retry the same page
        container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)

Ultimately my output is a list of blobs with each blob's size and tier, for reporting.
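
(For reference, those fields are available directly on each BlobProperties item once a page has been consumed; a hypothetical one-liner over a page taken from the Blobs queue above:)

# pull name, size, and tier from one consumed page of BlobProperties
rows = [(b.name, b.size, b.blob_tier) for b in Blobs.get()]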


xiafu-msft commented Jun 2, 2020

Hi @mockodin
Thanks for reporting this; we found something we can optimize on the SDK side. Currently the SDK walks the full list of blobs to transform the raw response into customer-facing classes when next(blobs) is called, although that shouldn't increase the processing time by much.

For now, you can optimize by calling only next(blobs) instead of list(next(blobs)), which avoids one extra round of iteration; see if that helps:

from collections import deque

queued_pages = deque()

def task1():
    # container is a ContainerClient; fetch pages without iterating their items
    continuation_token = ""
    while continuation_token is not None:
        blobs = container.list_blobs(results_per_page=2).by_page(continuation_token=continuation_token)
        page = next(blobs)
        queued_pages.append(page)
        continuation_token = blobs.continuation_token

def task2():
    ...  # consume queued_pages

# then submit these two tasks in two threads

Hopefully this will help performance; if it's still slow, the async SDK could help a bit!
Let me know if there's anything I can help with.


mockodin commented Jun 3, 2020

Hi @xiafu-msft

I tested the above method; listing 50K blobs consistently takes ~40-45 seconds, irrespective of page size. ~99.9847% of the processing time is spent in the next() operation.

Example:

from azure.storage.blob import ContainerClient
from collections import deque
import time

def GetBlobInfo(Size, BatchSize):
    BlobBatch = deque()
    Start = time.time()
    Count = 0
    inc = 0
    conn_str='<conn string here>'
    container_name='container_name'
    container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
    continuation_token = ""
    while continuation_token is not None:
        print("1: %s" % str(time.time()-Start))
        blobs = container_client.list_blobs(results_per_page=Size).by_page(continuation_token=continuation_token)
        print("2: %s" % str(time.time()-Start))
        page = next(blobs)
        print("3: %s" % str(time.time()-Start))
        BlobBatch.append(page)
        continuation_token = blobs.continuation_token
        print("4: %s" % str(time.time()-Start))
        Count += Size
        inc += 1
        print("5: %s" % str(time.time()-Start))
        if not continuation_token or Count == BatchSize:
            End = time.time()
            print("Inc = %s, Total = %s : %s seconds ( %s second avg loop) " % (Size, BatchSize, End-Start, (End-Start)/inc))
            break
        print("6: %s - %s" % (str(time.time()-Start), str(Count)))

#Get 10000 blobs in batches of 5000
>>> GetBlobInfo(5000,10000)
1: 0.0006246566772460938
2: 0.0007045269012451172
3: 4.607837677001953
4: 4.607925176620483
5: 4.607938289642334
6: 4.607948303222656 - 5000
1: 4.607959270477295
2: 4.608181715011597
3: 6.717451333999634
4: 6.717547655105591
5: 6.7175612449646
Inc = 5000, Total = 10000 : 6.717571496963501 seconds ( 3.3587857484817505 second avg loop)

Moving the next() operation out to be handled in parallel threads would have a huge performance impact. But since next() is what generates the continuation_token, I think it will remain the choke point.

You mentioned async, which I did start looking at (for transparency, I'm new to it). I was able to run a simple test, shown below, whose performance was consistent with the non-async version. I think that's probably expected for this example, since next() is still effectively occurring in the background.

import asyncio
import time
from azure.storage.blob.aio import ContainerClient

conn_str = '<conn string here>'
container_name = 'container_name'

async def test():
    start = time.time()
    blob_list = []
    # create (and close) the client inside the event loop
    async with ContainerClient.from_connection_string(
            conn_str=conn_str, container_name=container_name) as container:
        async for blob in container.list_blobs():
            blob_list.append(blob)
            if len(blob_list) == 50000:
                break
    print('runtime: %s' % str(time.time() - start))

asyncio.run(test())

I'm also trying by_page under async, but I'm getting "TypeError: 'AsyncList' object is not an iterator" with the example below. Any references/links to examples would be useful.

from azure.storage.blob.aio import ContainerClient
import asyncio

async def ReadContainerBlobs():
    continuation_token = ""
    conn_str = '<conn string here>'
    container_name = 'container_name'
    size = 5000
    container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
    while continuation_token is not None:
        pager = container_client.list_blobs(results_per_page=size).by_page(continuation_token=continuation_token)
        async for blobbatch in pager:
            page = await next(blobbatch)  # <- raises TypeError: 'AsyncList' object is not an iterator
            # blobbatches.append(page)  # just testing, so we won't keep the data
            continuation_token = pager.continuation_token
        break  # just testing, so exit the while loop after one page
    await container_client.close()

asyncio.run(ReadContainerBlobs())

Async looks like it could provide the functionality I attempted to build with threading and multiprocessing, though perhaps more natively and efficiently... if I can get it to work.

@xiafu-msft (Contributor)

Hi @mockodin

continuation_token is returned by the service to indicate where to continue. next() fetches the next page of items (in this case 5000 items) and gets the continuation token; it's impossible to get a continuation token without sending the request first.
Most of the time is spent waiting for the service response, so you can use that waiting time to process each BlobBatch.
What we can optimize a bit: after getting the service response, we could return the continuation token without parsing the raw blob items first. But that won't help much, because the time is mainly spent waiting on the service.
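
A minimal sketch of using that waiting time, assuming container_client is an azure.storage.blob.ContainerClient built as in the earlier examples: one thread fetches pages in token order while the main thread drains a queue, so item parsing overlaps the next HTTP round-trip.

import queue
import threading

# container_client is assumed to be an azure.storage.blob.ContainerClient
pages = queue.Queue(maxsize=4)  # bounded, so the fetcher can't run far ahead

def fetcher():
    token = None
    while True:
        pager = container_client.list_blobs(results_per_page=5000).by_page(token)
        pages.put(next(pager))            # hand off the page without iterating it
        token = pager.continuation_token
        if not token:
            break
    pages.put(None)                       # sentinel: no more pages

def consumer():
    while True:
        page = pages.get()
        if page is None:
            break
        for blob in page:                 # item parsing happens here, overlapped
            pass                          # with the next HTTP round-trip

threading.Thread(target=fetcher, daemon=True).start()
consumer()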

For the async SDK, iterating by page works this way:

async_page_iterator = container_client.list_blobs(results_per_page=Size).by_page(continuation_token=continuation_token)

next_page = await async_page_iterator.__anext__()

blobs = []
async for blob in next_page:
    blobs.append(blob)
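
Putting that together, a fuller sketch of async paging (the connection string and container name are placeholders, and read_container_blobs and the print call are just for illustration):

import asyncio
from azure.storage.blob.aio import ContainerClient

async def read_container_blobs():
    # placeholders: fill in your own connection string and container name
    async with ContainerClient.from_connection_string(
            conn_str="<conn string here>",
            container_name="<container name here>") as client:
        token = None
        while True:
            pager = client.list_blobs(results_per_page=5000).by_page(token)
            page = await pager.__anext__()      # fetch one page
            async for blob in page:             # then iterate its items
                print(blob.name, blob.size, blob.blob_tier)
            token = pager.continuation_token
            if not token:
                break

asyncio.run(read_container_blobs())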

@xiafu-msft (Contributor)

Hi @mockodin

Do you have any other concerns or questions about this issue?

@amishra-dev

Please reactivate if you have further questions or concerns.

openapi-sdkautomation bot referenced this issue in a commit to AzureSDKAutomation/azure-sdk-for-python on Nov 24, 2020: "Fix s360 kpis for 2020-03-01. (Azure#11593)".

wglane commented Dec 13, 2021

@xiafu-msft

Hi Xiaoxi,

I'm also running into this problem. Paging through the blobs (using results_per_page=5000) is still quite slow, about two seconds per page. I'm already processing the blobs asynchronously, but is there any way to speed up the paging itself?

@alex-schmid-zeiss

(quoting @xiafu-msft's earlier reply on async by_page iteration with __anext__)

Hey @xiafu-msft, I have tried your approach, but unfortunately I'm getting an error: "'BlobPropertiesPaged' object has no attribute '__anext__'". Can you help me?


vholmer commented Jul 5, 2022

@alex-schmid-zeiss unless you already figured this out: you need to import from azure.storage.blob.aio instead of just azure.storage.blob.
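
(A minimal sketch of the distinction, with import aliases invented just for illustration:)

# sync client: by_page() yields a plain iterator, so use next()
from azure.storage.blob import ContainerClient as SyncContainerClient

# async client (note the .aio package): by_page() yields an async
# iterator, so use `await ...__anext__()` or `async for`
from azure.storage.blob.aio import ContainerClient as AsyncContainerClient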

github-actions bot locked and limited conversation to collaborators Apr 12, 2023