list_blobs is slow when paging through large containers #11593

Closed · mockodin opened this issue May 21, 2020 · 11 comments

Labels: Client, customer-reported, question, Service Attention, Storage

mockodin commented May 21, 2020

list_blobs is slow when paging through a container holding millions of blobs, where prefix filtering is not viable (millions of GUIDs). Scanning a container of ~70M objects takes 8-12 hours.

Iterating a container for blob properties should allow for concurrent page calls. This should be either internally controlled/throttled (similar in concept to threaded azcopy downloads) or achieved by providing continuation_tokens in the initial response, rather than requiring each list to be consumed first.


ghost commented May 22, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.


xiafu-msft (Contributor) commented May 27, 2020

Hi @mockodin

Thanks for reaching out!
Did you know that you can use by_page() this way:

        # bsc is a BlobServiceClient
        # block 1: fetch the first page
        generator1 = bsc.list_containers(results_per_page=2).by_page()
        page1 = next(generator1)
        containers1 = list(page1)

        # block 2: continue from block 1's continuation token
        generator2 = bsc.list_containers(results_per_page=2).by_page(generator1.continuation_token)
        page2 = next(generator2)
        containers2 = list(page2)

You can submit each block as a task; then, every time a task finishes, you can grab the continuation_token and submit a new task (a sketch follows below).
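
A minimal sketch of that task-chaining pattern, assuming azure-storage-blob and a placeholder connection string; note the fetches stay sequential (each continuation token comes from the previous page), but each page's processing can overlap the next fetch:

from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import ContainerClient

# placeholders: fill in your own connection string and container name
container_client = ContainerClient.from_connection_string(
    conn_str="<conn string here>", container_name="<container name here>")

def fetch_page(token):
    # fetch one page starting at `token`; return its items and the next token
    pager = container_client.list_blobs(results_per_page=5000).by_page(token)
    return list(next(pager)), pager.continuation_token

def process(items):
    # placeholder for per-page work (e.g. recording name/size/tier)
    pass

with ThreadPoolExecutor() as pool:
    token = None
    while True:
        fetch = pool.submit(fetch_page, token)  # submit this block as a task
        items, token = fetch.result()           # when it finishes, grab the token
        pool.submit(process, items)             # process off-thread while the loop continues
        if not token:
            break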

Could you explain a bit more what you mean by "requiring the list to be consumed first"? Sorry, I didn't quite follow.

Let me know if this doesn't help!


mockodin commented May 29, 2020

@xiafu-msft
The example you gave is the method I'm using today (sample included below).

Answering and expanding slightly:
continuation_token is None until the paged result has been read (i.e. consumed), which I do via list(next(blobs)). Each iteration takes 5-10 seconds to complete. Requesting a smaller page size does not significantly change the per-page interval; if anything, it lengthens the overall processing time.

Code Example:

from azure.storage.blob import ContainerClient
from azure.core.exceptions import ClientAuthenticationError
from queue import Queue

conn_str = '<conn string here>'
container_name = '<container name here>'
Blobs = Queue()
continuation_token = None
container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
while True:
    try:
        blobs = container_client.list_blobs(results_per_page=5000).by_page(continuation_token=continuation_token)
        Blobs.put(list(next(blobs)))
        continuation_token = blobs.continuation_token
        if not continuation_token:
            break
    except ClientAuthenticationError:
        # credentials expired mid-scan: rebuild the client and retry the same page
        container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)

Ultimately my output is a list of blobs with each blob's size and tier, for reporting.
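
(For reference, those fields are available directly on each BlobProperties item once a page has been consumed; a hypothetical one-liner over a page taken from the Blobs queue above:)

# pull name, size, and tier from one consumed page of BlobProperties
rows = [(b.name, b.size, b.blob_tier) for b in Blobs.get()]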


xiafu-msft commented Jun 2, 2020

Hi @mockodin
Thanks for reporting this; we found something we can optimize on the SDK side. Currently the SDK walks the full list of blobs to transform the raw response into customer-facing classes when next(blobs) is called, although that shouldn't increase the processing time by much.

For now, you can optimize by calling only next(blobs) instead of list(next(blobs)), which avoids one extra round of iteration; see if that helps:

from collections import deque

queued_pages = deque()

def task1():
    # container is a ContainerClient; fetch pages without iterating their items
    continuation_token = ""
    while continuation_token is not None:
        blobs = container.list_blobs(results_per_page=2).by_page(continuation_token=continuation_token)
        page = next(blobs)
        queued_pages.append(page)
        continuation_token = blobs.continuation_token

def task2():
    ...  # consume queued_pages

# then submit these two tasks in two threads

Hopefully this will help performance; if it's still slow, the async SDK could help a bit!
Let me know if there's anything I can help with.


mockodin commented Jun 3, 2020

Hi @xiafu-msft

I tested the above method; listing 50K blobs consistently takes ~40-45 seconds, irrespective of page size. ~99.9847% of the processing time is spent in the next() operation.

Example:

from azure.storage.blob import ContainerClient
from collections import deque
import time

def GetBlobInfo(Size, BatchSize):
    BlobBatch = deque()
    Start = time.time()
    Count = 0
    inc = 0
    conn_str='<conn string here>'
    container_name='container_name'
    container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
    continuation_token = ""
    while continuation_token is not None:
        print("1: %s" % str(time.time()-Start))
        blobs = container_client.list_blobs(results_per_page=Size).by_page(continuation_token=continuation_token)
        print("2: %s" % str(time.time()-Start))
        page = next(blobs)
        print("3: %s" % str(time.time()-Start))
        BlobBatch.append(page)
        continuation_token = blobs.continuation_token
        print("4: %s" % str(time.time()-Start))
        Count += Size
        inc += 1
        print("5: %s" % str(time.time()-Start))
        if not continuation_token or Count == BatchSize:
            End = time.time()
            print("Inc = %s, Total = %s : %s seconds ( %s second avg loop) " % (Size, BatchSize, End-Start, (End-Start)/inc))
            break
        print("6: %s - %s" % (str(time.time()-Start), str(Count)))

#Get 10000 blobs in batches of 5000
>>> GetBlobInfo(5000,10000)
1: 0.0006246566772460938
2: 0.0007045269012451172
3: 4.607837677001953
4: 4.607925176620483
5: 4.607938289642334
6: 4.607948303222656 - 5000
1: 4.607959270477295
2: 4.608181715011597
3: 6.717451333999634
4: 6.717547655105591
5: 6.7175612449646
Inc = 5000, Total = 10000 : 6.717571496963501 seconds ( 3.3587857484817505 second avg loop)

Moving the next() operation out to be handled in parallel threads would have a huge performance impact. But since next() is what generates the continuation_token, I think it will remain the choke point.

You mentioned async, which I did start looking at (for transparency, I'm new to it). I was able to run a simple test, shown below, whose performance was consistent with the non-async version. I think that's probably expected for this example, since next() is still effectively occurring in the background.

import asyncio
import time
from azure.storage.blob.aio import ContainerClient

conn_str = '<conn string here>'
container_name = 'container_name'

async def test():
    start = time.time()
    blob_list = []
    # create (and close) the client inside the event loop
    async with ContainerClient.from_connection_string(
            conn_str=conn_str, container_name=container_name) as container:
        async for blob in container.list_blobs():
            blob_list.append(blob)
            if len(blob_list) == 50000:
                break
    print('runtime: %s' % str(time.time() - start))

asyncio.run(test())

I'm also trying by_page under async, but I'm getting "TypeError: 'AsyncList' object is not an iterator" with the example below. Any references/links to examples would be useful.

from azure.storage.blob.aio import ContainerClient
import asyncio

async def ReadContainerBlobs():
    continuation_token = ""
    conn_str = '<conn string here>'
    container_name = 'container_name'
    size = 5000
    container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
    while continuation_token is not None:
        pager = container_client.list_blobs(results_per_page=size).by_page(continuation_token=continuation_token)
        async for blobbatch in pager:
            page = await next(blobbatch)  # <- raises TypeError: 'AsyncList' object is not an iterator
            # blobbatches.append(page)  # just testing, so we won't keep the data
            continuation_token = pager.continuation_token
        break  # just testing, so exit the while loop after one page
    await container_client.close()

asyncio.run(ReadContainerBlobs())

Async looks like it could provide the functionality I attempted to build with threading and multiprocessing, though perhaps more natively and efficiently... if I can get it to work.

@xiafu-msft (Contributor)

Hi @mockodin

continuation_token is returned by the service to indicate where to continue. next() fetches the next page of items (in this case 5000 items) and gets the continuation token; it's impossible to get a continuation token without sending the request first.
Most of the time is spent waiting for the service response, so you can use that waiting time to process each BlobBatch.
What we can optimize a bit: after getting the service response, we could return the continuation token without parsing the raw blob items first. But that won't help much, because the time is mainly spent waiting on the service.
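
A minimal sketch of using that waiting time, assuming container_client is an azure.storage.blob.ContainerClient built as in the earlier examples: one thread fetches pages in token order while the main thread drains a queue, so item parsing overlaps the next HTTP round-trip.

import queue
import threading

# container_client is assumed to be an azure.storage.blob.ContainerClient
pages = queue.Queue(maxsize=4)  # bounded, so the fetcher can't run far ahead

def fetcher():
    token = None
    while True:
        pager = container_client.list_blobs(results_per_page=5000).by_page(token)
        pages.put(next(pager))            # hand off the page without iterating it
        token = pager.continuation_token
        if not token:
            break
    pages.put(None)                       # sentinel: no more pages

def consumer():
    while True:
        page = pages.get()
        if page is None:
            break
        for blob in page:                 # item parsing happens here, overlapped
            pass                          # with the next HTTP round-trip

threading.Thread(target=fetcher, daemon=True).start()
consumer()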

For the async SDK, iterating by page works this way:

async_page_iterator = container_client.list_blobs(results_per_page=Size).by_page(continuation_token=continuation_token)

next_page = await async_page_iterator.__anext__()

blobs = []
async for blob in next_page:
    blobs.append(blob)
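
Putting that together, a fuller sketch of async paging (the connection string and container name are placeholders, and read_container_blobs and the print call are just for illustration):

import asyncio
from azure.storage.blob.aio import ContainerClient

async def read_container_blobs():
    # placeholders: fill in your own connection string and container name
    async with ContainerClient.from_connection_string(
            conn_str="<conn string here>",
            container_name="<container name here>") as client:
        token = None
        while True:
            pager = client.list_blobs(results_per_page=5000).by_page(token)
            page = await pager.__anext__()      # fetch one page
            async for blob in page:             # then iterate its items
                print(blob.name, blob.size, blob.blob_tier)
            token = pager.continuation_token
            if not token:
                break

asyncio.run(read_container_blobs())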

@xiafu-msft (Contributor)

Hi @mockodin

Do you have any other concerns or questions about this issue?

@amishra-dev

Please reactivate if you have further questions or concerns.

openapi-sdkautomation bot referenced this issue in a commit to AzureSDKAutomation/azure-sdk-for-python on Nov 24, 2020: "Fix s360 kpis for 2020-03-01. (Azure#11593)".

wglane commented Dec 13, 2021

@xiafu-msft

Hi Xiaoxi,

I'm also running into this problem. Paging through the blobs (using results_per_page=5000) is still quite slow, about two seconds per page. I'm already processing the blobs asynchronously, but is there any way to speed up the paging itself?

@alex-schmid-zeiss

(quoting @xiafu-msft's earlier reply on async by_page iteration with __anext__)

Hey @xiafu-msft, I have tried your approach, but unfortunately I'm getting an error: "'BlobPropertiesPaged' object has no attribute '__anext__'". Can you help me?


vholmer commented Jul 5, 2022

@alex-schmid-zeiss unless you already figured this out: you need to import from azure.storage.blob.aio instead of just azure.storage.blob.
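
(A minimal sketch of the distinction, with import aliases invented just for illustration:)

# sync client: by_page() yields a plain iterator, so use next()
from azure.storage.blob import ContainerClient as SyncContainerClient

# async client (note the .aio package): by_page() yields an async
# iterator, so use `await ...__anext__()` or `async for`
from azure.storage.blob.aio import ContainerClient as AsyncContainerClient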

github-actions bot locked and limited conversation to collaborators Apr 12, 2023