list_blobs is slow when paging through large containers #11593
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.
Hi @mockodin Thanks for reaching out!
You can submit each block of results as a task; then every time a task finishes, you can grab the continuation_token and submit a new task (roughly like the sketch below). Could you explain more about "requiring the list to be consumed first"? Sorry, I didn't get that clearly. Let me know if this doesn't help!
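A rough sketch of one reading of this suggestion, assuming the v12 sync `ContainerClient` (the connection string, container name, page size, and worker count are all placeholders, not from the original thread): pages still have to be fetched serially, because each continuation_token only arrives with its page, but the per-blob work can be farmed out to threads.

```python
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import ContainerClient

# Placeholders: swap in your own connection string and container name.
container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="<container>")

def summarize(page):
    # The page's contents are already fetched; iterating them is local work.
    return [(b.name, b.size, b.blob_tier) for b in page]

pages = container.list_blobs(results_per_page=5000).by_page()
with ThreadPoolExecutor(max_workers=8) as pool:
    # Each step of iterating `pages` makes one service call (and yields the
    # next continuation_token); per-page processing is handed to workers.
    futures = [pool.submit(summarize, page) for page in pages]
    report = [row for f in futures for row in f.result()]
```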
@xiafu-msft Answering and expanding slightly: Code Example:
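A minimal sketch of the kind of loop being described (connection string and container name are placeholders; the list(next(...)) shape is the one referred to in the reply below):

```python
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="<container>")

report = []
blobs = container.list_blobs(results_per_page=5000).by_page()
while True:
    try:
        page = list(next(blobs))  # fetch a page, then materialize it
    except StopIteration:
        break
    for blob in page:
        report.append((blob.name, blob.size, blob.blob_tier))
```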
Ultimately my output is a list of blobs with individual blob size and tier for reporting.
Hi @mockodin Currently you can optimize this by calling only next(blobs) instead of list(next(blobs)), to avoid one extra round of iteration over each page, and see if that helps:
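A sketch of that suggestion, under the same placeholder setup as above: iterate the fetched page lazily rather than copying it into a list first.

```python
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="<container>")

blobs = container.list_blobs(results_per_page=5000).by_page()
page = next(blobs)                  # one service call fetches the page
for blob in page:                   # iterate lazily; no list() copy
    print(blob.name, blob.size, blob.blob_tier)
token = blobs.continuation_token    # token for resuming after this page
```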
Hopefully this will help increase performance; if it's still slow, the async SDK could probably help a bit!
Hi @xiafu-msft I tested the above method; listing 50K blobs consistently takes ~40-45 seconds, irrespective of page size. ~99.9847% of the processing time centers on the next() operation. Example:
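A sketch of the kind of measurement being described (placeholders as before); nearly all of the wall-clock time lands on the next() call that performs the service request:

```python
import time
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="<container>")

pages = container.list_blobs(results_per_page=5000).by_page()
while True:
    start = time.perf_counter()
    try:
        page = next(pages)              # nearly all elapsed time is here
    except StopIteration:
        break
    fetch = time.perf_counter() - start
    count = sum(1 for _ in page)        # iterating the fetched page is cheap
    print(f"next(): {fetch:.2f}s for {count} blobs")
```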
Moving the next() operation outside to be handled in parallel threads would have a huge performance impact. I think that with next() generating the continuation_token, it will remain the choke point. You mentioned async, which I did start looking at (for transparency, I am new to it). I was able to run a simple test, as follows; performance was consistent with non-async performance, which I think is probably expected for the example below, since next() is still occurring serially in the background:
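A minimal sketch of such a test with the aio client (names are placeholders); flat async iteration still fetches the pages one after another:

```python
import asyncio
from azure.storage.blob.aio import ContainerClient

async def main():
    container = ContainerClient.from_connection_string(
        "<connection-string>", container_name="<container>")
    async with container:
        count = 0
        async for blob in container.list_blobs(results_per_page=5000):
            count += 1      # pages are still requested one after another
        print(count)

asyncio.run(main())
```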
I'm also trying by_page under async, but I'm getting "TypeError: 'AsyncList' object is not an iterator" for the example below. Any references/links to examples would be useful.
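Presumably the failing example mixed the synchronous iteration protocol with the async pager, along these lines (a hypothetical reconstruction, not the original code):

```python
import asyncio
from azure.storage.blob.aio import ContainerClient

async def main():
    container = ContainerClient.from_connection_string(
        "<connection-string>", container_name="<container>")
    async with container:
        pages = container.list_blobs(results_per_page=5000).by_page()
        async for page in pages:
            for blob in page:   # sync iteration over an AsyncList: TypeError
                print(blob.name)

asyncio.run(main())
```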
Async looks like it could provide the functionality I attempted to build with threading and multiprocessing, though perhaps more natively and efficiently... if it will work.
Hi @mockodin The continuation_token is returned by the service to indicate where to continue from. next() calls for the next page of items (in this case 5000 items) and gets the continuation token; it's impossible to get a continuation token without sending a request first. For async iteration by page, it works this way:
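A sketch of that pattern with the aio client (connection string, container name, and page size are placeholders): both the pages and each page's items are iterated with async for.

```python
import asyncio
from azure.storage.blob.aio import ContainerClient

async def main():
    container = ContainerClient.from_connection_string(
        "<connection-string>", container_name="<container>")
    async with container:
        pages = container.list_blobs(results_per_page=5000).by_page()
        async for page in pages:            # async iteration over pages
            async for blob in page:         # ...and over each page's items
                print(blob.name, blob.size, blob.blob_tier)
            # pages.continuation_token now points past the consumed page

asyncio.run(main())
```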
Hi @mockodin Do you have any other concerns or questions about this issue?
Please reactivate if you have further questions or concerns.
Hi Xiaoxi, I'm also running into this problem. Paging through the blobs (using
Hey @xiafu-msft I have tried your approach, but unfortunately I'm getting an error:
@Alex97schmid-zeiss unless you already figured this out, you need to import azure.storage.blob.aio instead of just azure.storage.blob. |
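For instance (a minimal illustration; the alias is just for clarity):

```python
# Sync clients:
from azure.storage.blob import ContainerClient
# Async clients live in the .aio subpackage:
from azure.storage.blob.aio import ContainerClient as AsyncContainerClient
```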
list_blobs is slow when paging through a container containing millions of blobs where prefix filtering is not viable (millions of GUIDs). Scanning a container of ~70M objects takes 8-12 hours.
Iterating a container for blob properties should allow for concurrent page calls, either internally controlled/throttled (similar in concept to threaded azcopy downloads) or by providing continuation_tokens in the initial response rather than requiring the list to be consumed first.