Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/concurrent members #2519

Merged
merged 37 commits into from
Jan 7, 2025
Merged

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Nov 26, 2024

Makes AsyncGroup.members() fetch keys and fetch metadata concurrently, which provides a big performance win for high-latency storage backends. The number of concurrent operations is limited by the zarr-wide configuration setting.

in main, Group.members() requires ~O(num_members) time to complete, because it does not perform IO concurrently. In this PR, Group.members runs in constant time (until the number of concurrent requests exceeds the concurrency limit).

@d-v-b
Copy link
Contributor Author

d-v-b commented Nov 26, 2024

note: this PR depends on #2474

@d-v-b d-v-b requested review from jhamman and TomAugspurger and removed request for jhamman November 26, 2024 14:42
@jhamman jhamman added the V3 label Nov 29, 2024
src/zarr/core/group.py Outdated Show resolved Hide resolved
@dstansby dstansby removed the V3 label Dec 12, 2024
# as opposed to a prefix, in the store under the prefix associated with this group
# in which case `key` cannot be the name of a sub-array or sub-group.
warnings.warn(
f"Object at {e.args[0]} is not recognized as a component of a Zarr hierarchy.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's technically possible for KeyError.args to be empty, so this [0] would raise an IndexError:

IndexError                                Traceback (most recent call last)
Cell In[10], line 1
----> 1 KeyError().args[0]

If we're comfortable assuming / requiring that things raising a KeyError here will populate that with the key, then I think this is fine to ignore. Otherwise, we might want to catch that IndexError.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure that args will always be populated, because we are handling exceptions coming downstream of a function that necessarily takes a concrete key to query from storage. For that reason, even if I did add code to handle the case when e.args is not populated, I have no idea how we would test this case with our current Group api, since there's no way that I know of to reach this line without some concrete keys.

manager provided by that semaphore. If the semaphore parameter is None, then getitem is invoked
without a context manager.
"""
if semaphore is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nitpick: I like contextlib.nullcontext for cases where you may or may not have a context manager: https://docs.python.org/3/library/contextlib.html#contextlib.nullcontext

3.10 added support for asynchronous context managers, so this should be usable here:

semaphore = semaphore or contextlib.nullcontext()
async with semaphore:
    ....

And I wonder whether getitem_semaphore should be the one to look up async.concurrency from the zarr config? Maybe we want that to be the default so that we don't miss it anywhere (and have some other way to indicate unbounded concurrency?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried contextlib.nullcontext but I didn't feel like the LOC saved was worth the added indirection in this case. I assume that contextlib.nullcontext doesn't add notable performance overhead, by contrast, I find the current if... else construction easier to reason about, and it requires 1 fewer import.

And I wonder whether getitem_semaphore should be the one to look up async.concurrency from the zarr config? Maybe we want that to be the default so that we don't miss it anywhere (and have some other way to indicate unbounded concurrency?)

As I understand it, we only query the concurrency limit when the semaphore is created, which is necessarily before getitem_semaphore gets called. All invocations of getitem_semaphore have to use the same Semaphore instance, because otherwise there would be no coordination mechanism for rate-limiting. So I think that means getitem_semaphore shouldn't know anything about the config, or even concurrency limits.

@d-v-b
Copy link
Contributor Author

d-v-b commented Jan 7, 2025

@dstansby how can I re-rerun the docs build? and do you have any idea why it failed?

@dstansby
Copy link
Contributor

dstansby commented Jan 7, 2025

Do you have a readtheodcs account? If you send me the email or username for that, I can add you as a maintainer to the project on readthedocs.org. Then you'll get a "Rebuild this build" button:

Screenshot 2025-01-07 at 14 22 48

I'm not sure why this one failed... I will restart!

@d-v-b
Copy link
Contributor Author

d-v-b commented Jan 7, 2025

thanks for fixing it @dstansby!

@d-v-b d-v-b merged commit bc5877b into zarr-developers:main Jan 7, 2025
28 checks passed
@d-v-b d-v-b deleted the feat/concurrent-members branch January 7, 2025 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants