Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/write empty chunks #2429

Merged
merged 21 commits into from
Dec 20, 2024
Merged

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Oct 22, 2024

This PR adds a boolean array.write_empty_chunks value to the global config, and uses this value to control whether chunks that are "empty", i.e. filled with values equivalent to the array's fill value, are written to storage.

In zarr-python 2.x, write_empty_chunks was a property of an Array that users specified when creating the Array object. This had pros and cons which I'm happy to discuss if people are interested, but the tl;dr is that the cons of that approach are driving my decision in this PR to make write_empty_chunks a global runtime property accessible via the config API.

Usage looks something like this (donfig experts please correct me if there's a better way):

with config.set({"array.write_empty_chunks": write_empty_chunks}):
    arr[:] = fill_value

If people hate this, then we can definitely change this API. I'm very open to discussion here.

Also worth noting:

Our check for whether a chunk is equal to the fill value is pretty inefficient -- it's allocating a new array for every check invocation. This can definitely be made more efficient, in a stupid way by caching an all-fill-value chunk on the array instance and using that for the comparison, or a smarter way by doing the (chunk, fill_value) comparison without allocating a new array. But I think this is an effort for a separate PR.

closes #2409

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b d-v-b requested review from jhamman and normanrz October 22, 2024 10:15
Copy link
Member

@normanrz normanrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good approach. Should we add some backwards compatibility thing for the write_empty_chunks kwarg in zarr.open?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 22, 2024

Also worth noting:

Our check for whether a chunk is equal to the fill value is pretty inefficient

I think that broadcast_arrays is smart in a couple of ways when it's an array, scalar operation.

In [41]: x = np.random.randn(10, 10, 10)

In [42]: x2, y = np.broadcast_arrays(x, 0)

In [43]: x is x2  # No copy of the array is created
Out[43]: True

In [44]: y.base  # Only a single value is allocated for the fill value array data.

It'd be nice to avoid the equality check when writing, at least under some circumstances, but I haven't thought of an easy way to do that.

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 22, 2024

Should we add some backwards compatibility thing for the write_empty_chunks kwarg in zarr.open?

Are you thinking of something like a warning to guide people to use the configuration approach, if they pass in write_empty_chunks?

@normanrz
Copy link
Member

Yes, a warning and maybe even setting the config for them?

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 22, 2024

Yes, a warning and maybe even setting the config for them?

I think a warning is a good idea but I'm hesitant to have any runtime code that sets config variables beyond the initial setup. IMO we are better off treating it as immutable, and leaving it to users to set. I think we can afford to just do a warning here because user code won't break if write_empty_chunks is set to the wrong value.

@normanrz
Copy link
Member

That sounds reasonable

@d-v-b d-v-b marked this pull request as ready for review October 22, 2024 16:37
@dcherian
Copy link
Contributor

dcherian commented Oct 22, 2024

It'd be nice to avoid the equality check when writing

I've forgotten the code path now, but if zarr creates the empty chunk using np.broadcast_to(self.fill_value, chunk_shape) when we might just check equality of .base or something like that.

src/zarr/codecs/pipeline.py Outdated Show resolved Hide resolved
@jhamman jhamman added the V3 label Nov 29, 2024
@d-v-b d-v-b marked this pull request as draft November 29, 2024 15:41
@dstansby dstansby removed the V3 label Dec 12, 2024
@@ -465,6 +473,8 @@ async def create(
else:
_chunks = normalize_chunks(chunk_shape, shape, dtype_parsed.itemsize)

config_parsed = ArrayConfig(order=order, write_empty_chunks=write_empty_chunks)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we warn here as well?

@normanrz
Copy link
Member

I added an ArrayConfig class that holds order and write_empty_chunks. Both are initialized when the Array object is initialized. By default, the values from the global config are used.

I added deprecation warnings for both the order and write_empty_chunks kwargs.

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 19, 2024

@normanrz do you think we should surface the .config attribute on the Array class?

@d-v-b d-v-b marked this pull request as ready for review December 19, 2024 14:07
@normanrz
Copy link
Member

@normanrz do you think we should surface the .config attribute on the Array class?

I was thinking the same thing. We could expose it as _config.

@normanrz
Copy link
Member

I renamed AsyncArray.config to AsyncArray._config in 08a2d52

@normanrz normanrz requested a review from jhamman December 19, 2024 19:05
@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 20, 2024

@normanrz do you think we should add config as a keyword argument to array creation routines? The type of this parameter would be ArrayConfig | TypedDictThat'sLikeAnArrayConfig This gives people who are currently controlling write_empty_chunks via a keyword argument a simple way to handle the deprecation warning

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 20, 2024

a catch-all config kwarg also lets us potentially grow the space of runtime array configuration without needing to add a lot of weight to the array constructors

order: MemoryOrder
write_empty_chunks: bool

def __init__(
Copy link
Contributor Author

@d-v-b d-v-b Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this design prevents us from having None as a valid value for a configuration parameter, because it will not be possible to distinguish "parameter was unset" from "parameter was set to None". A solution is to use a classmethod that takes a dictionary where some keys may not be present, which unambiguously expresses a missing parameter. I am working on this.

@normanrz
Copy link
Member

@normanrz do you think we should add config as a keyword argument to array creation routines? The type of this parameter would be ArrayConfig | TypedDictThat'sLikeAnArrayConfig This gives people who are currently controlling write_empty_chunks via a keyword argument a simple way to handle the deprecation warning

Yeah. That could work

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 20, 2024

@normanrz do you think we should add config as a keyword argument to array creation routines? The type of this parameter would be ArrayConfig | TypedDictThat'sLikeAnArrayConfig This gives people who are currently controlling write_empty_chunks via a keyword argument a simple way to handle the deprecation warning

Yeah. That could work

nice, i'm hacking on this rn

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 20, 2024

After working on this a bit, I think we can't deprecate the order parameter, because it's part of the v2 array metadata document, and that means it's orthogonal to the runtime memory order parameter. this means there is no contradiction if a user creates a v2 array with order=F and config={'order': 'C'}, although we will need to ensure that this works properly (the memory order would need to be inspected prior to writing to ensure that memory order conversions work)

by contrast, write_empty_chunks is purely runtime.

@normanrz
Copy link
Member

Ok. Are you working on this or do you want help?

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 20, 2024

i'm working on it!

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 20, 2024

  • I removed write_empty_chunks from the Array.create signature. Instead, pass config={'write_empty_chunks': <value>}. The higher-level API creation routines still take write_empty_chunks, but it's deprecated, and those values are used to create an ArrayConfig with the user-provided right write_empty_chunks value.
  • ArrayConfig has a from_dict method that takes a dict; any fields of ArrayConfig that are missing from this dict will be pulled from the global config object
  • we cannot deprecate the order keyword argument, because it's a core part of v2 metadata. But we emit a warning if someone specifies it for a v3 array, because the only degree of freedom there is the config.order parameter.

src/zarr/core/array.py Outdated Show resolved Hide resolved
src/zarr/core/array.py Outdated Show resolved Hide resolved
d-v-b and others added 2 commits December 20, 2024 13:46
Co-authored-by: Norman Rzepka <[email protected]>
Co-authored-by: Norman Rzepka <[email protected]>
@normanrz normanrz dismissed jhamman’s stale review December 20, 2024 13:00

Requested changes have been addressed in the meantime.

@d-v-b d-v-b merged commit 6dc6d07 into zarr-developers:main Dec 20, 2024
27 checks passed
@d-v-b d-v-b deleted the feat/write-empty-chunks branch December 20, 2024 13:10
@jni
Copy link
Contributor

jni commented Jan 1, 2025

Is there a quick way (documentation/docstrings) for users to know that the new default for write_empty_chunks is false? I just read a lot of code before I was able to figure that out. 😂

@d-v-b I'm also not entirely sure why you went with option 2 from #2015 (re: mutable global state), though it's alleviated by passing config to the write functions + context managers. I actually quite like the current implementation despite having the same reservation, I'd just love to see a fully-worked-out rationale for the chosen option, maybe in an update to the top-level description of this PR. (Which could also show the other ways of setting config other than the context manager.)

@d-v-b
Copy link
Contributor Author

d-v-b commented Jan 1, 2025

we definitely need an easy way to document that the default value is False. Tying the default value to the config, which users can change, makes it a bit risky to write "the default value is False" in a docstring, because a user could change the config and thus invalidate that docstring. We could say "unless the global configuration is modified, the default value is False", and ensure that if we ever change the default global config values, we also change that docstring accordingly. I think we could also express the config in a rich repr of the array, so that it's easy to quickly check without hunting through array attributes.

As for the explanation of the merged PR, I'd rather not change the top-level description of the PR at this point but I can add my thoughts here:
We create a new config attribute for arrays, and put write_empty_chunks in the config attribute, and the array config can be transiently altered in a context manager. I'm pretty confident that this is a good fit, because even for a single array, a user might want to skip the "is this chunk empty" check for a select subset of chunks. Put another way, in abstract terms the "is this chunk empty" check is scoped to a single chunk. We don't have a per-chunk configuration, so write_empty_chunks has to live at a higher level of abstraction -- at the array level -- even though it doesn't exactly belong there. Allowing the config to be transiently overridden without mutating the array object itself lets users exert fine-grained control over how their chunks are written -- they can ensure that a subset of the chunks of a given array are written with write_empty_chunks=False, and another subset with True. Even if nobody uses this "override the config for a bit" functionality specifically for write_empty_chunks, I think it could be generally useful, and it's technically a fit for write_empty_chunks.

The default value of the write_empty_chunks parameter comes from the global zarr config. I'm less confident about this decision, but I think it's worth trying out for now. The reason to make the default values context-sensitive is so that users could define an array config in a group, and then the arrays accessed via that group would inherit the group's array config. As long as we support APIs for accessing arrays by name only via group.__getitem__, we need a way to define the config that those arrays use, and this "config inheritance" addresses that. This PR does not add this functionality, but sets the stage for it.

Of course another solution would be to deprecate group.__getitem__ for accessing arrays / groups, and instead recommend a method like get_array(name: str, config: ArrayConfig). but I figured removing __getitem__ would break a lot of code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[v3] support write_empty_chunks
7 participants