Feat/write empty chunks #2429
I think this is a good approach. Should we add some backwards compatibility thing for the write_empty_chunks kwarg in zarr.open?
I think that
In [41]: x = np.random.randn(10, 10, 10)
In [42]: x2, y = np.broadcast_arrays(x, 0)
In [43]: x is x2  # No copy of the array is created
Out[43]: True
In [44]: y.base  # Only a single value is allocated for the fill value array data.
It'd be nice to avoid the equality check when writing, at least under some circumstances, but I haven't thought of an easy way to do that.
Are you thinking of something like a warning to guide people to use the configuration approach, if they pass in |
Yes, a warning and maybe even setting the config for them?
I think a warning is a good idea but I'm hesitant to have any runtime code that sets config variables beyond the initial setup. IMO we are better off treating it as immutable, and leaving it to users to set. I think we can afford to just do a warning here because user code won't break if |
That sounds reasonable
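The warn-but-do-not-mutate-config approach agreed on above could look roughly like the following. This is a minimal sketch, not the code in this PR: `open_sketch` is a hypothetical stand-in for an open-style function, and the warning text is an assumption.

```python
import warnings
from typing import Optional


def open_sketch(path: str, write_empty_chunks: Optional[bool] = None) -> dict:
    """Hypothetical stand-in for an open-style function (not zarr's real code)."""
    if write_empty_chunks is not None:
        # Only warn. The global config is deliberately left untouched,
        # so setting it remains the user's job.
        warnings.warn(
            "The write_empty_chunks keyword is deprecated; set "
            "'array.write_empty_chunks' in the config instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    # Return what was received so the behaviour is easy to inspect.
    return {"path": path, "write_empty_chunks": write_empty_chunks}
```

Because the function only warns, existing user code that still passes the keyword keeps running unchanged.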
I've forgotten the code path now, but if zarr creates the empty chunk using |
src/zarr/core/array.py
@@ -465,6 +473,8 @@ async def create(
    else:
        _chunks = normalize_chunks(chunk_shape, shape, dtype_parsed.itemsize)

    config_parsed = ArrayConfig(order=order, write_empty_chunks=write_empty_chunks)
Should we warn here as well?
I added an I added deprecation warnings for both the |
…n into feat/write-empty-chunks
@normanrz do you think we should surface the |
I was thinking the same thing. We could expose it as |
I renamed |
@normanrz do you think we should add |
a catch-all |
src/zarr/core/array_spec.py
    order: MemoryOrder
    write_empty_chunks: bool

    def __init__(
This design prevents us from having None as a valid value for a configuration parameter, because it will not be possible to distinguish "parameter was unset" from "parameter was set to None". A solution is to use a classmethod that takes a dictionary where some keys may not be present, which unambiguously expresses a missing parameter. I am working on this.
Yeah. That could work
nice, i'm hacking on this rn
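The classmethod idea proposed above can be sketched as follows. Everything here is illustrative: `ArrayConfigSketch`, `from_dict`, and the `DEFAULTS` mapping are hypothetical names, and the default values are assumptions standing in for whatever the global config provides.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Hypothetical defaults standing in for values read from the global config.
DEFAULTS = {"order": "C", "write_empty_chunks": False}


@dataclass(frozen=True)
class ArrayConfigSketch:
    order: str
    write_empty_chunks: Any  # left loose so an explicit None can be shown

    @classmethod
    def from_dict(cls, data: Mapping[str, Any]) -> "ArrayConfigSketch":
        unknown = set(data) - set(DEFAULTS)
        if unknown:
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        # A missing key means "unset" and falls back to the default.
        # A present key is always honoured, even when its value is None.
        return cls(**{**DEFAULTS, **data})
```

With this shape, `from_dict({})` and `from_dict({"write_empty_chunks": None})` give different results, which is the distinction a `None`-defaulted `__init__` cannot express.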
After working on this a bit, I think we can't deprecate the by contrast, |
Ok. Are you working on this or do you want help?
i'm working on it!
…from config order
Co-authored-by: Norman Rzepka <[email protected]>
Requested changes have been addressed in the meantime.
Is there a quick way (documentation/docstrings) for users to know that the new default for

@d-v-b I'm also not entirely sure why you went with option 2 from #2015 (re: mutable global state), though it's alleviated by passing config to the write functions + context managers. I actually quite like the current implementation despite having the same reservation, I'd just love to see a fully-worked-out rationale for the chosen option, maybe in an update to the top-level description of this PR. (Which could also show the other ways of setting config other than the context manager.)
we definitely need an easy way to document that the default value is As for the explanation of the merged PR, I'd rather not change the top-level description of the PR at this point but I can add my thoughts here: The default value of the Of course another solution would be to deprecate |
This PR adds a boolean array.write_empty_chunks value to the global config, and uses this value to control whether chunks that are "empty", i.e. filled with values equivalent to the array's fill value, are written to storage.

In zarr-python 2.x, write_empty_chunks was a property of an Array that users specified when creating the Array object. This had pros and cons which I'm happy to discuss if people are interested, but the tl;dr is that the cons of that approach are driving my decision in this PR to make write_empty_chunks a global runtime property accessible via the config API.

Usage looks something like this (donfig experts please correct me if there's a better way):

If people hate this, then we can definitely change this API. I'm very open to discussion here.

Also worth noting:
Our check for whether a chunk is equal to the fill value is pretty inefficient -- it's allocating a new array for every check invocation. This can definitely be made more efficient, in a stupid way by caching an all-fill-value chunk on the array instance and using that for the comparison, or a smarter way by doing the (chunk, fill_value) comparison without allocating a new array. But I think this is an effort for a separate PR.

closes #2409
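To make the behaviour described in this PR concrete, here is a toy sketch of the write decision. It assumes plain Python lists as chunks and a dict as the store. `chunk_is_empty` and `write_chunk` are hypothetical helpers, not zarr functions. Removing a stale stored chunk when an empty one is skipped is also an assumption of this sketch.

```python
from typing import Any, Dict, List


def chunk_is_empty(chunk: List[Any], fill_value: Any) -> bool:
    # Element-wise comparison against the scalar fill value; no
    # all-fill-value chunk is built. NaN fill values are not handled here.
    return all(value == fill_value for value in chunk)


def write_chunk(
    store: Dict[str, List[Any]],
    key: str,
    chunk: List[Any],
    fill_value: Any,
    write_empty_chunks: bool,
) -> None:
    if not write_empty_chunks and chunk_is_empty(chunk, fill_value):
        # Skip the write, and drop any previously stored copy so that a
        # read falls back to the fill value (assumption of this sketch).
        store.pop(key, None)
        return
    store[key] = list(chunk)
```

When `write_empty_chunks` is true the emptiness check is skipped entirely, so the comparison cost is only paid when empty chunks are being elided.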
TODO: