write behavior for empty chunks #2015
Comments
My 2 cents: I personally think skipping empty chunks should be the default behavior. We have use cases where writing empty chunks is extremely bad (as in bringing us over IO node quota immediately and killing everyone's jobs), and I am hesitant to give students a library that writes empty chunks by default. Beyond that, options 1 and 2 both sound OK to me.
I tend to agree. The one counterpoint is that for dense arrays, i.e. arrays where every chunk contains useful data, inspecting each chunk adds some overhead to writing. But I think our design should err on the side of spending some extra CPU cycles over potentially generating massive numbers of useless files.
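To make the overhead concrete: the per-chunk inspection is essentially one comparison pass over the chunk's data. Below is a minimal NumPy sketch of such a check. The name `is_empty_chunk` is hypothetical and this is not zarr's implementation, just an illustration of the idea, including the NaN fill value case.

```python
import numpy as np


def is_empty_chunk(chunk: np.ndarray, fill_value) -> bool:
    """Return True if every element of `chunk` equals `fill_value`.

    Hypothetical helper for illustration; not part of zarr's API.
    """
    # NaN never compares equal to itself, so a NaN fill value needs
    # an explicit isnan test instead of ==.
    if fill_value != fill_value:
        return bool(np.all(np.isnan(chunk)))
    return bool(np.all(chunk == fill_value))
```

For a dense array this check returns False for every chunk, which is the wasted work the counterpoint refers to.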
Here is some data on that from my laptop
Given that the throughput is many GB/s for larger chunk sizes, this seems unlikely to be a rate-limiting step for most I/O-bound workloads against disk or cloud storage. So I think it's a fine default. Does it make sense to think of this as an Array -> Array codec that may just abort the entire writing pipeline?
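A rough sketch of what "an Array -> Array step that may abort the pipeline" could look like, assuming a toy pipeline of plain functions over a dict-backed store. All names here (`Step`, `skip_if_empty`, `write_chunk`) are hypothetical and are not zarr's codec API.

```python
from typing import Callable, List, Optional

import numpy as np

# A pipeline step maps a chunk to a chunk, or returns None to abort the write.
Step = Callable[[np.ndarray], Optional[np.ndarray]]


def skip_if_empty(fill_value) -> Step:
    """Build a step that aborts the write when the chunk is all `fill_value`."""
    def step(chunk: np.ndarray) -> Optional[np.ndarray]:
        return None if np.all(chunk == fill_value) else chunk
    return step


def write_chunk(store: dict, key: str, chunk: np.ndarray, steps: List[Step]) -> bool:
    """Run `chunk` through `steps`; store its bytes unless a step aborts.

    Returns True if the chunk was written, False if the write was aborted.
    """
    for step in steps:
        chunk = step(chunk)
        if chunk is None:
            # In this sketch an aborted write also drops any stale object,
            # so a later read would fall back to the fill value.
            store.pop(key, None)
            return False
    store[key] = chunk.tobytes()
    return True
```

One consequence of modelling it this way is that "skip empty chunks" becomes part of the array's pipeline definition rather than a runtime option, which may or may not be what we want.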
xref: #2409
Was this fixed by #2429?
In v2, at array access time it is possible to set whether empty chunks (defined as chunks that are entirely `fill_value`) should be written to storage or skipped. This is an extremely useful feature for high-latency storage backends, or in any context where too many objects in storage is burdensome.
We don't support this in v3 yet, but we should. How should we do it? I will throw out a few options in order of practicality:
1. Add a `write_empty_chunks` keyword argument when accessing an array. All chunk writes from that array will be affected.
2. Add a `write_empty_chunks` setting in a global config. All chunk writes from all arrays in a session will be affected by the config parameter.
The first option seems pretty expedient, and I don't think we had a lot of problems with this approach in v2. The only drawback is that if people want the same array to exhibit conditional `write_empty_chunks` behavior, then they might need something like the second approach, which has its own drawbacks IMO (I'm not a big fan of mutable global state).
I would propose that we emulate v2 for now (i.e., make `write_empty_chunks` a keyword argument to array access) and note any friction this causes, and consider ways to alleviate that in a subsequent design refresh if the friction is severe.
cc @constantinpape
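As a sketch of the proposed semantics, assuming a toy 1-D array over a dict store: the flag is fixed when the array handle is created and applies to every write through that handle. `ToyArray` is a hypothetical stand-in, not zarr's `Array` class or its v3 signature.

```python
import numpy as np


class ToyArray:
    """A 1-D chunked array over a dict store; a stand-in for illustration only."""

    def __init__(self, store, chunk_len, fill_value=0, write_empty_chunks=True):
        self.store = store
        self.chunk_len = chunk_len
        self.fill_value = fill_value
        # Fixed per handle: every write through this handle behaves the same.
        self.write_empty_chunks = write_empty_chunks

    def write(self, data):
        """Split `data` into chunks and write each, skipping empty ones if configured."""
        for start in range(0, len(data), self.chunk_len):
            key = str(start // self.chunk_len)
            chunk = np.asarray(data[start:start + self.chunk_len])
            if not self.write_empty_chunks and np.all(chunk == self.fill_value):
                self.store.pop(key, None)  # nothing stored for an all-fill chunk
            else:
                self.store[key] = chunk.tobytes()


store = {}
data = np.array([0, 0, 0, 0, 5, 0, 0, 0])
ToyArray(store, chunk_len=4, write_empty_chunks=False).write(data)
print(sorted(store))  # ['1']
```

Getting different behavior for the same underlying data then means creating a second handle with a different flag, which is the kind of friction worth watching for.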