write behavior for empty chunks #2015

Closed · d-v-b opened this issue Jul 5, 2024 · 5 comments

d-v-b (Contributor) commented Jul 5, 2024

In v2, it is possible to set at array access time whether empty chunks (defined as chunks that consist entirely of fill_value) are written to storage or skipped. This is an extremely useful feature for high-latency storage backends, or in any context where storing a large number of objects is burdensome.

We don't support this in v3 yet, but we should. How should we do it? I will throw out a few options, in order of practicality:

  • emulate v2: provide a keyword argument like write_empty_chunks when accessing an array. All chunk writes from that array will be affected.
  • put the write_empty_chunks setting in a global config. All chunk writes from all arrays in a session will be affected by the config parameter.
  • design an API for array IO in which IO is wrapped in a parametrizable context (e.g. a context manager), with one parameter controlling whether the write transaction persists empty chunks. Highly speculative.

The first option seems pretty expedient, and I don't think we had many problems with this approach in v2. The only drawback is that if people want the same array to exhibit conditional write_empty_chunks behavior, they might need something like the second approach, which has its own drawbacks IMO (I'm not a big fan of mutable global state).

I would propose that we emulate v2 for now (i.e., make write_empty_chunks a keyword argument at array access time), note any friction this causes, and consider ways to alleviate it in a subsequent design refresh if the friction proves severe.
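For concreteness, here is a minimal sketch of what option 1 could look like, modeled on the write_empty_chunks keyword that zarr-python v2 (2.11+) accepts at array access time; the store path and array parameters below are hypothetical:

```python
import zarr

# zarr-python v2: opt out of persisting all-fill_value chunks
# when the array is opened.
z = zarr.open(
    "data.zarr",               # hypothetical store path
    mode="w",
    shape=(10_000, 10_000),
    chunks=(1_000, 1_000),
    dtype="f4",
    fill_value=0,
    write_empty_chunks=False,  # skip chunks that are entirely fill_value
)

# Only the single chunk touched with non-fill data is written;
# the remaining 99 chunks never appear in storage.
z[:1_000, :1_000] = 1.0
```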

cc @constantinpape

constantinpape commented

My 2 cents: I personally think this should be the default behavior. We have use cases where writing empty chunks is extremely bad (as in bringing us over our IO node quota immediately and killing everyone's jobs), and I am hesitant to hand students a library that writes empty chunks by default.

Besides this, both options 1 and 2 sound OK to me.

d-v-b (Contributor, Author) commented Jul 5, 2024

> My 2 cents: I personally think this should be the default behavior.

I tend to agree. The one counter-point is that for dense arrays, i.e. arrays in which all chunks hold useful data, inspecting each chunk adds some overhead to writing. But I think our design should err on the side of spending some extra CPU cycles over potentially generating massive numbers of useless files.
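The overhead in question is a full scan of each chunk before it is encoded. A minimal sketch of such a check follows; the helper name is hypothetical, not part of zarr's API, and a NaN fill_value needs special handling because NaN != NaN:

```python
import numpy as np

def chunk_is_empty(chunk: np.ndarray, fill_value) -> bool:
    # Hypothetical helper: True if every element equals fill_value,
    # i.e. the chunk can be skipped instead of written.
    if fill_value is None:
        return False
    if isinstance(fill_value, float) and np.isnan(fill_value):
        # NaN != NaN, so an equality comparison would always be False.
        return bool(np.all(np.isnan(chunk)))
    return bool(np.all(chunk == fill_value))
```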

rabernat (Contributor) commented Jul 5, 2024

> inspecting each chunk adds some overhead to writing

Here is some data on that from my laptop:

| dtype | bytes | time (s) | throughput (MB/s) |
| --- | --- | --- | --- |
| i2 | 2000 | 1.40e-04 | 14.31 |
| i2 | 2000000 | 2.53e-04 | 7897.34 |
| i2 | 2000000000 | 6.26e-01 | 3195.49 |
| i4 | 4000 | 9.53e-05 | 41.98 |
| i4 | 4000000 | 2.01e-04 | 19863.44 |
| i4 | 4000000000 | 7.60e-01 | 5261.13 |
| i8 | 8000 | 3.64e-05 | 219.93 |
| i8 | 8000000 | 2.54e-04 | 31506.36 |
| i8 | 8000000000 | 1.24e+00 | 6441.56 |
| f4 | 4000 | 6.52e-05 | 61.38 |
| f4 | 4000000 | 2.60e-04 | 15382.13 |
| f4 | 4000000000 | 7.29e-01 | 5490.37 |
| f8 | 8000 | 4.01e-05 | 199.38 |
| f8 | 8000000 | 2.49e-04 | 32074.79 |
| f8 | 8000000000 | 1.19e+00 | 6706.42 |

Given that the throughput is many GB/s for larger chunk sizes, this seems unlikely to be a rate-limiting step for most I/O-bound workloads against disk or cloud storage. So I think it's a fine default.

Does it make sense to think of this as an Array -> Array codec which may just abort the entire writing pipeline?

```python
import numpy as np
from time import perf_counter

# Benchmark: how long does it take to scan a chunk for non-fill values?
for dtype in ["i2", "i4", "i8", "f4", "f8"]:
    for n in [1000, 1_000_000, 1_000_000_000]:
        data = np.zeros(n, dtype=dtype)
        nbytes = data.nbytes
        tic = perf_counter()
        np.any(data > 0)  # full scan of the chunk
        toc = perf_counter() - tic
        throughput = nbytes / toc / 1e6  # MB/s
        print(f"| {dtype} | {nbytes} | {toc:3.2e} | {throughput:4.2f} |")
```

jhamman (Member) commented Oct 23, 2024

xref: #2409

dstansby (Contributor) commented

Was this fixed by #2429?

@dstansby added this to the 3.0.0 milestone Dec 30, 2024
@jni mentioned this issue Jan 1, 2025
@jhamman closed this as completed Jan 4, 2025