write behavior for empty chunks #2015

Closed · d-v-b opened this issue Jul 5, 2024 · 5 comments

d-v-b (Contributor) commented Jul 5, 2024

In v2, it is possible to set at array access time whether empty chunks (defined as chunks that consist entirely of fill_value) are written to storage or skipped. This is an extremely useful feature for high-latency storage backends, or in any context where storing a large number of objects is burdensome.

We don't support this in v3 yet, but we should. How should we do it? I will throw out a few options, in order of practicality:

  • emulate v2: provide a keyword argument like write_empty_chunks when accessing an array. All chunk writes from that array will be affected.
  • put the write_empty_chunks setting in a global config. All chunk writes from all arrays in a session will be affected by the config parameter.
  • design an API for array IO in which IO is wrapped in a parametrizable context (e.g. a context manager), with one parameter controlling whether the write transaction persists empty chunks. Highly speculative.

The first option seems pretty expedient, and I don't think we had many problems with this approach in v2. The only drawback is that if people want the same array to exhibit conditional write_empty_chunks behavior, they might need something like the second approach, which has its own drawbacks IMO (I'm not a big fan of mutable global state).

I would propose that we emulate v2 for now (i.e., make write_empty_chunks a keyword argument at array access time), note any friction this causes, and consider ways to alleviate it in a subsequent design refresh if the friction proves severe.
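For concreteness, here is a minimal sketch of what option 1 could look like, modeled on the write_empty_chunks keyword that zarr-python v2 (2.11+) accepts at array access time; the store path and array parameters below are hypothetical:

```python
import zarr

# zarr-python v2: opt out of persisting all-fill_value chunks
# when the array is opened.
z = zarr.open(
    "data.zarr",               # hypothetical store path
    mode="w",
    shape=(10_000, 10_000),
    chunks=(1_000, 1_000),
    dtype="f4",
    fill_value=0,
    write_empty_chunks=False,  # skip chunks that are entirely fill_value
)

# Only the single chunk touched with non-fill data is written;
# the remaining 99 chunks never appear in storage.
z[:1_000, :1_000] = 1.0
```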

cc @constantinpape

constantinpape commented

My 2 cents: I personally think this should be the default behavior. We have use cases where writing empty chunks is extremely bad (as in bringing us over our IO node quota immediately and killing everyone's jobs), and I am hesitant to hand students a library that writes empty chunks by default.

Besides this, both options 1 and 2 sound OK to me.

d-v-b (Contributor, Author) commented Jul 5, 2024

> My 2 cents: I personally think this should be the default behavior.

I tend to agree. The one counter-point is that for dense arrays, i.e. arrays in which all chunks hold useful data, inspecting each chunk adds some overhead to writing. But I think our design should err on the side of spending some extra CPU cycles over potentially generating massive numbers of useless files.
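The overhead in question is a full scan of each chunk before it is encoded. A minimal sketch of such a check follows; the helper name is hypothetical, not part of zarr's API, and a NaN fill_value needs special handling because NaN != NaN:

```python
import numpy as np

def chunk_is_empty(chunk: np.ndarray, fill_value) -> bool:
    # Hypothetical helper: True if every element equals fill_value,
    # i.e. the chunk can be skipped instead of written.
    if fill_value is None:
        return False
    if isinstance(fill_value, float) and np.isnan(fill_value):
        # NaN != NaN, so an equality comparison would always be False.
        return bool(np.all(np.isnan(chunk)))
    return bool(np.all(chunk == fill_value))
```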

rabernat (Contributor) commented Jul 5, 2024

> inspecting each chunk adds some overhead to writing

Here is some data on that from my laptop:

| dtype | bytes | time (s) | throughput (MB/s) |
| --- | --- | --- | --- |
| i2 | 2000 | 1.40e-04 | 14.31 |
| i2 | 2000000 | 2.53e-04 | 7897.34 |
| i2 | 2000000000 | 6.26e-01 | 3195.49 |
| i4 | 4000 | 9.53e-05 | 41.98 |
| i4 | 4000000 | 2.01e-04 | 19863.44 |
| i4 | 4000000000 | 7.60e-01 | 5261.13 |
| i8 | 8000 | 3.64e-05 | 219.93 |
| i8 | 8000000 | 2.54e-04 | 31506.36 |
| i8 | 8000000000 | 1.24e+00 | 6441.56 |
| f4 | 4000 | 6.52e-05 | 61.38 |
| f4 | 4000000 | 2.60e-04 | 15382.13 |
| f4 | 4000000000 | 7.29e-01 | 5490.37 |
| f8 | 8000 | 4.01e-05 | 199.38 |
| f8 | 8000000 | 2.49e-04 | 32074.79 |
| f8 | 8000000000 | 1.19e+00 | 6706.42 |

Given that the throughput is many GB/s for larger chunk sizes, this seems unlikely to be a rate-limiting step for most I/O-bound workloads against disk or cloud storage. So I think it's a fine default.

Does it make sense to think of this as an Array -> Array codec which may just abort the entire writing pipeline?

```python
import numpy as np
from time import perf_counter

# Benchmark: how long does it take to scan a chunk for non-fill values?
for dtype in ["i2", "i4", "i8", "f4", "f8"]:
    for n in [1000, 1_000_000, 1_000_000_000]:
        data = np.zeros(n, dtype=dtype)
        nbytes = data.nbytes
        tic = perf_counter()
        np.any(data > 0)  # full scan of the chunk
        toc = perf_counter() - tic
        throughput = nbytes / toc / 1e6  # MB/s
        print(f"| {dtype} | {nbytes} | {toc:3.2e} | {throughput:4.2f} |")
```

jhamman (Member) commented Oct 23, 2024

xref: #2409

dstansby (Contributor) commented

Was this fixed by #2429?

@dstansby added this to the 3.0.0 milestone Dec 30, 2024
@jni mentioned this issue Jan 1, 2025
@jhamman closed this as completed Jan 4, 2025