Extension proposal for zarr accumulation #205

hailiangzhang · 2023-02-14T20:32:25Z

This PR contains the Zarr extension proposal for a method that we developed at NASA GESDISC -- Zarr-based Chunk-level Accumulation, which provides fast and cost-efficient data analysis services based on chunk level statistics. The proposal mainly includes the Zarr group structure and attribute schema for the chunk level statistics.

This related PR against zeps repo is here.

jbms · 2023-02-16T04:32:59Z

Thanks for this proposal. It seems like it may make sense to specify the stride in units of elements rather than chunks. Then this would more generally also cover downsampling.

hailiangzhang · 2023-02-16T16:58:52Z

Thanks for this proposal. It seems like it may make sense to specify the stride in units of elements rather than chunks. Then this would more generally also cover downsampling.

Thanks for your feedback @jbms !

Yes, it will be great that we can make this approach more flexible by specifying the stride in units of elements rather than chunks, but I think it probably okay to save the aggregation statistics at the chunk level since:

Chunk is the smallest transaction unit (IO, compression, filtering and etc), and performance-wise it should be okay to have the aggregation statistics aligned along the chunk boundary (since we need to load the whole chunk anyway).
Setting strides at element level may increase the coding complexity quite a lot.
More importantly, setting strides at chunk level doesn't mean the averaging/aggregation service can only be provided at chunk level. The end users can specify any random temporal range or bounding box, and the internal stride setup should be transparent to them. The application needs to locate the strides at the corner of the averaging boundary, and the averaging/aggregation result will be some simple linear combination of the accumulation values of these strides and internal values inside these chunks at the boundary.

jbms · 2023-03-23T05:47:22Z

@hailiangzhang As I mentioned at the meeting today, it would be very helpful if you can provide some mathematical formulas or pseudocode that describes which cumulative sums you are storing. Your proposal describes the attribute schema but doesn't seem to describe exactly which sums are actually being stored, or how to compute the weighted sums based on the stored values.

It occurs to me that the following strategy would accomplish the goals of your proposal and might be similar to what you are doing:

A summed area table A for a variable X(i, j, k) is defined by: A(i, j, k) = sum_{i1<=i, j1<=j, k1<=k} X(i1,j1,k1). I think this is the strategy you described in your presentation as the standard accumulation strategy. This can be computed in linear time, and allows the sum over any region to be computed with 2^n reads of A, where n is the number of dimensions (3 in this case), but as you noted can run into numerical precision issues.
Instead of storing A fully, we can store only the values A(i, j, k) where i, j, or k is a multiple of some stride ci, cj, ck (e.g. the zarr chunk size), i.e. just the positions on a chunk boundary. This can be computed with the same cost as computing A fully. The storage cost will be approximately 1 / ci + 1 / cj + 1 / ck.
To compute the sum over any region, we need to be able to read the 2^n values from A. To do this, for each read we can reconstruct the necessary chunk of A by reading the stored values of A at the boundary of the chunk, and reading the content of the corresponding chunk of X. This takes time linear in the size of the chunk.
To efficiently store the values of A only at chunk boundaries, we need to somehow pack it into some number of dense arrays. The simplest way to do that would be to have a separate array for each dimension, i.e. Ai(ic, j, k) = A(ic * ci, j, k), Aj(i, jc, k) = A(i, jc * cj, k), Ak(i, j, kc) = A(i, j, kc * ck), which is stored for all i, j, k, ic, jc, kc. This does mean we redundant store values that are on the boundary in more than one dimension, though; a more complicated stored representation could potentially avoid that.
This strategy still does not solve the numerical precision issues, so I suspect it is not exactly the same as what you are doing.

Any clarification you can offer would be very helpful.

hailiangzhang · 2023-03-23T17:55:10Z

@hailiangzhang As I mentioned at the meeting today, it would be very helpful if you can provide some mathematical formulas or pseudocode that describes which cumulative sums you are storing. Your proposal describes the attribute schema but doesn't seem to describe exactly which sums are actually being stored, or how to compute the weighted sums based on the stored values.

It occurs to me that the following strategy would accomplish the goals of your proposal and might be similar to what you are doing:

A summed area table A for a variable X(i, j, k) is defined by: A(i, j, k) = sum_{i1<=i, j1<=j, k1<=k} X(i1,j1,k1). I think this is the strategy you described in your presentation as the standard accumulation strategy. This can be computed in linear time, and allows the sum over any region to be computed with 2^n reads of A, where n is the number of dimensions (3 in this case), but as you noted can run into numerical precision issues.

Instead of storing A fully, we can store only the values A(i, j, k) where i, j, or k is a multiple of some stride ci, cj, ck (e.g. the zarr chunk size), i.e. just the positions on a chunk boundary. This can be computed with the same cost as computing A fully. The storage cost will be approximately 1 / ci + 1 / cj + 1 / ck.

To compute the sum over any region, we need to be able to read the 2^n values from A. To do this, for each read we can reconstruct the necessary chunk of A by reading the stored values of A at the boundary of the chunk, and reading the content of the corresponding chunk of X. This takes time linear in the size of the chunk.

To efficiently store the values of A only at chunk boundaries, we need to somehow pack it into some number of dense arrays. The simplest way to do that would be to have a separate array for each dimension, i.e. Ai(i, j, k) = A(i * ci, j, k), Aj(i, j, k) = A(i, j * cj, k), Ak(i, j, k) = A(i, j, k * ck). This does mean we redundant store values that are on the boundary in more than one dimension, though; a more complicated stored representation could potentially avoid that.

This strategy still does not solve the numerical precision issues, so I suspect it is not exactly the same as what you are doing.

Any clarification you can offer would be very helpful.

Hi @jbms , thank you so much for writing down what's in your mind! I believe we're mostly aligned, except for some lower level details. As you suggested, I will provide some code so you can look into it closely, but it may take some time as I mentioned. But I can try to provide the formula (similar to what you have done) ASAP.
Meanwhile, just to make sure we are on the same page, in this formula you listed above Ai(i, j, k) = A(i * ci, j, k), is i, j, or k still "a multiple of some stride ci, cj, ck (e.g. the zarr chunk size)"?

jbms · 2023-03-23T18:10:42Z

Meanwhile, just to make sure we are on the same page, in this formula you listed above Ai(i, j, k) = A(i * ci, j, k), is i, j, or k still "a multiple of some stride ci, cj, ck (e.g. the zarr chunk size)"?

Sorry, I realize my notation wasn't very clear. I updated my comment to use the notation: Ai(ic, j, k) = A(ic * ci, j, k), so that I'm not using i in two different ways, and added an example with the sizes of the different arrays.

jbms · 2023-03-23T18:20:29Z

Independent of these details, I think it is clear now that this exists as an additional layer entirely on top of zarr itself. I'm still interested in learning more about this proposal, and happy to provide further review/feedback, but per discussion at the ZEP meeting today, it appears that this proposal may not be a good fit for the ZEP process itself.

The ZEP process is intended for changes to the zarr specification itself, and is intended to gather feedback from zarr implementors. This proposal more naturally exists as a layer entirely on top of zarr itself and requires no changes to zarr implementations like zarr-python. Therefore, it may make more sense for you to publish this standard independent of the zarr specification, as an external extension, similar to OME-zarr. If in the future we have a website that links to external extensions it could be included there.

hailiangzhang · 2023-03-23T19:00:17Z

Independent of these details, I think it is clear now that this exists as an additional layer entirely on top of zarr itself. I'm still interested in learning more about this proposal, and happy to provide further review/feedback, but per discussion at the ZEP meeting today, it appears that this proposal may not be a good fit for the ZEP process itself.

The ZEP process is intended for changes to the zarr specification itself, and is intended to gather feedback from zarr implementors. This proposal more naturally exists as a layer entirely on top of zarr itself and requires no changes to zarr implementations like zarr-python. Therefore, it may make more sense for you to publish this standard independent of the zarr specification, as an external extension, similar to OME-zarr. If in the future we have a website that links to external extensions it could be included there.

Hi @jbms , yes, I was totally aware that this proposal is on top of Zarr itself. Actually I did mention this concern at some point, and was suggested to file a Zarr extension, instead of spec, proposal. Honestly I got a feeling before as well that this proposal may not be necessary to stay in Zarr core spec repo, and yes, we can host the spec as an external extension, similar to OME-zarr, and I will provide a link when it's in-place. Thanks a lot for your suggestions!

rabernat · 2023-05-17T16:30:04Z

The ZEP process is intended for changes to the zarr specification itself, and is intended to gather feedback from zarr implementors.

Apologies for coming so late to this conversation.

I am the one who encouraged @hailiangzhang to consider proposing this as an extension. I wanted to state that, to me, this does seem like a good fit for an optional extension. I see this proposal as comparable to parquet's row-group statistics. Implementations can use this information to implement more efficient queries, e.g. predicate pushdown, or, in Hailiang's case, more efficient sums / means over regions of the array.

jbms · 2023-05-17T18:43:06Z

A more precise specification of this proposal would definitely help, as the current proposal does not seem to indicate precisely what calculated values are stored.

To me, a zarr extension is something that must be implemented by the zarr implementation itself, e.g. zarr-python. In contrast, something like ome-zarr is "layered on top of zarr", and makes use of a zarr implementation but does not require any changes to the zarr implementation. In principle the same package could implement both zarr and ome-zarr together, but likely would still employ a layered architecture where there is a lower-level "zarr" layer and then an ome-zarr layer on top.

This proposal seems like it falls under the "layered on top of zarr" category:

It can, and has been, implemented without any changes to the zarr implementation.
The additional information is stored as additional regular zarr arrays and zarr user attributes.
It relates to an operation (computing certain weighted sums) that is not normally provided by the zarr implementation itself.
The additional information is, I think, computed as a batch operation after the base array has already been written. Likely this would be done using a distributed processing framework like dask, which might be outside the scope of a zarr implementation.

hailiangzhang · 2023-05-17T20:45:07Z

Thanks for your responses @rabernat and @jbms !
Actually this proposal is probably comparable to the approach used in the Xarray library, which introduces specific fields such as _ARRAY_DIMENSIONS within Zarr attributes to facilitate the implementation of the Xarray library, allowing it to provide labels and support operations on named dimensions within Zarr arrays. Similarly, this proposal introduces schemas to the data group and array attribute to describe the chunk level statistics and facilitate accumulation-based data analysis.
I remember at some point it was mentioned that the Xarray spec should have been incorporated into the Zarr specification, so I thought this proposal may be a candidate too. However, I must acknowledge that there may be some finer details or guidelines that I've overlooked. Regardless, I am comfortable proceeding in whichever direction aligns best with the existing guidelines:)
BTW, @jbms , despite this, I remember you have been asking for some source codes to help you understand some low level details. My apologies for not being able to dedicate more time to this task, but a cleaner version of library to prepare the accumulation data should be coming up very soon and will keep you updated.

briannapagan · 2023-11-20T17:35:15Z

Hi @MSanKeys963 @jbms @joshmoore @rabernat - we would like to submit this work for consideration of a The White House Office of Science & Technology Policy Open Science Recognition Challenge - https://www.challenge.gov/?challenge=ostp-year-of-open-science-recognition-challenge. Do I have your permission to list you all as collaborators? The form is rather simple, there is no formal way (i.e. adding emails etc), just wanted to at least list the names/affiliations. Also am I missing anyone else that may have helped push this forward (@hailiangzhang)?

jbms · 2023-11-20T17:46:56Z

I don't think my own involvement in reviewing this proposal is sufficient to be listed as a collaborator.

I actually would still very much welcome some clarifications about precisely how the cumulative sums are stored during the precomputation step and how they are used to compute the result of a query, so that I can better understand this proposal.

joshmoore · 2023-11-21T16:54:31Z

Hi @briannapagan. No objection to being listed, and generally happy to do what I can to recognize open science! 😄

rabernat · 2023-11-21T19:48:20Z

Same as Jeremy, I don't think I've done anything to merit authorship here. IMO lots more work needed to move this forward.

briannapagan · 2023-11-21T21:49:18Z

Thank you for the responses all! @hailiangzhang will provide some updates as there's been quite some work done since March. However, I think at this time we will skip submitting anything for recognition but might be of interest for your other zarr efforts! Cheers.

sanketverma1704 · 2023-11-30T15:50:14Z

Thanks for the update, @briannapagan.

Looking forward to the updates here.

This a bit of short notice, but it'd be great if you or @hailiangzhang could join the ZEP meeting today or the next one on (12/14) so that we can move this forward and assist you in any way possible. Thanks!

HailiangZhangNASA added 29 commits February 13, 2023 10:39

Add zarr-accumulation/v1.0.rst draft

494def1

zarr-accumulation.rst added

a31ce72

zarr-accumulation/v1.0.rst more content added

64dc873

zarr-accumulation/v1.0.rst hyperlink polished

1690563

zarr-accumulation/v1.0.rst minor cleaning

412a440

zarr-accumulation/v1.0.rst code formatting

bc022ea

zarr-accumulation/v1.0.rst code formatting

e1c4412

zarr-accumulation/v1.0.rst more content added

6887e22

zarr-accumulation/v1.0.rst code formatting

4f6eeda

zarr-accumulation/v1.0.rst more content added

63b8897

zarr-accumulation/v1.0.rst code formatting

6329a42

zarr-accumulation/v1.0.rst code formatting

b489021

zarr-accumulation/v1.0.rst code formatting

7b40c00

zarr-accumulation/v1.0.rst more content added

e76ed28

zarr-accumulation/v1.0.rst more content added

800ff95

zarr-accumulation/v1.0.rst more content added

ac27f9f

zarr-accumulation/v1.0.rst more content added

ec3ac00

zarr-accumulation/v1.0.rst code formatting

4030f36

zarr-accumulation/v1.0.rst more content added

dc1e559

zarr-accumulation/v1.0.rst code formatting

5ddad67

zarr-accumulation/v1.0.rst code formatting

4181d85

zarr-accumulation/v1.0.rst more content added

be5a3aa

more content added in Motivation

27ecba5

Detailed description added

68bf791

minor cleaning

82c2948

minor cleaning

0d2e1e7

minor cleaning

87f3d70

minor cleaning

d6da2f2

hyperlink provided

c302847

hailiangzhang mentioned this pull request Feb 14, 2023

Extension proposal for zarr accumulation zarr-developers/zeps#31

Merged

rabernat mentioned this pull request Mar 17, 2023

zarr.core.Array.chunk_store changes type after zarr=2.23.3 (after PR 1304) zarr-developers/zarr-python#1362

Closed

sanketverma1704 mentioned this pull request Mar 21, 2023

Minor changes to ZEP0000 and add a new page for ZIC zarr-developers/zeps#27

Merged

danlooo mentioned this pull request Apr 29, 2024

Add summary statistics for variables danlooo/DGGS.jl#87

Closed

DahnJ mentioned this pull request Jun 28, 2024

Incrementally-populated Zarr Arrays #300

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extension proposal for zarr accumulation #205

Extension proposal for zarr accumulation #205

hailiangzhang commented Feb 14, 2023 •

edited

Loading

jbms commented Feb 16, 2023

hailiangzhang commented Feb 16, 2023

jbms commented Mar 23, 2023 •

edited

Loading

hailiangzhang commented Mar 23, 2023

jbms commented Mar 23, 2023

jbms commented Mar 23, 2023

hailiangzhang commented Mar 23, 2023

rabernat commented May 17, 2023

jbms commented May 17, 2023

hailiangzhang commented May 17, 2023

briannapagan commented Nov 20, 2023

jbms commented Nov 20, 2023

joshmoore commented Nov 21, 2023

rabernat commented Nov 21, 2023 •

edited

Loading

briannapagan commented Nov 21, 2023

sanketverma1704 commented Nov 30, 2023 •

edited

Loading

Extension proposal for zarr accumulation #205

Are you sure you want to change the base?

Extension proposal for zarr accumulation #205

Conversation

hailiangzhang commented Feb 14, 2023 • edited Loading

jbms commented Feb 16, 2023

hailiangzhang commented Feb 16, 2023

jbms commented Mar 23, 2023 • edited Loading

hailiangzhang commented Mar 23, 2023

jbms commented Mar 23, 2023

jbms commented Mar 23, 2023

hailiangzhang commented Mar 23, 2023

rabernat commented May 17, 2023

jbms commented May 17, 2023

hailiangzhang commented May 17, 2023

briannapagan commented Nov 20, 2023

jbms commented Nov 20, 2023

joshmoore commented Nov 21, 2023

rabernat commented Nov 21, 2023 • edited Loading

briannapagan commented Nov 21, 2023

sanketverma1704 commented Nov 30, 2023 • edited Loading

hailiangzhang commented Feb 14, 2023 •

edited

Loading

jbms commented Mar 23, 2023 •

edited

Loading

rabernat commented Nov 21, 2023 •

edited

Loading

sanketverma1704 commented Nov 30, 2023 •

edited

Loading