Extension proposal for zarr accumulation #205
Thanks for this proposal. It seems like it may make sense to specify the stride in units of elements rather than chunks. Then this would more generally also cover downsampling. |
Thanks for your feedback @jbms ! Yes, it would be great if we could make this approach more flexible by specifying the stride in units of elements rather than chunks, but I think it is probably okay to save the aggregation statistics at the chunk level since:
|
@hailiangzhang As I mentioned at the meeting today, it would be very helpful if you can provide some mathematical formulas or pseudocode that describes which cumulative sums you are storing. Your proposal describes the attribute schema but doesn't seem to describe exactly which sums are actually being stored, or how to compute the weighted sums based on the stored values. It occurs to me that the following strategy would accomplish the goals of your proposal and might be similar to what you are doing:
Any clarification you can offer would be very helpful. |
Hi @jbms , thank you so much for writing down what's on your mind! I believe we're mostly aligned, except for some lower-level details. As you suggested, I will provide some code so you can look into it closely, but it may take some time, as I mentioned. In the meantime, I can try to provide the formula (similar to what you have done) ASAP. |
Sorry, I realize my notation wasn't very clear. I updated my comment to use the notation: |
Independent of these details, I think it is clear now that this exists as an additional layer entirely on top of zarr itself. I'm still interested in learning more about this proposal, and happy to provide further review/feedback, but per discussion at the ZEP meeting today, it appears that this proposal may not be a good fit for the ZEP process itself. The ZEP process is intended for changes to the zarr specification itself, and is intended to gather feedback from zarr implementors. This proposal more naturally exists as a layer entirely on top of zarr itself and requires no changes to zarr implementations like zarr-python. Therefore, it may make more sense for you to publish this standard independent of the zarr specification, as an external extension, similar to OME-zarr. If in the future we have a website that links to external extensions it could be included there. |
Hi @jbms , yes, I was fully aware that this proposal sits on top of Zarr itself. I actually raised this concern at some point, and it was suggested that I file a Zarr extension proposal rather than a spec proposal. Honestly, I also had a feeling that this proposal may not need to live in the Zarr core spec repo, and yes, we can host the spec as an external extension, similar to OME-zarr. I will provide a link when it's in place. Thanks a lot for your suggestions! |
Apologies for coming so late to this conversation. I am the one who encouraged @hailiangzhang to consider proposing this as an extension. I wanted to state that, to me, this does seem like a good fit for an optional extension. I see this proposal as comparable to parquet's row-group statistics. Implementations can use this information to implement more efficient queries, e.g. predicate pushdown, or, in Hailiang's case, more efficient sums / means over regions of the array. |
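To make the row-group-statistics analogy above concrete, here is a minimal sketch of how per-chunk statistics could let a reader skip chunks when evaluating a predicate. It is only an illustration, not the scheme in the proposal: the helper names `chunk_min_max` and `count_greater_than` are hypothetical, a plain Python list stands in for a chunked 1-D array, and slicing that list stands in for reading a chunk from storage.

```python
def chunk_min_max(data, chunk_size):
    """Precompute a (min, max) pair for every chunk of a 1-D sequence."""
    stats = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        stats.append((min(chunk), max(chunk)))
    return stats


def count_greater_than(data, stats, chunk_size, threshold):
    """Count elements > threshold, reading a chunk only when its stats are inconclusive.

    Returns (count, chunks_read) so the saving is visible.
    """
    count = 0
    chunks_read = 0
    for k, (lo, hi) in enumerate(stats):
        start = k * chunk_size
        chunk_len = min(chunk_size, len(data) - start)
        if hi <= threshold:
            continue  # no element in this chunk can match; skip it
        if lo > threshold:
            count += chunk_len  # every element matches; no read needed
            continue
        # Statistics are inconclusive: "read" the chunk (hypothetical I/O).
        chunks_read += 1
        count += sum(1 for v in data[start:start + chunk_len] if v > threshold)
    return count, chunks_read


data = [1, 2, 3, 4, 10, 11, 12, 13, 3, 9, 4, 8]
stats = chunk_min_max(data, chunk_size=4)
result = count_greater_than(data, stats, chunk_size=4, threshold=5)
```

In this toy input only the last chunk has to be read; the first is skipped outright and the second is counted from its statistics alone.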
A more precise specification of this proposal would definitely help, as the current proposal does not seem to indicate precisely what calculated values are stored. To me, a zarr extension is something that must be implemented by the zarr implementation itself, e.g. zarr-python. In contrast, something like ome-zarr is "layered on top of zarr", and makes use of a zarr implementation but does not require any changes to the zarr implementation. In principle the same package could implement both zarr and ome-zarr together, but likely would still employ a layered architecture where there is a lower-level "zarr" layer and then an ome-zarr layer on top. This proposal seems like it falls under the "layered on top of zarr" category:
|
Thanks for your responses @rabernat and @jbms ! |
Hi @MSanKeys963 @jbms @joshmoore @rabernat - we would like to submit this work for consideration in the White House Office of Science & Technology Policy Open Science Recognition Challenge - https://www.challenge.gov/?challenge=ostp-year-of-open-science-recognition-challenge. Do I have your permission to list you all as collaborators? The form is rather simple and there is no formal process (e.g. adding emails); I just wanted to at least list the names/affiliations. Also, am I missing anyone else who may have helped push this forward (@hailiangzhang)? |
I don't think my own involvement in reviewing this proposal is sufficient to be listed as a collaborator. I actually would still very much welcome some clarifications about precisely how the cumulative sums are stored during the precomputation step and how they are used to compute the result of a query, so that I can better understand this proposal. |
Hi @briannapagan. No objection to being listed, and generally happy to do what I can to recognize open science! 😄 |
Same as Jeremy, I don't think I've done anything to merit authorship here. IMO lots more work needed to move this forward. |
Thank you for the responses, all! @hailiangzhang will provide some updates, as quite a bit of work has been done since March. However, I think at this time we will skip submitting anything for recognition, but the updates might be of interest for your other zarr efforts! Cheers. |
Thanks for the update, @briannapagan. Looking forward to the updates here. This is a bit of short notice, but it'd be great if you or @hailiangzhang could join the ZEP meeting today or the next one (12/14) so that we can move this forward and assist you in any way possible. Thanks! |
This PR contains the Zarr extension proposal for a method that we developed at NASA GESDISC, Zarr-based Chunk-level Accumulation, which provides fast and cost-efficient data analysis services based on chunk-level statistics. The proposal mainly includes the Zarr group structure and attribute schema for the chunk-level statistics.
The related PR against the zeps repo is here.
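As the review comments note, the proposal text does not spell out exactly which sums are stored. The sketch below shows one possible reading of "chunk-level accumulation" for a 1-D array: a cumulative sum recorded at every chunk boundary, so that a range sum needs only two stored values plus reads of the partially covered edge chunks. The function names, the list-based "array", and the 1-D restriction are all assumptions made for illustration; this is not the GES DISC implementation.

```python
from itertools import accumulate


def build_chunk_prefix_sums(data, chunk_size):
    """Return prefix, where prefix[k] is the sum of the first k chunks."""
    chunk_sums = [
        sum(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)
    ]
    return [0] + list(accumulate(chunk_sums))


def range_sum(data, prefix, chunk_size, start, stop):
    """Sum data[start:stop], assuming 0 <= start and stop <= len(data).

    Fully covered chunks are answered from the stored prefix sums; only the
    partially covered edge chunks are read element by element.
    """
    if start >= stop:
        return 0
    first_full = -(-start // chunk_size)  # ceil: first chunk fully inside the range
    last_full = stop // chunk_size        # one past the last fully covered chunk
    if first_full >= last_full:
        # The range covers no complete chunk, so read it directly.
        return sum(data[start:stop])
    total = prefix[last_full] - prefix[first_full]     # whole chunks, no data read
    total += sum(data[start:first_full * chunk_size])  # leading partial chunk
    total += sum(data[last_full * chunk_size:stop])    # trailing partial chunk
    return total


data = list(range(1, 11))                     # 1..10, chunked as [1-4], [5-8], [9-10]
prefix = build_chunk_prefix_sums(data, chunk_size=4)
total = range_sum(data, prefix, 4, 2, 9)      # elements 3..9
```

Because the stored values are cumulative, the cost of a query in this sketch depends on the chunk size rather than on the length of the requested range. Storing plain per-chunk sums instead would require adding one value per covered chunk.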