Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZEP0001 Review #227

Closed
jbms opened this issue Apr 6, 2023 · 48 comments
Closed

ZEP0001 Review #227

jbms opened this issue Apr 6, 2023 · 48 comments

Comments

@jbms
Copy link
Contributor

jbms commented Apr 6, 2023

Hi everyone,

I hope you’re doing well.

After working continuously for several months, the ZEP0001 - V3 Specification is ready for review by you, the Zarr implementations! 🎉 The goal of this process is to move towards supporting the V3 specification in as many implementation as possible.

There have been significant changes to the V3 since it was proposed in July 2022. We have maintained a GitHub project board to keep track of the progress. We’re pleased to confirm that all crucial issues have been marked completed after holding several ZEP meetings.

Let me review how the process will work. We have created this issue to track the approvals from the ZSC, ZIC and the broader Zarr community.
Specific technical feedback about the specification should be made via narrowly scoped issues on the Zarr specs repo that link to this issue.

Now, according to the section, ‘How does a ZEP becomes accepted’ - ZEP0000, a ZEP must satisfy three conditions for approval:

  • Unanimous approval of the Zarr Steering Council
  • Majority approval of the Zarr Implementations Council
  • And, no vetos from the Zarr Implementations Council

As an implementation council member, you have three options for your vote:

  • YES - Approve the V3 Spec. This indicates that your implementation intends to implement support for the V3 core spec (but not necessarily any extensions).
  • ABSTAIN - This would be appropriate if you do not intend to implement the V3 spec.
  • VETO - As as ZIC member, you have the ability to veto a ZEP, including this one. However, stongly wish to avoid this situation for V3, given its foundational nature for the future of Zarr.

We request you, the ZIC, review the new version of the Zarr specification and let us know your thoughts. We’ve listed some steps to read and understand the new specification version more quickly. They are as follows:

  • Optional but recommended: Have a cursory read of the Zarr V2 Specification
  • After this, go through the section ‘Comparison with Zarr V2’ in the new Specification
  • Once done, go through Core V3 Specification
    • Please thoroughly read the new Specification; if there are any issues, we’ll address them ASAP!
    • One of the most critical sections in the V3 is how the metadata is organised; please have a look here
  • Gather your thoughts
  • Optional but recommended: begin implementing the spec experimentally.
  • Cast your vote by dropping a comment on the issue here

We understand that the tasks mentioned above take time, accounting for the daily work life of the council members. Therefore, we’d greatly appreciate it if you could cast your vote by no later than 6th May 2023, 23:59:59 AoE.

And that’s it! Once we’re past the voting phase, the V3 Specification will be officially adopted!
Feel free to check out an experimental Python implementation of V3 here.

Please let us know if there are any questions. Thank you for your time.

@zarr-developers/implementation-council

@jbms jbms pinned this issue Apr 6, 2023
@clbarnes
Copy link
Contributor

clbarnes commented Apr 18, 2023

Is the version of zarrita linked above intended to be compliant with the spec linked above? The way that sharding is handled seems slightly different in a way which is confusing to me (although not necessarily anyone else 😅 ). Is the intention that the array metadata specifies the shape of chunks (of which several make up a shard) and the sharding transformer specifies the number of shards within it? So the array would make a request to its transformers of, say, chunk (2, 2); if there were no transformers, that request would go to the store, but if there was a sharding transformer with chunks_per_shard (2,2), then that transformer would make a partial read request to the store for the end of shard (1, 1) to find where the shard's chunk (2, 2) was and then read that? As an aside, does that mean the dumb store without a partial read implementation would do 2 full shard reads (possibly over a network, if it didn't have a caching layer)? I suppose there's no point in sharding if you don't have a partial read implementation.

I'm also being thrown off a bit as I think some of the explanatory text in the sharding transformer is out of sync with some other parts of the spec - it refers to the /data/root key prefix which I think was deprecated, and doesn't include a dimension separator between the c and the first chunk index.

The other thing which isn't clear to me is what codecs ingest and produce - some more explanation here would be useful. The compression codecs seem to go bytes -> bytes, the endianness codec goes value -> value (as it's an optional codec, it can't be the thing which goes bytes -> value), and the transpose codec goes ndarray -> ndarray. How does an implementation know when to interrupt the codec chain to convert bytes to linear values and linear values to an ndarray? The explanation which works in my head is that every codec ingests and produces a tuple of (shape, order, dtype, endianness, bytes), and the codecs mutate any of those fields, only materialising the bytes -> values -> ndarray once they're all done? This might incur some unnecessary roundtripping for e.g. compressors which need the actual values but I imagine that's fairly quick compared to IO.

Apologies if I've missed the answers to any of these questions in previous discussions!

@jbms
Copy link
Contributor Author

jbms commented Apr 18, 2023

Thanks for your review.

Is the version of zarrita linked above intended to be compliant with the spec linked above? The way that sharding is handled seems slightly different in a way which is confusing to me (although not necessarily anyone else sweat_smile ). Is the intention that the array metadata specifies the shape of chunks (of which several make up a shard) and the sharding transformer specifies the number of shards within it?

To be clear, this review does not cover sharding --- there will be a separate review for ZEP2. However, the way sharding fits in with the base v3 spec is certainly relevant.

The sharding proposal has changed recently --- see the updated version here:
https://github.com/zarr-developers/zarr-specs/blob/main/docs/v3/codecs/sharding-indexed/v1.0.rst
I apologize for the delay in merging #223.

So the array would make a request to its transformers of, say, chunk (2, 2); if there were no transformers, that request would go to the store, but if there was a sharding transformer with chunks_per_shard (2,2), then that transformer would make a partial read request to the store for the end of shard (1, 1) to find where the shard's chunk (2, 2) was and then read that? As an aside, does that mean the dumb store without a partial read implementation would do 2 full shard reads (possibly over a network, if it didn't have a caching layer)? I suppose there's no point in sharding if you don't have a partial read implementation.

The new sharding specification is defined a bit differently but ultimately works the same in terms of I/O. Yes, sharding was designed specifically for cases where partial I/O is supported. Without partial I/O it would still allow you to decode just a single sub-chunk, but that isn't likely to be particularly useful.

I'm also being thrown off a bit as I think some of the explanatory text in the sharding transformer is out of sync with some other parts of the spec - it refers to the /data/root key prefix which I think was deprecated, and doesn't include a dimension separator between the c and the first chunk index.

That should be fixed in the updated sharding specification.

The other thing which isn't clear to me is what codecs ingest and produce - some more explanation here would be useful. The compression codecs seem to go bytes -> bytes, the endianness codec goes value -> value (as it's an optional codec, it can't be the thing which goes bytes -> value), and the transpose codec goes ndarray -> ndarray. How does an implementation know when to interrupt the codec chain to convert bytes to linear values and linear values to an ndarray? The explanation which works in my head is that every codec ingests and produces a tuple of (shape, order, dtype, endianness, bytes), and the codecs mutate any of those fields, only materialising the bytes -> values -> ndarray once they're all done? This might incur some unnecessary roundtripping for e.g. compressors which need the actual values but I imagine that's fairly quick compared to IO.

This section attempts to explain the process:
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#conversion-between-multi-dimensional-array-and-byte-string-representations

This PR (which needs to be rebased) tries to clarify the language:
https://github.com/zarr-developers/zarr-specs/pull/222/files

But we do need to clarify for each codec whether it is "bytes -> bytes" or "array -> bytes" or "array -> array".

As you noted, transpose is "array -> array".
endian is "array -> bytes".
gzip and blosc are "bytes -> bytes".

Essentially, when writing (i.e. encoding):

We can think of the codec as a simple function that maps:

(bytestring, element_size) -> (bytestring, element_size);
array -> array; or
array -> (bytestring, element_size)

The reason we have to associate an element_size with a bytestring is to accommodate blosc (see #225 (comment)), since it supports a shuffle option that depends on the "element size".

If we are trying to apply a codec that expects a bytestring as input, but we still have an array, then we convert the array to a (bytestring, element_size) pair by using the default binary encoding of the data type (little endian for all currently defined data types), in lexicographical (row major) order,. This is equivalent to inserting {"name": "endian", "endian": "little"} as an additional codec in between. For variable-length string data types to be added in the future, where there may not be a default binary encoding, it will likely be an error to have a bytes -> bytes codec without a preceding array -> bytes codec. Instead an array -> bytes codec for use with such data types will need to also be defined.

The spec only deals with "logical" arrays, that have a logical data type and shape. The physical layout in memory is up to the implementation. In practice I expect implementations to use a representation similar to numpy, that allows any arbitrary memory order, so that transpose can be implemented as a lazy operation that just permutes the shape and recorded memory order without actually transposing the data. Some implementation may also choose to allow both little endian and big endian representations in memory, so that the endian codec can also be a lazy operation that does not actually convert the in-memory representation. Other implementations may always store arrays in-memory with native endianness, in which case if the endian codec specifies a non-native endianness, byte swapping would actually be required.

Implementations may also wish to use a streaming interface for handling the bytestring input/output of a codec.

When reading, we have the additional complication of supporting partial I/O:

Logically an array -> array codec receives as input a "lazy array" interface provided by the next codec in the chain, and returns a "lazy array" interface representing the decoded array. The "lazy array" interface should provide operations to read portions of an array, similar to the array interface provided by zarr itself. For example, the lazy array returned by the "transpose" codec would need to translate read requests that it receives into transposed read requests, perform them on the lazy array provided by the next codec in the chain, and then return the result after transposing it.

An array -> bytes codec receives as input a "file handle" interface provided by the next codec in the chain (or the store/storage transformer itself), and returns a "lazy array" interface representing the decoded array. For example, the "endian" codec, if it supported partial I/O, could translate a request to read a region of the array into a request for a particular byte range, or set of byte ranges, on the "file handle". Then after receiving the response it can perform the endianness conversion.

A bytes -> bytes codec receives as input a "file handle" interface provided by the next codec in the chain, and returns a "file handle" interface representing the decoded data.

Codecs don't have to support partial I/O, and in practice most implementations probably won't support partial I/O except for specific codecs like the sharding codec. A codec that does not support partial I/O can simply immediately read the entire array or file handle it receives from the next codec in the chain, and then decode the result into an in-memory byte string or array. Finally, it can return a lazy file handle / lazy array interface that simply accesses to the in-memory byte string or array.

@clbarnes
Copy link
Contributor

clbarnes commented Apr 19, 2023

Thank you for the explanation and for merging the PRs to update the spec! That does make more sense.

As you say, sharding does seem to impact the design of the core spec so forgive me if I keep asking questions about it! And also apologies for not getting involved in the discussion earlier.

  • Now that the array's chunk_shape refers to the shard shape and the shard codec config controls the size of inner chunks, how does an array request that only a particular inner chunk is read and decoded?
  • In the binary shard format, indices as a suffix suggests store implementations also need to be able to get the size of the object. For a remote object store, that may mean 3 requests to get an inner chunk - one to find the length, one to read the footer, and one to actually read the data. Or is the intention to always download the entire shard? I guess not if partial-read stores are still being discussed. I know there's a trade-off here between append-only stores, request round-trips, download volume, and decompression volume, just trying to get my head around the design in e.g. rust where readable buffers are not necessarily seekable, and seeking forward may require downloading the whole payload anyway.

With the inclusion of sharding, "codec" seems to cover half a dozen different APIs (bytes -> bytes, array -> bytes, array -> array, plus partial variants which in the array case need to cover relatively complex slicing logic), where the partials need to be able to pass through each other's arguments and so on. This is more of a problem in stricter languages (like rust) where those interfaces need to be quite explicit.

@clbarnes
Copy link
Contributor

Additionally, for the blosc codec, rather than giving a shuffle mode of -1 and then having to pass that and the item size down the codec chain, could we instead just store the concrete shuffle mode, and rather have clients which add codecs to an array convert a -1 into whatever concrete mode is appropriate? This still means that the codec constructor needs to know the array context, but it's a lot easier to pass it in there than to change the whole codec API to support a serialisation format which doesn't need that flexibility.

@normanrz
Copy link
Member

normanrz commented Apr 19, 2023

  • In the binary shard format, indices as a suffix suggests store implementations also need to be able to get the size of the object. For a remote object store, that may mean 3 requests to get an inner chunk - one to find the length, one to read the footer, and one to actually read the data. Or is the intention to always download the entire shard? I guess not if partial-read stores are still being discussed. I know there's a trade-off here between append-only stores, request round-trips, download volume, and decompression volume, just trying to get my head around the design in e.g. rust where readable buffers are not necessarily seekable, and seeking forward may require downloading the whole payload anyway.

Servers that support HTTP range requests typically allow you to request a suffix with a specified length. So, you could request just the last N bytes from the server. This includes the main cloud storage systems such as S3 and GCS. Since the shard index size can be determined just by looking at the json metadata, you're down to 2 requests to get an inner chunk.

@jbms
Copy link
Contributor Author

jbms commented Apr 20, 2023

Thank you for the explanation and for merging the PRs to update the spec! That does make more sense.

As you say, sharding does seem to impact the design of the core spec so forgive me if I keep asking questions about it! And also apologies for not getting involved in the discussion earlier.

  • Now that the array's chunk_shape refers to the shard shape and the shard codec config controls the size of inner chunks, how does an array request that only a particular inner chunk is read and decoded?

Probably the array should pass in some description of the region that is being requested, e.g. a bounding box, or perhaps something analogous to a numpy selection tuple or a tensorstore index transform.

  • In the binary shard format, indices as a suffix suggests store implementations also need to be able to get the size of the object. For a remote object store, that may mean 3 requests to get an inner chunk - one to find the length, one to read the footer, and one to actually read the data. Or is the intention to always download the entire shard? I guess not if partial-read stores are still being discussed. I know there's a trade-off here between append-only stores, request round-trips, download volume, and decompression volume, just trying to get my head around the design in e.g. rust where readable buffers are not necessarily seekable, and seeking forward may require downloading the whole payload anyway.

Definitely the intention is not to require always downloading the entire shard --- that would defeat the whole purpose of sharding. I'd suggest that your "store" abstraction provide a way to request specific byte ranges, and as part of that byte range request interface, allow you to request the last N bytes without knowing the total size, as is supported by HTTP/S3/GCS. You might also want to make the I/O async.

With the inclusion of sharding, "codec" seems to cover half a dozen different APIs (bytes -> bytes, array -> bytes, array -> array, plus partial variants which in the array case need to cover relatively complex slicing logic), where the partials need to be able to pass through each other's arguments and so on. This is more of a problem in stricter languages (like rust) where those interfaces need to be quite explicit.

Yes, you can think of it as 3 types of codecs --- you have to apply the "array -> array" codecs, then at most one "array -> bytes" codec, then the "bytes -> bytes" codecs. If there is no "array -> bytes" codec specified, then you can use the default one implied by the data type, which is {"name": "endian", "endian": "little"} for all the current data types. So you could have a list of "array -> array" codecs, a single "array -> bytes" codec, and a list of "bytes -> bytes" codecs.

As far as the distinction between all-at-once and partial I/O codecs, you could, for each of the 3 codec types, have one all-at-once interface and one partial I/O interface, and create an adapter for each of the 3 types that implements the partial I/O interface in terms of the all-at-once interface. That way when applying the codecs you only need to worry about the partial I/O interface. Or you might find that you don't have any "bytes -> bytes" codecs for which you want to support partial I/O, and therefore don't need to bother defining a partial I/O interface for "bytes -> bytes" codecs.

I think as part of the implementation you will definitely want an in-memory array type that allows the data type, number of dimensions, and memory layout to all be determined at runtime. Using static typing for any of those 3 things isn't going to work very well since an "array -> array" codec might use any intermediate data type.

@jbms
Copy link
Contributor Author

jbms commented Apr 20, 2023

Additionally, for the blosc codec, rather than giving a shuffle mode of -1 and then having to pass that and the item size down the codec chain, could we instead just store the concrete shuffle mode, and rather have clients which add codecs to an array convert a -1 into whatever concrete mode is appropriate? This still means that the codec constructor needs to know the array context, but it's a lot easier to pass it in there than to change the whole codec API to support a serialisation format which doesn't need that flexibility.

Yes, that is indeed what I originally proposed in #225

Perhaps we can continue the discussion in that PR.

@clbarnes
Copy link
Contributor

clbarnes commented Apr 20, 2023

Probably the array should pass in some description of the region that is being requested, e.g. a bounding box, or perhaps something analogous to a numpy selection tuple or a tensorstore index transform.

So here the Array class can't treat the codec chain as a black box which reads from the store and returns a chunk's worth of decoded data - the Array needs to know what each of the codecs are so that it can decide whether to pass any region descriptions into any of them, which may be affected by what order they're defined and which other codecs are present. So the Array will end up very tightly bound to codec implementations, including how they combine with each other (e.g. keeping track of offsets between different codecs supporting region descriptions) and how to map from which voxels it needs into which regions it needs from each codec.

I guess my concern is that if sharding is going to slot into a generic extension point (as a codec or storage transformer), then zarr implementations need to be ready for any number of extensions which (ab)use the gates opened to allow sharding to fit. If other extensions ought not to use those features, sharding would need to be its own extension point, even if it's really a few existing extension points sharing a trenchcoat.

@jbms
Copy link
Contributor Author

jbms commented Apr 20, 2023

Probably the array should pass in some description of the region that is being requested, e.g. a bounding box, or perhaps something analogous to a numpy selection tuple or a tensorstore index transform.

So here the Array class can't treat the codec chain as a black box which reads from the store and returns a chunk's worth of decoded data - the Array needs to know what each of the codecs are so that it can decide whether to pass any region descriptions into any of them, which may be affected by what order they're defined and which other codecs are present. So the Array will end up very tightly bound to codec implementations, including how they combine with each other (e.g. keeping track of offsets between different codecs supporting region descriptions) and how to map from which voxels it needs into which regions it needs from each codec.

It is true that the codec interface is definitely more complicated, but I think it can nonetheless still be an abstraction layer.

For example, let's say that our Array class will support reading arbitrary rectangular regions. Then we might define the following interfaces, using Rust pseudo-syntax (probably a lot of details wrong here, as I am not super familiar with Rust, but I think this conveys the idea):

trait ArrayReader {
  fn read(&self, region: &Region) -> Result<NDArray, Error>;
}

trait ArrayToArrayCodec {
  fn decode(&self, source: Box<dyn ArrayReader>) -> Result<Box<impl ArrayReader>, Error>;
  fn encode(&self, source: NDArray) -> Result<NDArray, Error>;
}

trait ArrayToBytesCodec {
  fn decode(&self, source: Box<dyn BytesReader>) -> Result<Box<impl ArrayReader>, Error>;
  fn encode(&self, source: NDArray) -> Result<Vec<u8>, Error>;
}

trait BytesToBytesCodec {
  fn decode(&self, source: Box<dyn BytesReader>) -> Result<Box<impl BytesReader>, Error>;
  fn encode(&self, source: Vec<u8>) -> Result<Vec<u8>, Error>;
}

trait SimpleArrayToArrayCodec {
  fn encode(&self, source: NDArray) -> Result<NDArray, Error>;
  fn decode(&self, source: NDArray) -> Result<NDArray, Error>;
}

impl ArrayToArrayCodec for SimpleArrayToArrayCodec {
  fn decode(&self, source: Box<dyn ArrayReader>) -> Result<Box<impl ArrayReader>, Error> {
    Ok(box SimpleArrayReader(SimpleArrayToArrayCodec::decode(source.read(Region::full())?)))
  }
    
  fn encode(&self, source: NDArray) -> Result<NDArray, Error> {
    SimpleArrayToArrayCodec::encode(self, source)
  }
}

struct SimpleArrayReader(NDArray);

impl ArrayReader for SimpleArrayReader {
  fn read(&self, region: &Region) -> Result<NDArray, Error> {
    self.0.get_region(region)
  }
}

struct TransposeCodec {
  permutation: Vec<i32>;
}

struct TransposeCodecArrayReader {
  permutation: Vec<i32>;
  inv_permutation: Vec<i32>;
  source: Box<dyn ArrayReader>;
}

impl ArrayReader for TransposeCodecArrayReader {
  fn read(&self, region: &Region) -> Result<NDArray,Error> {
    self.source.read(region.transpose(self.inv_permutation)?)?.transpose(self.permutation)?
  }
}

impl ArrayToArrayCodec for TransposeCodec {
  fn decode(&self, source: Box<dyn ArrayReader>) -> Result<Box<dyn ArrayReader>, Error> {
    Ok(TransposeCodecArrayReader{permutation: &self.permtuation, inv_permutation: invert_permutation(&self.permutation)?, source: source})
  }

  fn encode(&self, source: NDArray) -> Result<NDArray, Error> {
    self.source.transpose(&self.permutation)
  }
}

@normanrz
Copy link
Member

Similar to what @jbms just posted, in zarrita instead of passing arrays or bytes to the codecs, I pass a value handle (like the Bytes/ArrayReader from above) that holds a reference to a memory buffer, file or object and provides a uniform interface for reading and writing. Essentially, this defers the IO operation until all the details are known (e.g. byte ranges from codecs). This allowed me to make the codec abstraction work cleanly (see https://github.com/scalableminds/zarrita/blob/v3/zarrita/array.py#L265-L267).

@constantinpape
Copy link

First of all it's great to see all this effort going into the V3 specification and I think this is a great improvement over V2.
I fully support the adoption of this, and the minor comments I had on this a while back have been addressed.

Officially I will vote with ABSTAIN, since I don't have the capacity to implement the changes in z5/z5py at this point.

@MSanKeys963
Copy link
Member

Hi everyone.

I just wanted to re-surface the deadline for the ZEP0001 review, i.e. 6th May 2023, 23:59 AoE. Please go through the issue and vote for the current draft of V3 Spec.

If you have any questions/issues, feel free to ask them here or create a new issue as mentioned in the description of this issue,
We are looking forward to your feedback. Thanks!

CC: @zarr-developers/implementation-council

@jbms
Copy link
Contributor Author

jbms commented May 2, 2023

I vote YES for tensorstore and Neuroglancer.

@DennisHeimbigner
Copy link

I should note that the above proposed interface to codecs is painfully difficult to implement
for C.

@DennisHeimbigner
Copy link

This statement in the V3 spec is a really bad idea:

The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.

It completely violates the "read-broadly, write-narrowly" principle. Instead, it should read something like this:

The array metadata object may contain other names than those listed above. If an implementation does not understand the additional names, it should ignore them. It is up to the definer of those other names to ensure that ignoring them will still allow other implementations to use that file.

@jbms
Copy link
Contributor Author

jbms commented May 2, 2023

This statement in the V3 spec is a really bad idea:

The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.

It completely violates the "read-broadly, write-narrowly" principle. Instead, it should read something like this:

The array metadata object may contain other names than those listed above. If an implementation does not understand the additional names, it should ignore them. It is up to the definer of those other names to ensure that ignoring them will still allow other implementations to use that file.

The intended purpose of that statement is to allow spec evolution as follows. For example, say we wish to add support for a non-zero origin of the coordinate space, or some other coordinate space transformation, as an optional feature in the future. Implementations that don't support this feature should fail when trying to open an array that uses this feature. With the current restriction that implementations should fail if they encounter an unknown field, we can simply add an origin field to the array metadata, and implementations that don't support it will fail as intended.

If we instead say that unknown fields are ignored, we will need some other way to cause implementations that don't support the feature to fail. There are various options:

  • Change the zarr_format value.
  • Change or remove some other existing required member. For example, we could eliminate the shape member, and instead introduce an exclusive_max member.
  • Introduce a different mechanism, like "required_features": ["origin"], to tell implementations to fail.

Any of these options is potentially valid. The drawback to incrementing zarr_format is that it essentially requires a linear order for spec evolution. In order for an implementation to support any optional feature, it must at least be aware of every previously-introduced optional feature so that it can properly fail if that feature is in use.

Introducing a separate required_features member would be fine, but it seemed reasonable to instead just encode whether a given attribute indicates a required feature or not, based on whether it has a nested "must_understand" member.

I assume that you are worried about adding optional features that implementations don't need to know about. What do you see as the problem with using {"my_optional_feature": {"must_understand": False, ...}} as is proposed by the current spec?

@DennisHeimbigner
Copy link

But the must-understand feature solves this problem, right? If the implementation encounters
and unknown name, and there is no must-understand, then it is ok to fail. But if there is no must-understand,
then ignoring the name should be the right thing to do.

@jbms
Copy link
Contributor Author

jbms commented May 2, 2023

But the must-understand feature solves this problem, right? If the implementation encounters and unknown name, and there is no must-understand, then it is ok to fail. But if there is no must-understand, then ignoring the name should be the right thing to do.

Yes, that is right, except that currently must_understand defaults to true. The biggest impact of the choice is that in order to specify must_understand the member must be an object member, so with the current spec, any must_understand=false members must be objects, and any non-object members are necessarily must_understand=true.
We imagined adding in the future more fine-grained options, like must_understand_for_reading and must_understand_for_writing. With the current default of must_understand=true, we could do that without duplicating the plain must_understand=true, while if must_understand defaults to false, then we would have to add must_understand=true, must_understand_for_reading=false or something like that.

The use of must_understand as a nested member does introduce one other unfortunate limitation: we can't add any new members of object type where the keys are arbitrary user-defined strings, since we would need to exclude "must_understand" as a valid string.

@DennisHeimbigner
Copy link

You are making this way too complex. I believe you should change the default for must-understand to false.

@jbms
Copy link
Contributor Author

jbms commented May 2, 2023

I should note that the above proposed interface to codecs is painfully difficult to implement for C.

I definitely agree that the codec interface is pretty complicated if you want to support partial reading, but are there issues specific to C?

If we want to allow partial reading and hierarchical chunking such as sharding I think the complexity may be unavoiable.

@DennisHeimbigner
Copy link

Actually the issue isn't C per-se, but rather compiled vs interpreter languages.

@jbms
Copy link
Contributor Author

jbms commented May 2, 2023

Actually the issue isn't C per-se, but rather compiled vs interpreter languages.

Is that in regards to the fact that there are 3 different types of codecs (array -> array, array -> bytes, and bytes -> bytes)? I haven't yet implemented the spec in tensorstore, but I plan to parse the codecs list into a list of array -> array codecs, an array -> bytes codec, and a list of bytes -> bytes codecs. At that point they can be used in sequence without needing the equivalent of an algebraic sum type.

@clbarnes
Copy link
Contributor

clbarnes commented May 3, 2023

I plan to parse the codecs list into a list of array -> array codecs, an array -> bytes codec, and a list of bytes -> bytes codecs

This is what I've tried to do in rust; raising an error if they're defined in the wrong order and inserting the default endian codec if it's not there. However, I do start from the algebraic sum type representation because that makes it easier to deserialise.

If we want to allow partial reading and hierarchical chunking such as sharding I think the complexity may be unavoiable.

The storage transformer specification for sharding (i.e. super-chunks and chunks, where the codec implementation is chunks and sub-chunks) seems simpler to me, at least in terms of the additional features it requires all codecs to implement. However, that could easily be because I've spent less time thinking about/ trying to implement it. I appreciate it leaves some possible features on the table, though, based on previous discussions.

@clbarnes
Copy link
Contributor

clbarnes commented May 3, 2023

Another question - where is caching expected to happen? Codecs which implement partial_decode in terms of decode must cache somewhere outside of the partial_read implementation. The sharding codec in particular could grind IO to a halt without caching, depending on the codecs it's surrounded by. Does this mean every codec needs its own cache, because it can't trust other codecs' implementation of partial_read?

Is the idea that sharding codecs' encode always writes the whole shard, too, as there's no partial_encode?

@normanrz
Copy link
Member

normanrz commented May 3, 2023

Is the idea that sharding codecs' encode always writes the whole shard, too, as there's no partial_encode?

The general idea is that shards are the unit of writing. However, if the underlying storage permits partial writes and the shard file contains enough space to hold the chunks to write, partial writes might be feasible.

@normanrz
Copy link
Member

normanrz commented May 3, 2023

The sharding codec in particular could grind IO to a halt without caching, depending on the codecs it's surrounded by.

There are multiple ways of solving the issue of IO overload. In zarrita, I am planning to coalesce requests in the storage layer. If multiple byte ranges are requested at the same time (or within a short time window), they are coalesced into one request. I believe fsspec has a similar mechanism. Reading full shards and caching them is another way of solving this issue. Some applications may actually benefit from the increased parallelization of reads against the IO layer. It is up to the implementations to evaluate these tradeoffs and make internal design decisions.

@jbms
Copy link
Contributor Author

jbms commented May 3, 2023

Another question - where is caching expected to happen? Codecs which implement partial_decode in terms of decode must cache somewhere outside of the partial_read implementation. The sharding codec in particular could grind IO to a halt without caching, depending on the codecs it's surrounded by. Does this mean every codec needs its own cache, because it can't trust other codecs' implementation of partial_read?

In general, for caching reads, I'd say you want to cache at the finest granularity at which partial I/O is supported. At the same time, you want to cache the decoded arrays, after performing any non-trivial decoding steps, so that read requests can be satisfied from the cache directly without any further decoding.

For example, if we have:

{...,
"codecs": [
  {"name": "transpose", "order": [2,0,1]},
  {"name": "sharding_indexed",
   "chunk_shape": [100, 100, 100], 
   "codecs": [{"name": "transpose", "order": [2, 1, 0]}, {"name": "gzip", "level": 5}]
  }
 ]
}

Then you'd want to cache at the level of the individual decoded gzip sub-chunks within each shard, and also cache the shard indices. If the transpose is done virtually with no actual transpose of the in-memory representation, then there is no need to cache the result of the transpose, because it can be handled while reading at zero cost. If transpose is implemented as a physical transpose in memory, then it would be better to cache the decoded inner transpose result rather than the decoded inner gzip result, but supporting partial I/O through the outer transpose would then be problematic.

@WardF
Copy link

WardF commented May 5, 2023

Registering the Unidata vote of YES on ZEP0001, with some PR's with minor clarifying language suggestions to follow shortly.

@clbarnes
Copy link
Contributor

clbarnes commented May 5, 2023

I'll admit I'm still struggling with what seems like a huge step up in complexity required by sharding, passing partial reads through unpredictable codec chains etc., but not enough to block the great work by everyone involved in pushing this over the line. I'll take your collective word that it's feasible and beg your forgiveness if I can't drive a rust implementation to completion any time soon. I am still figuring things out!

@grlee77
Copy link
Contributor

grlee77 commented May 6, 2023

I vote YES for Zarr-Python

@jstriebel
Copy link
Member

I'll admit I'm still struggling with what seems like a huge step up in complexity required by sharding, passing partial reads through unpredictable codec chains etc., but not enough to block the great work by everyone involved in pushing this over the line. I'll take your collective word that it's feasible and beg your forgiveness if I can't drive a rust implementation to completion any time soon. I am still figuring things out!

Thanks for the general support! Just for clarification: ZEP 1 and the v3 core spec don't require implementations to support partial reads or writes, and also no sharding. Sharding is being introduced in ZEP 2 and I'd recommend to move larger discussions of it to the respective review in #152.

A v3 compatible implementation does not need to support sharding necessarily. (However, it would of course be nice to consider this in the future.) The same holds for other extensions which might come. The focus of v3 is to specifically describe a common denominator (v3 core / ZEP 1), and at the same time allow extensibility, so that each implementation can define the extensions they support, and it's clear for written data which extension it relies on.

If you see room for making any of those points more clear (either in the ZEP, or in the spec), or reducing complexity anywhere, please feel free to suggest changes (also after the vote, especially if something comes up during implementation).

@aschampion
Copy link

aschampion commented May 6, 2023 via email

@jstriebel jstriebel added this to ZEP1 May 6, 2023
@jstriebel jstriebel moved this to Meta in ZEP1 May 6, 2023
@meggart
Copy link
Member

meggart commented May 8, 2023

Thanks a lot for preparing this and taking into account the concerns that were raised also from the Julia (1-based and column-major). I vote YES for JuliaIO/Zarr.jl.

@andersy005
Copy link
Member

apologies for casting the vote later than the stated deadline. I vote ABSTAIN for zarr-js

@rouault
Copy link
Contributor

rouault commented May 8, 2023

FYI GDAL master has just been updated to latest Zarr V3 spec (sharding not implement, which will be quite tricky to implement due to not being a regular "black box" codec) per OSGeo/gdal#7706

@alimanfoo
Copy link
Member

Hi all, sincere congratulations for all the work on ZEP1, it's very exciting to be reaching this point. As a member of the ZSC I vote YES.

@jakirkham
Copy link
Member

YES (as a member of ZSC)

@axtimwalde
Copy link

YES for n5-zarr

@rabernat
Copy link
Contributor

YES (as a member of ZSC)

@ryan-williams
Copy link
Member

YES (as a member of ZSC), very appreciative of everyone's work on this 🙏

@joshmoore
Copy link
Member

And with a final YES from the ZSC, I'd say that ZEP0001 has passed. 🎉 Endless thanks to everyone for the support in bootstrapping this process.

Now the exciting implementation phase begins. 👷‍♀️🤝👷‍♂️ For the @zarr-developers/implementation-council members and their collaborators, please don't hesitate to raise issues that arise.

Clarification and minor adjustment PRs are of course still within scope. By significant issues, further ZEPs may be necessary in order to align all the implementations but with ZEP0001 behind us I imagine that will be a breeze. 😉

@davidbrochart
Copy link
Contributor

A bit late, but I vote YES for xtensor-zarr.

@joshmoore
Copy link
Member

(Thanks, @davidbrochart. And sorry for having overlooked you! 😨)

@jhamman
Copy link
Member

jhamman commented Aug 8, 2023

Can this issue be closed now? And can the online version of the docs be updated to reflect that the spec as documented by ZEP0001 is no longer in DRAFT status?

@aliddell
Copy link

aliddell commented Aug 8, 2023

Can this issue be closed now? And can the online version of the docs be updated to reflect that the spec as documented by ZEP0001 is no longer in DRAFT status?

Came here to say this.

@joshmoore
Copy link
Member

Definitely closing this issue now 🎉 Currently the "draft" watermark functionality is an all-or-nothing kind of thing. @jstriebel was looking into having it on a per page basis.

@github-project-automation github-project-automation bot moved this from Meta to Done in ZEP1 Aug 8, 2023
@jhamman
Copy link
Member

jhamman commented Aug 8, 2023

Thanks @joshmoore!

Two PRs to make this clear:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests