Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add zstd codec #256

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Add zstd codec #256

wants to merge 1 commit into from

Conversation

jbms
Copy link
Contributor

@jbms jbms commented Aug 1, 2023

No description provided.

@jbms jbms requested a review from jstriebel August 1, 2023 07:41
@jbms
Copy link
Contributor Author

jbms commented Aug 1, 2023

@normanrz Please take a look.

Copy link
Member

@jstriebel jstriebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this still be part of ZEP 1? I'd argue for a mini-ZEP for this. IMO it's a good candidate to show that ZEPs can be quite small and concise, with a short duration for acceptance.

Apart from that LGTM!

@jbms
Copy link
Contributor Author

jbms commented Aug 3, 2023

Agreed that this should be separate from ZEP 1. I should update this document to not mention ZEP 1. But perhaps there should be a lighter weight process --- it would be nice to just get a vote on this PR directly, rather than writing up a separate ZEP document, etc.

There are going to be a lot of codecs added, and it would be nice to make that process relatively easy while still getting appropriate feedback.

@jstriebel
Copy link
Member

But perhaps there should be a lighter weight process --- it would be nice to just get a vote on this PR directly, rather than writing up a separate ZEP document, etc.

That would indeed be nice, but maybe we could have lean ZEPs, with less boilerplate and just pointing to an MR?

@MSanKeys963, what's your take on this?

@mkitti
Copy link
Contributor

mkitti commented May 2, 2024

Is there any chance we could change the default compression level to match ZSTD_CLEVEL_DEFAULT (3)?
https://github.com/facebook/zstd/blob/97291fc5020a8994019ab76cf0cda83a9824374c/lib/zstd.h#L129

@jbms
Copy link
Contributor Author

jbms commented May 3, 2024

This proposal doesn't indicate a default compression level if level is unspecified --- the level parameter is intended to be required, though I should clarify that.

When creating an array, implementations are free to provide a mechanism for the user to leave the level unspecified, however, and in that case the implementation must choose a default value.

The only note on default is that level 0 means the default level --- that follows the zstd C API.

If the array is intended to be used for writing, then I don't see much value in allowing the configuration options to be unspecified; that just leaves room for implementation variance. However, if someone is creating zarr metadata to somehow match some existing chunked data, that is known to be zstd compressed (but for which the level may not even be known), and which is intended to be read only, then the level is indeed not needed and it is a bit unfortunate that a fake value would have to be specified. Still I don't think we really need to optimize for this case.

@mkitti
Copy link
Contributor

mkitti commented May 8, 2024

Just to be clear, numcodecs.zstd.Zstd does document that the compression level defaults to 1 and sets DEFAULT_CLEVEL to 1. Default to the C header value would be a change from the current practice in Python's numcodecs.

https://numcodecs.readthedocs.io/en/stable/zstd.html#numcodecs.zstd.Zstd
https://github.com/zarr-developers/numcodecs/blob/79a0e933d78a51f32237b29fc375d0ee27aace37/numcodecs/zstd.pyx#L54

@mkitti
Copy link
Contributor

mkitti commented May 8, 2024

Furthermore, Python's numcodecs seems to disallow negative compression levels and interprets any nonpositive compression level as 1.

https://github.com/zarr-developers/numcodecs/blob/79a0e933d78a51f32237b29fc375d0ee27aace37/numcodecs/zstd.pyx#L83-L84

@jbms
Copy link
Contributor Author

jbms commented May 8, 2024

Default values are only really needed to support evolving the spec. Default values in the initial version of the spec basically just serve to save space in the stored metadata, which I think is pretty negligible in this case.

The zarr-python v3 implementation is still free to choose whichever parameters it likes if the user passes in zarr.codecs.Zstd() for example. Any defaults indicated in the spec would only apply when reading the stored metadata.

Supporting negative levels should be a trivial fix.

@mkitti
Copy link
Contributor

mkitti commented May 8, 2024

I think the fixes are trivial. The question is would it be considered breaking to change the default compression level?

It's not even clear that we need to include the compression level or the checksum flag in the codec configuration since that information is stored by the codec itself. The compression level is perhaps only really useful if someone is trying to add new compressed chunks to the dataset or if they are trying to update chunks. In that case we only need the information to be consistent with the rest of the dataset.

@normanrz
Copy link
Member

normanrz commented May 8, 2024

I think the fixes are trivial. The question is would it be considered breaking to change the default compression level?

I don't think the spec for the zstd codec should have default values. All options should be mandatory to set. As @jbms outlined, implementations can choose default values. I'll change the default level to 3 for zarr-python.

It's not even clear that we need to include the compression level or the checksum flag in the codec configuration since that information is stored by the codec itself. The compression level is perhaps only really useful if someone is trying to add new compressed chunks to the dataset or if they are trying to update chunks. In that case we only need the information to be consistent with the rest of the dataset.

The gzip and blosc codec also have mandatory level options (among others) for the reason you mentioned. I think the zstd codec should be consistent with that.

@jbms
Copy link
Contributor Author

jbms commented May 8, 2024

We could permit the parameters to be left unspecified in the metadata if only read operations are performed, and fail if any write operations are performed. However it is not clear that the added complexity of that is worth it --- it really just saves having to put in some arbitrary values if you don't know the level and only care about reading.

@zoj613
Copy link
Contributor

zoj613 commented Sep 11, 2024

Is the checksum parameter really necessary here? I went though the list of implementations in different languages and it seems the large majority do not support adding a checksum to the compressed output. Wouldn't this limit how many Zarr implementations can support this codec's spec?

@normanrz
Copy link
Member

I know of zstd libraries for Python, C/C++, Java, Javascript and Rust that support the checksum flag. Which languages are you referring to?

@zoj613
Copy link
Contributor

zoj613 commented Sep 11, 2024

I know of zstd libraries for Python, C/C++, Java, Javascript and Rust that support the checksum flag. Which languages are you referring to?

One example is OCaml. Both libraries available don't support this flag. Looking at the list of implementations at the Zstd official site, the other languages that don't include PHP, Fortran, Swift, Ruby, R, Perl, Common Lisp, Ada, Haskell, Julia, Racket, Nim, Elixir.

@mkitti
Copy link
Contributor

mkitti commented Sep 12, 2024

In Julia, the checksum flag is available through the low-level interface:
https://github.com/JuliaIO/CodecZstd.jl/blob/master/src%2FLibZstd_clang.jl#L154

@nhz2, perhaps we should discuss how to expose the entire parameter surface.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants