Variable length data types #40
Merged
Codecov Report

| Metric | main | #40 | +/- |
|---|---|---|---|
| Coverage | 79.56% | 81.33% | +1.76% |
| Files | 142 | 152 | +10 |
| Lines | 19544 | 20837 | +1293 |
| Hits | 15550 | 16947 | +1397 |
| Misses | 3994 | 3890 | -104 |
LDeakin force-pushed the `variable_length_data_types` branch 4 times, most recently from 79c61cc to ac7ebc7 (July 18, 2024 01:28)
LDeakin force-pushed the `variable_length_data_types` branch 2 times, most recently from 5bc59ea to b3110ea (July 25, 2024 01:24)
- Data type sizes are now represented by `DataTypeSize` instead of `usize`. Also adds `ArraySize`. Both are `Fixed` only for now. Need to support `Variable` throughout the codebase.
- Change codec API in prep for variable sized data types
- Enable `{Array,DataType}Size::Variable`
- Implement `CowArrayBytes::validate()` and add `CodecError::InvalidVariableSizedArrayOffsets`
- Use `CowArrayBytes::validate()`
- impl `From` for `CowArrayBytes` for various types
- Array `_element` methods now use `T: Element`
- Add `vlen` codec metadata
- Fix codecs bench
- Implement an experimental vlen codec
- Use `impl Into<ArrayBytesCow<'a>>` in array methods
- Use `RawBytesCow` consistently
- Remove various vlen todo's
- Cleanup `ArrayBytes`
- Use `ArrayError::InvalidElementValue` for invalid string encodings
- Add `ArraySubset::contains()`
- Add `FillValue::new_empty()`
- Add remaining vlen support to array `store_` methods and improve vlen validation
- Add remaining vlen support to array `retrieve_` methods
- Partial decoding in the vlen filter
- Fix async vlen errors
- Sharding codec vlen support
- Add vlen support to sharding partial decoder
- vlen support for sharded_readable_ext
- `offsets_u64_to_usize` handle 32-bit system
- Minor FillValue doc update
- Remove unused ArraySubset methods and add related convenience functions
- Add cities test
- Add `Arrow32` vlen encoding
- Add support for Interleave32 (Zarr V2) vlen encoding
- fmt
- clippy
- Set minimum version for num-complex
- Fix `ArrayBytes` from `&[u8; N]` for rust < 1.77
- Add `binary` data type
- Vlen improve docs and test various encodings. Fix `cities.csv` encoding.
- `vlen` change encoding names
- Validate `vlen` codec `length32` encoding against `zarr-python` v2
- Don't store `zarrs` metadata in cities test output
- Split `vlen` into `vlen` and `vlen_interleaved`
- Vlen supports separate index/data encoding with full codec chains.
- Fix typesize in vlen `index_codecs` metadata
- Add support for `String` fill value metadata
- Add `FillValueMetadata::Unsupported`. `ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.
- vlen cleanup
- Change vlen codec identifiers given they are experimental
- Move duplicate `extract_decoded_regions` fn into `array_bytes` + other minor changes
- Minor vlen_partial_decoder cleanup
- Add support for `zarr-python` nonconformant `|O` V2 data type
- Support conversion of Zarr V2 arrays with `vlen-*` codecs to V3
- Update root docs for new vlen related codecs/data types
- Cleanup `get_vlen_bytes_and_offsets`
LDeakin force-pushed the `variable_length_data_types` branch from 77e9e4e to 0c114bf (July 25, 2024 01:35)
Resolves #21.
This is a substantial change that adds support for variable length data types to `zarrs`.

There were some breaking changes necessary to support this:
- `ArrayBytes`, which can represent fixed or variable length bytes, rather than just a slice-like
- `Element[Owned]` traits, with better validation
- `RawBytes`
- `ArrayBytes` and `RawBytes`
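
To make the fixed versus variable length distinction concrete, here is a minimal sketch of a byte container with both forms and an offsets check. `SketchArrayBytes`, `from_strs`, and `validate` are hypothetical names chosen for illustration; they are not the `zarrs` definitions, and the offsets convention (`n + 1` offsets, starting at 0 and ending at the buffer length) is an assumption of this sketch.

```rust
/// Hypothetical stand-in for an array-bytes type. NOT the `zarrs` definition.
#[derive(Debug, PartialEq)]
enum SketchArrayBytes {
    /// Fixed-length elements: a flat buffer whose element size comes from the data type.
    Fixed(Vec<u8>),
    /// Variable-length elements: a flat buffer plus `n + 1` offsets into it (assumed convention).
    Variable { bytes: Vec<u8>, offsets: Vec<usize> },
}

impl SketchArrayBytes {
    /// Build the variable-length form by concatenating UTF-8 strings.
    fn from_strs(elements: &[&str]) -> Self {
        let mut bytes = Vec::new();
        let mut offsets = vec![0];
        for element in elements {
            bytes.extend_from_slice(element.as_bytes());
            offsets.push(bytes.len());
        }
        Self::Variable { bytes, offsets }
    }

    /// Check the buffer against an expected element count.
    /// `element_size` is only used by the fixed-length form.
    fn validate(&self, num_elements: usize, element_size: usize) -> Result<(), String> {
        match self {
            Self::Fixed(bytes) => {
                if bytes.len() == num_elements * element_size {
                    Ok(())
                } else {
                    Err(format!("expected {} bytes, got {}", num_elements * element_size, bytes.len()))
                }
            }
            Self::Variable { bytes, offsets } => {
                // Offsets must be non-decreasing, start at 0, and end at the buffer length.
                let monotonic = offsets.windows(2).all(|pair| pair[0] <= pair[1]);
                if offsets.len() == num_elements + 1
                    && offsets.first() == Some(&0)
                    && offsets.last() == Some(&bytes.len())
                    && monotonic
                {
                    Ok(())
                } else {
                    Err("invalid variable sized array offsets".to_string())
                }
            }
        }
    }
}
```

With this sketch, `SketchArrayBytes::from_strs(&["Paris", "", "Oslo"])` produces the bytes of `ParisOslo` with offsets `[0, 5, 5, 9]`, and `validate(3, 0)` accepts it while `validate(2, 0)` rejects it.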
Data types
Codecs
vlen
Based on zarr-developers/zeps#47 (comment).
Structure:
- `uint64` representing the size in bytes of the encoded index,
- `index_codecs`,
- `data_codecs`.

The encoded index size is necessary to support index compression and partial decoding. If this were not available, the index could not use a bytes-to-bytes compression codec, because a decoder could not tell where the encoded index ends and the data begins. A bytes-to-bytes compression codec could follow `vlen`, but then "data" is potentially running through a compression codec twice.

`vlen_v2`
This matches Zarr V2 style interleaved encoding, which is implemented by the numcodecs `vlen-utf8`, `vlen-bytes`, and `vlen-array` codecs. These are all essentially the same codec, with data type-dependent behaviour. It makes sense to standardise a single codec for Zarr V3 to support Zarr V2 `vlen-utf8/bytes/array` encoded data without reencoding chunks.

Encoding Efficiency (32-bit index)
Sum of chunk sizes (in bytes) on "city" column of zarr-developers/zarr-python#2036 (comment).
https://github.com/LDeakin/zarrs/blob/variable_length_data_types/tests/cities.rs.
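
The two layouts described above can be sketched side by side. This is an illustration only, not the `zarrs` implementation: the function names are hypothetical, the `vlen` sketch assumes identity `index_codecs` and `data_codecs` (no compression), little-endian integers, and a 32-bit index of `n + 1` offsets, and the interleaved sketch assumes a 32-bit element count header followed by a 32-bit length before each element.

```rust
use std::convert::TryFrom;

/// Sketch of the `vlen` layout: u64 encoded-index size, then the index, then the data.
/// Assumes identity codecs and a little-endian u32 offsets index with n + 1 entries.
fn encode_vlen_sketch(elements: &[&[u8]]) -> Vec<u8> {
    let mut index = Vec::with_capacity((elements.len() + 1) * 4);
    let mut data = Vec::new();
    let mut offset: u32 = 0;
    index.extend_from_slice(&offset.to_le_bytes());
    for element in elements {
        data.extend_from_slice(element);
        offset += element.len() as u32; // assumes the total data size fits in a u32
        index.extend_from_slice(&offset.to_le_bytes());
    }
    let mut out = Vec::with_capacity(8 + index.len() + data.len());
    out.extend_from_slice(&(index.len() as u64).to_le_bytes());
    out.extend_from_slice(&index);
    out.extend_from_slice(&data);
    out
}

/// Split an encoded chunk into (encoded index, encoded data) using the u64 size header.
/// This is what lets a decoder read or decompress the index without touching the data.
fn split_vlen_sketch(encoded: &[u8]) -> Option<(&[u8], &[u8])> {
    let mut header = [0u8; 8];
    header.copy_from_slice(encoded.get(..8)?);
    // `try_from` guards against index sizes that do not fit in `usize` on 32-bit systems.
    let index_len = usize::try_from(u64::from_le_bytes(header)).ok()?;
    let rest = &encoded[8..];
    if index_len > rest.len() {
        return None;
    }
    Some(rest.split_at(index_len))
}

/// Sketch of Zarr V2 style interleaved encoding:
/// u32 element count, then for each element a u32 length followed by its bytes.
fn encode_interleaved_sketch(elements: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(elements.len() as u32).to_le_bytes());
    for element in elements {
        out.extend_from_slice(&(element.len() as u32).to_le_bytes());
        out.extend_from_slice(element);
    }
    out
}
```

Under these assumptions the two outputs differ by a constant 8 bytes before any compression is applied, since each stores one 32-bit value per element plus a small header. The sketch says nothing about the compressed sizes reported for the cities test.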