Variable length data types #40

Merged
merged 6 commits into from
Jul 25, 2024
Conversation

LDeakin
Owner

@LDeakin LDeakin commented Jul 17, 2024

Resolves #21.

This is a substantial change that adds support for variable length data types to zarrs.
There were some breaking changes necessary to support this:

  • Array store/retrieve "bytes" methods now take/return ArrayBytes, which can represent fixed or variable length bytes, rather than just a slice-like type
  • Array store/retrieve "elements" variants use new Element[Owned] traits, with better validation
  • Encoded bytes are aliased to RawBytes
  • Codec traits have had some changes to accommodate the distinction between ArrayBytes and RawBytes
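To illustrate the fixed/variable distinction in the list above, here is a minimal std-only sketch. `ArrayBytesSketch`, `from_strs` and the exact `validate` rules are hypothetical stand-ins chosen for illustration; they are not the zarrs definitions, and the real validation may differ.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the PR's `ArrayBytes`; not the zarrs definition.
#[derive(Debug, Clone, PartialEq)]
enum ArrayBytesSketch<'a> {
    /// Fixed length elements: every element has the same size in bytes.
    Fixed(Cow<'a, [u8]>),
    /// Variable length elements: element `i` is `bytes[offsets[i]..offsets[i + 1]]`.
    Variable(Cow<'a, [u8]>, Cow<'a, [usize]>),
}

impl<'a> ArrayBytesSketch<'a> {
    /// Pack string elements into contiguous bytes plus `n + 1` offsets.
    fn from_strs(elements: &[&str]) -> Self {
        let mut bytes = Vec::new();
        let mut offsets = vec![0usize];
        for s in elements {
            bytes.extend_from_slice(s.as_bytes());
            offsets.push(bytes.len());
        }
        Self::Variable(Cow::Owned(bytes), Cow::Owned(offsets))
    }

    /// Check that the bytes are consistent with `num_elements` elements.
    /// `fixed_size` is the element size in bytes and is only used for `Fixed`.
    /// Assumed rules for `Variable`: offsets start at 0, never decrease,
    /// and end at the length of the bytes.
    fn validate(&self, num_elements: usize, fixed_size: usize) -> bool {
        match self {
            Self::Fixed(bytes) => bytes.len() == num_elements * fixed_size,
            Self::Variable(bytes, offsets) => {
                offsets.len() == num_elements + 1
                    && offsets.first() == Some(&0)
                    && offsets.windows(2).all(|w| w[0] <= w[1])
                    && offsets.last() == Some(&bytes.len())
            }
        }
    }
}

fn main() {
    let strings = ArrayBytesSketch::from_strs(&["Oslo", "", "Zürich"]);
    println!("{:?} valid: {}", strings, strings.validate(3, 0));

    let raw = [0u8; 8];
    let fixed = ArrayBytesSketch::Fixed(Cow::Borrowed(&raw[..]));
    println!("fixed valid as 2 x 4-byte elements: {}", fixed.validate(2, 4));
}
```

A slice alone cannot say where one string ends and the next begins, which is why the variable case carries offsets alongside the bytes.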

Data types

  • String (utf-8)
  • Binary

Codecs

vlen

{
  "name": "vlen",
  "configuration": {
    "data_codecs": [{"name": "bytes"},{"name": "blosc","configuration": {"cname": "zstd", "clevel":5,"shuffle": "bitshuffle", "typesize":1,"blocksize":0}}],
    "index_codecs": [{"name": "bytes","configuration": { "endian": "little" }},{"name": "blosc","configuration":{"cname": "zstd", "clevel":5,"shuffle": "shuffle", "typesize":4,"blocksize":0}}],
    "index_data_type": "uint32"
  }
}

Based on zarr-developers/zeps#47 (comment).

Structure:

The encoded index size is necessary to support index compression and partial decoding. If it were not available, the index could not use a bytes-to-bytes compression codec. A bytes-to-bytes compression codec could follow vlen instead, but then "data" would potentially run through a compression codec twice.
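The role of the encoded index size can be sketched as follows. The layout used here (`[index length: u64 LE][encoded index][encoded data]`) and the helper names `frame`, `unframe` and `index_and_data` are assumptions for illustration, not the codec's implementation. The index and data codec chains are left out, so both parts are plain bytes.

```rust
/// Frame separately encoded index and data as:
/// [index length: u64 LE][encoded index][encoded data].
/// This layout is an assumption for illustration.
fn frame(encoded_index: &[u8], encoded_data: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + encoded_index.len() + encoded_data.len());
    out.extend_from_slice(&(encoded_index.len() as u64).to_le_bytes());
    out.extend_from_slice(encoded_index);
    out.extend_from_slice(encoded_data);
    out
}

/// Split a framed chunk back into its (still encoded) index and data parts.
/// Returns `None` if the chunk is too short for its declared index length.
fn unframe(chunk: &[u8]) -> Option<(&[u8], &[u8])> {
    let mut header = [0u8; 8];
    header.copy_from_slice(chunk.get(..8)?);
    let index_len = u64::from_le_bytes(header);
    let rest = &chunk[8..];
    if index_len > rest.len() as u64 {
        return None;
    }
    Some(rest.split_at(index_len as usize))
}

/// Build an index of `n + 1` u32 little-endian offsets and the concatenated data.
fn index_and_data(elements: &[&str]) -> (Vec<u8>, Vec<u8>) {
    let mut index = 0u32.to_le_bytes().to_vec();
    let mut data = Vec::new();
    for s in elements {
        data.extend_from_slice(s.as_bytes());
        index.extend_from_slice(&(data.len() as u32).to_le_bytes());
    }
    (index, data)
}

fn main() {
    let (index, data) = index_and_data(&["Perth", "Canberra"]);
    // In the real codec each part would first pass through its own codec chain.
    let chunk = frame(&index, &data);
    let (index_part, data_part) = unframe(&chunk).expect("well-formed chunk");
    println!("index: {} bytes, data: {} bytes", index_part.len(), data_part.len());
}
```

Because the length prefix is read before anything is decoded, the index can be compressed to an unpredictable size and still be located and decoded on its own, without touching the data.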

vlen_v2

{
  "name": "vlen_v2"
}

This matches Zarr V2 style interleaved encoding, which is implemented by numcodecs vlen-utf8, vlen-bytes, and vlen-array. These are all essentially the same codec, with data type-dependent behaviour. It makes sense to standardise a single codec for Zarr V3 to support Zarr V2 vlen-utf8/bytes/array encoded data without reencoding chunks.
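For comparison, a minimal sketch of the interleaved layout. It assumes the numcodecs format is a u32 little-endian item count followed by a u32 little-endian length before each item's bytes. The function names are illustrative and not part of zarrs or numcodecs.

```rust
/// Encode items as: [item count: u32 LE], then per item [length: u32 LE][bytes].
fn encode_interleaved(items: &[&[u8]]) -> Vec<u8> {
    let mut out = (items.len() as u32).to_le_bytes().to_vec();
    for item in items {
        out.extend_from_slice(&(item.len() as u32).to_le_bytes());
        out.extend_from_slice(item);
    }
    out
}

/// Read a u32 LE at `pos`, or `None` if fewer than 4 bytes remain.
fn read_u32(bytes: &[u8], pos: usize) -> Option<u32> {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(bytes.get(pos..pos.checked_add(4)?)?);
    Some(u32::from_le_bytes(buf))
}

/// Decode the interleaved layout, returning `None` on truncated input.
fn decode_interleaved(bytes: &[u8]) -> Option<Vec<Vec<u8>>> {
    let count = read_u32(bytes, 0)? as usize;
    let mut pos = 4;
    let mut items = Vec::new();
    for _ in 0..count {
        let len = read_u32(bytes, pos)? as usize;
        pos += 4;
        let end = pos.checked_add(len)?;
        items.push(bytes.get(pos..end)?.to_vec());
        pos = end;
    }
    Some(items)
}

fn main() {
    let cities: [&[u8]; 3] = [b"Tokyo", b"", b"Lima"];
    let encoded = encode_interleaved(&cities);
    let decoded = decode_interleaved(&encoded).expect("valid encoding");
    println!("{} bytes encoded, {} items decoded", encoded.len(), decoded.len());
}
```

With lengths interleaved between items, reaching item `i` means walking every preceding length, whereas a separate index allows the offsets to be read without scanning the data.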

Encoding Efficiency (32-bit index)

Sum of chunk sizes (in bytes) on "city" column of zarr-developers/zarr-python#2036 (comment).

https://github.com/LDeakin/zarrs/blob/variable_length_data_types/tests/cities.rs.

encoding  compression  size (bytes)
vlen_v2   none         642196
vlen_v2   zstd 5       362626
vlen      none         642580
vlen      zstd 5       346950
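The relative differences implied by these sizes can be worked out with a short calculation. The helper names `ratio` and `percent_smaller` are illustrative only; the input numbers are copied from the table.

```rust
/// Size of `compressed` as a fraction of `uncompressed`.
fn ratio(compressed: u64, uncompressed: u64) -> f64 {
    compressed as f64 / uncompressed as f64
}

/// Percentage by which `a` is smaller than `b`.
fn percent_smaller(a: u64, b: u64) -> f64 {
    100.0 * (1.0 - ratio(a, b))
}

fn main() {
    // Sizes in bytes, copied from the table.
    let (v2_raw, v2_zstd) = (642_196u64, 362_626u64);
    let (vlen_raw, vlen_zstd) = (642_580u64, 346_950u64);

    println!("uncompressed: vlen is {} bytes larger than vlen_v2", vlen_raw - v2_raw);
    println!("vlen_v2 zstd ratio: {:.3}", ratio(v2_zstd, v2_raw));
    println!("vlen zstd ratio: {:.3}", ratio(vlen_zstd, vlen_raw));
    println!(
        "compressed: vlen is {:.1}% smaller than vlen_v2",
        percent_smaller(vlen_zstd, v2_zstd)
    );
}
```

Uncompressed, vlen is 384 bytes larger than vlen_v2. With zstd level 5 it is roughly 4.3% smaller on this column.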


codecov bot commented Jul 17, 2024

Codecov Report

Attention: Patch coverage is 85.78135% with 424 lines in your changes missing coverage. Please review.

Project coverage is 81.33%. Comparing base (d54b89d) to head (71b9d78).

Files                                                  Patch %  Lines
src/array/element.rs                                   65.41%   46 Missing ⚠️
...rray_to_bytes/sharding/sharding_partial_decoder.rs  88.28%   39 Missing ⚠️
src/array/array_sync_sharded_readable_ext.rs           63.95%   31 Missing ⚠️
...en_interleaved/vlen_interleaved_partial_decoder.rs  53.03%   31 Missing ⚠️
src/array/codec/array_to_bytes/vlen.rs                 71.13%   28 Missing ⚠️
src/array/array_bytes.rs                               93.63%   25 Missing ⚠️
...o_bytes/vlen_interleaved/vlen_interleaved_codec.rs  78.72%   20 Missing ⚠️
src/array/array_representation.rs                      56.75%   16 Missing ⚠️
src/array/codec/array_to_bytes/vlen_interleaved.rs     68.00%   16 Missing ⚠️
src/array/array_async_readable_writable.rs             68.08%   15 Missing ⚠️
... and 27 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #40      +/-   ##
==========================================
+ Coverage   79.56%   81.33%   +1.76%     
==========================================
  Files         142      152      +10     
  Lines       19544    20837    +1293     
==========================================
+ Hits        15550    16947    +1397     
+ Misses       3994     3890     -104     


@LDeakin LDeakin force-pushed the variable_length_data_types branch 4 times, most recently from 79c61cc to ac7ebc7 Compare July 18, 2024 01:28
@LDeakin LDeakin force-pushed the variable_length_data_types branch 2 times, most recently from 5bc59ea to b3110ea Compare July 25, 2024 01:24
Data type sizes are now represented by `DataTypeSize` instead of usize

Also adds `ArraySize`.
Both are `Fixed` only for now. Need to support `Variable` throughout the codebase.

Change codec API in prep for variable sized data types

Enable `{Array,DataType}Size::Variable`

Implement `CowArrayBytes::validate()` and add `CodecError::InvalidVariableSizedArrayOffsets`

Use `CowArrayBytes::validate()`

impl `From` for `CowArrayBytes` for various types

Array `_element` methods now use `T: Element`

Add `vlen` codec metadata

Fix codecs bench

Implement an experimental vlen codec

Use `impl Into<ArrayBytesCow<'a>>` in array methods

Use `RawBytesCow` consistently

Remove various vlen todo's

Cleanup `ArrayBytes`

Use `ArrayError::InvalidElementValue` for invalid string encodings

Add `ArraySubset::contains()`

Add `FillValue::new_empty()`

Add remaining vlen support to array `store_` methods and improve vlen validation

Add remaining vlen support to array `retrieve_` methods

Partial decoding in the vlen filter

Fix async vlen errors

Sharding codec vlen support

Add vlen support to sharding partial decoder

vlen support for sharded_readable_ext

`offsets_u64_to_usize` handle 32-bit system

Minor FillValue doc update

Remove unused ArraySubset methods and add related convenience functions

Add cities test

Add `Arrow32` vlen encoding

Add support for Interleave32 (Zarr V2) vlen encoding

fmt

clippy

Set minimum version for num-complex

Fix `ArrayBytes` from `&[u8; N]` for rust < 1.77

Add `binary` data type

Vlen improve docs and test various encodings.

Fix `cities.csv` encoding.

`vlen` change encoding names

Validate `vlen` codec `length32` encoding against `zarr-python` v2

Don't store `zarrs` metadata in cities test output

Split `vlen` into `vlen` and `vlen_interleaved`

Vlen supports separate index/data encoding with full codec chains.

Fix typesize in vlen `index_codecs` metadata

Add support for `String` fill value metadata

Add `FillValueMetadata::Unsupported`

`ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.

vlen cleanup

Change vlen codec identifiers given they are experimental

Move duplicate `extract_decoded_regions` fn into `array_bytes`

+ other minor changes

Minor vlen_partial_decoder cleanup

Add support for `zarr-python` nonconformant `|O` V2 data type

Support conversion of Zarr V2 arrays with `vlen-*` codecs to V3

Update root docs for new vlen related codecs/data types

Cleanup `get_vlen_bytes_and_offsets`
@LDeakin LDeakin force-pushed the variable_length_data_types branch from 77e9e4e to 0c114bf Compare July 25, 2024 01:35
@LDeakin LDeakin marked this pull request as ready for review July 25, 2024 04:28
@LDeakin LDeakin merged commit 649abe1 into main Jul 25, 2024
18 checks passed
@LDeakin LDeakin deleted the variable_length_data_types branch July 25, 2024 04:29
Development

Successfully merging this pull request may close these issues.

How to read/write string Arrays?