Variable length data types (#40)

* Add support for variable length data types

  Data type sizes are now represented by `DataTypeSize` instead of `usize`
  Also adds `ArraySize`.
  Both are `Fixed` only for now. Need to support `Variable` throughout the codebase.
  Change codec API in prep for variable sized data types
  Enable `{Array,DataType}Size::Variable`
  Implement `CowArrayBytes::validate()` and add `CodecError::InvalidVariableSizedArrayOffsets`
  Use `CowArrayBytes::validate()`
  impl `From` for `CowArrayBytes` for various types
  Array `_element` methods now use `T: Element`
  Add `vlen` codec metadata
  Fix codecs bench
  Implement an experimental vlen codec
  Use `impl Into<ArrayBytesCow<'a>>` in array methods
  Use `RawBytesCow` consistently
  Remove various vlen todo's
  Cleanup `ArrayBytes`
  Use `ArrayError::InvalidElementValue` for invalid string encodings
  Add `ArraySubset::contains()`
  Add `FillValue::new_empty()`
  Add remaining vlen support to array `store_` methods and improve vlen validation
  Add remaining vlen support to array `retrieve_` methods
  Partial decoding in the vlen filter
  Fix async vlen errors
  Sharding codec vlen support
  Add vlen support to sharding partial decoder
  vlen support for sharded_readable_ext
  `offsets_u64_to_usize` handle 32-bit system
  Minor FillValue doc update
  Remove unused ArraySubset methods and add related convenience functions
  Add cities test
  Add `Arrow32` vlen encoding
  Add support for Interleave32 (Zarr V2) vlen encoding
  fmt
  clippy
  Set minimum version for num-complex
  Fix `ArrayBytes` from `&[u8; N]` for rust < 1.77
  Add `binary` data type
  Vlen improve docs and test various encodings. Fix `cities.csv` encoding.
  `vlen` change encoding names
  Validate `vlen` codec `length32` encoding against `zarr-python` v2
  Don't store `zarrs` metadata in cities test output
  Split `vlen` into `vlen` and `vlen_interleaved`
  Vlen supports separate index/data encoding with full codec chains.
  Fix typesize in vlen `index_codecs` metadata
  Add support for `String` fill value metadata
  Add `FillValueMetadata::Unsupported`
  `ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.
  vlen cleanup
  Change vlen codec identifiers given they are experimental
  Move duplicate `extract_decoded_regions` fn into `array_bytes` + other minor changes
  Minor vlen_partial_decoder cleanup
  Add support for `zarr-python` nonconformant `|O` V2 data type
  Support conversion of Zarr V2 arrays with `vlen-*` codecs to V3
  Update root docs for new vlen related codecs/data types
  Cleanup `get_vlen_bytes_and_offsets`

* Fix store value truncation
* Add `ArraySize::new`, fix `ArrayBytes::new_fill_value`, fix `FillValue::equals_all` with vlen data
* Add `array_write_read_string` example
* Rename `vlen_interleaved` to `vlen_v2`
* Fmt pass
LDeakin · Jul 25, 2024 · 649abe1
1 parent d54b89d
commit 649abe1
Showing 183 changed files with 53,233 additions and 2,208 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,19 +7,53 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+ - Add `ArrayBytes`, `RawBytes`, `RawBytesOffsets`, and `ArrayBytesError`
+    - These can represent array data with fixed and variable length data types
+ - Add `array::Element[Owned]` traits representing array elements
+    - Supports conversion to and from `ArrayBytes`
+ - Add `array::ElementFixedLength` marker trait
+ - Add experimental `vlen` and `vlen_v2` codecs for variable length data types
+    - `vlen_v2` is for legacy support of Zarr V2 `vlen-utf8`/`vlen-bytes`/`vlen-array` codecs
+ - Add `DataType::{String,Binary}` data types
+    - These are likely to become standardised in the future and are not feature gated
+ - Add `ArraySubset::contains()`
+ - Add `FillValueMetadata::{String,Unsupported}`
+    - `ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.
+ - Implement `From<{[u8; N],&[u8; N],String,&str}>` for `FillValue`
+ - Add `ArraySize` and `DataTypeSize`
+ - Add `DataType::fixed_size()` that returns `Option<usize>`. Returns `None` for variable length data types.
+ - Add `ArrayError::IncompatibleElementType` (replaces `ArrayError::IncompatibleElementSize`)
+ - Add `ArrayError::InvalidElementValue`
+ - Add `ChunkShape::num_elements_u64`
+
 ### Changed
  - Use `[async_]retrieve_array_subset_opt` internally in `Array::[async_]retrieve_chunks_opt`
  - **Breaking**: Replace `[Async]ArrayPartialDecoderTraits::element_size()` with `data_type()`
+ - Array `_store` methods now use `impl Into<ArrayBytes<'a>>` instead of `&[u8]` for the input bytes
+ - **Breaking**: Array `_store_{elements,ndarray}` methods now use `T: Element` instead of `T: bytemuck::Pod`
+ - **Breaking**: Array `_retrieve_{elements,ndarray}` methods now use `T: ElementOwned` instead of `T: bytemuck::Pod`
+ - Optimised `Array::[async_]store_array_subset_opt` when the subset is a subset of a single chunk
+ - Make `transmute_to_bytes` public
+ - Relax `ndarray_into_vec` from `T: bytemuck::Pod` to `T: Clone`
+ - **Breaking**: `DataType::size()` now returns a `DataTypeSize` instead of `usize`
+ - **Breaking**: `ArrayCodecTraits::{encode/decode}` have been specialised into `ArrayTo{Array,Bytes}CodecTraits::{encode/decode}`
 
 ### Removed
  - **Breaking**: Remove `into_array_view` array and codec API
    - This was not fully utilised, not applicable to variable sized data types, and quite unsafe for a public API
- - Remove internal `ChunksPerShardError` and just use `CodecError::Other`
+ - **Breaking**: Remove internal `ChunksPerShardError` and just use `CodecError::Other`
+ - **Breaking**: Remove `array_subset::{ArrayExtractBytesError,ArrayStoreBytesError}`
+ - **Breaking**: Remove `ArraySubset::{extract,store}_bytes[_unchecked]`, they are replaced by methods in `ArrayBytes`
+ - **Breaking**: Remove `array::validate_element_size` and `ArrayError::IncompatibleElementSize`
+    - The internal validation in array `_element` methods is now more strict than just matching the element size
+    - Example: `u16` must match `uint16` data type and will not match `int16` or `float16`
 
 ### Fixed
  - Fix an unnecessary copy in `async_store_set_partial_values`
  - Fix error when `bytes` metadata is encoded without a configuration, even if empty
  - Fix an error in `ChunkGrid` docs
+ - Fix `[async_]store_set_partial_values` and `MemoryStore::set` to correctly truncate the bytes of a store value if it shrinks
 
 ## [0.15.1] - 2024-07-11
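
The stricter element validation noted in the changelog entries above (`u16` matches `uint16` but not `int16` or `float16`) can be illustrated with a minimal sketch. The `DataType` enum, the `Element` trait, and the error type here are simplified, hypothetical stand-ins written for illustration; they are not the crate's actual definitions, and the little-endian byte order is an assumption of the sketch.

```rust
/// Hypothetical, reduced stand-in for a Zarr data type enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    Int16,
    UInt16,
    Float16,
}

/// Hypothetical error for an element type that does not match the data type.
#[derive(Debug, PartialEq, Eq)]
pub struct IncompatibleElementType;

/// Hypothetical element trait: each Rust type names the one data type it matches.
pub trait Element: Sized {
    const DATA_TYPE: DataType;

    /// Little-endian bytes of this element (byte order is an assumption here).
    fn to_le_bytes_vec(&self) -> Vec<u8>;

    /// Convert elements to bytes, rejecting a data type mismatch even when
    /// the element sizes happen to agree.
    fn into_bytes(
        data_type: DataType,
        elements: &[Self],
    ) -> Result<Vec<u8>, IncompatibleElementType> {
        if data_type != Self::DATA_TYPE {
            return Err(IncompatibleElementType);
        }
        Ok(elements.iter().flat_map(|e| e.to_le_bytes_vec()).collect())
    }
}

impl Element for u16 {
    const DATA_TYPE: DataType = DataType::UInt16;

    fn to_le_bytes_vec(&self) -> Vec<u8> {
        self.to_le_bytes().to_vec()
    }
}
```

With this sketch, `u16::into_bytes(DataType::UInt16, &[1, 258])` succeeds, while passing `DataType::Int16` or `DataType::Float16` fails even though all three types are two bytes wide. A size-only check would have accepted all three.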
 

diff --git a/Cargo.toml b/Cargo.toml
@@ -43,7 +43,7 @@ async-lock = { version = "3.2.0", optional = true }
 async-recursion = { version = "1.0.5", optional = true }
 async-trait = { version = "0.1.74", optional = true }
 blosc-sys = { version = "0.3.4", package = "blosc-src", features = ["snappy", "lz4", "zlib", "zstd"], optional = true }
-bytemuck = { version = "1.14.0", features = ["extern_crate_alloc", "must_cast"] }
+bytemuck = { version = "1.14.0", features = ["extern_crate_alloc", "must_cast", "min_const_generics"] }
 bytes = "1.6.0"
 bzip2 = { version = "0.4.4", optional = true, features = ["static"] }
 crc32c = { version = "0.6.5", optional = true }
@@ -75,6 +75,10 @@ zfp-sys = {version = "0.1.15", features = ["static"], optional = true }
 zip = { version = "2.1.3", optional = true }
 zstd = { version = "0.13.1", features = ["zstdmt"], optional = true }
 
+[dependencies.num-complex]
+version = "0.4.3"
+features = ["bytemuck"]
+
 [dev-dependencies]
 chrono = "0.4"
 criterion = "0.5.1"
@@ -93,6 +97,11 @@ name = "array_write_read_ndarray"
 required-features = ["ndarray"]
 doc-scrape-examples = true
 
+[[example]]
+name = "array_write_read_string"
+required-features = ["ndarray"]
+doc-scrape-examples = true
+
 [[example]]
 name = "async_array_write_read"
 required-features = ["ndarray", "async", "object_store"]

diff --git a/benches/codecs.rs b/benches/codecs.rs
@@ -8,9 +8,10 @@ use zarrs::array::{
     codec::{
         array_to_bytes::bytes::Endianness,
         bytes_to_bytes::blosc::{BloscCompressor, BloscShuffleMode},
-        ArrayCodecTraits, BloscCodec, BytesCodec, BytesToBytesCodecTraits, CodecOptions,
+        ArrayCodecTraits, ArrayToBytesCodecTraits, BloscCodec, BytesCodec, BytesToBytesCodecTraits,
+        CodecOptions,
     },
-    BytesRepresentation, ChunkRepresentation, DataType,
+    BytesRepresentation, ChunkRepresentation, DataType, Element,
 };
 
 fn codec_bytes(c: &mut Criterion) {
@@ -35,12 +36,13 @@ fn codec_bytes(c: &mut Criterion) {
         .unwrap();
 
         let data = vec![0u8; size3.try_into().unwrap()];
+        let bytes = Element::into_array_bytes(&DataType::UInt8, &data).unwrap();
         group.throughput(Throughput::Bytes(size3));
         // encode and decode have the same implementation
         group.bench_function(BenchmarkId::new("encode_decode", size3), |b| {
             b.iter(|| {
                 codec
-                    .encode(Cow::Borrowed(&data), &rep, &CodecOptions::default())
+                    .encode(bytes.clone(), &rep, &CodecOptions::default())
                     .unwrap()
             });
         });

diff --git a/doc/status/codecs.md b/doc/status/codecs.md
@@ -1,16 +1,18 @@
-| Codec Type     | Codec<sup>†</sup>         | ZEP               | V3      | V2      | Feature Flag* |
-| -------------- | ------------------------- | ----------------- | ------- | ------- | ------------- |
-| Array to Array | [transpose]               | [ZEP0001]         | &check; |         | **transpose** |
-|                | [bitround] (experimental) |                   | &check; |         | bitround      |
-| Array to Bytes | [bytes]                   | [ZEP0001]         | &check; |         |               |
-|                | [sharding_indexed]        | [ZEP0002]         | &check; |         | **sharding**  |
-|                | [zfp] (experimental)      |                   | &check; |         | zfp           |
-|                | [pcodec] (experimental)   |                   | &check; |         | pcodec        |
-| Bytes to Bytes | [blosc]                   | [ZEP0001]         | &check; | &check; | **blosc**     |
-|                | [gzip]                    | [ZEP0001]         | &check; | &check; | **gzip**      |
-|                | [crc32c]                  | [ZEP0002]         | &check; |         | **crc32c**    |
-|                | [zstd]                    | [zarr-specs #256] | &check; |         | zstd          |
-|                | [bz2] (experimental)      |                   | &check; | &check; | bz2           |
+| Codec Type     | Codec<sup>†</sup>                                 | ZEP               | V3      | V2      | Feature Flag* |
+| -------------- | ------------------------------------------------- | ----------------- | ------- | ------- | ------------- |
+| Array to Array | [transpose]                                       | [ZEP0001]         | &check; |         | **transpose** |
+|                | [bitround] (experimental)                         |                   | &check; |         | bitround      |
+| Array to Bytes | [bytes]                                           | [ZEP0001]         | &check; |         |               |
+|                | [sharding_indexed]                                | [ZEP0002]         | &check; |         | **sharding**  |
+|                | [zfp] (experimental)                              |                   | &check; |         | zfp           |
+|                | [pcodec] (experimental)                           |                   | &check; |         | pcodec        |
+|                | [vlen] (experimental)                             |                   | &check; |         |               |
+|                | [vlen_v2] (experimental)<br>`vlen-*` in Zarr V2   |                   | &check; | &check; |               |
+| Bytes to Bytes | [blosc]                                           | [ZEP0001]         | &check; | &check; | **blosc**     |
+|                | [gzip]                                            | [ZEP0001]         | &check; | &check; | **gzip**      |
+|                | [crc32c]                                          | [ZEP0002]         | &check; |         | **crc32c**    |
+|                | [zstd]                                            | [zarr-specs #256] | &check; |         | zstd          |
+|                | [bz2] (experimental)                              |                   | &check; | &check; | bz2           |
 
 <sup>\* Bolded feature flags are part of the default set of features.</sup>
 <br>
@@ -31,12 +33,16 @@
 [crc32c]: crate::array::codec::bytes_to_bytes::crc32c
 [zstd]: crate::array::codec::bytes_to_bytes::zstd
 [bz2]: crate::array::codec::bytes_to_bytes::bz2
+[vlen]: crate::array::codec::array_to_bytes::vlen
+[vlen_v2]: crate::array::codec::array_to_bytes::vlen_v2
 
 The `"name"` of experimental codecs in array metadata links to the codec documentation in this crate.
 
-| Experimental Codec | Name / URI                                        |
-| ------------------ | ------------------------------------------------- |
-| `bitround`         | <https://codec.zarrs.dev/array_to_array/bitround> |
-| `zfp`              | <https://codec.zarrs.dev/array_to_bytes/zfp>      |
-| `pcodec`           | <https://codec.zarrs.dev/array_to_bytes/pcodec>   |
-| `bz2`              | <https://codec.zarrs.dev/bytes_to_bytes/bz2>      |
+| Experimental Codec | Name / URI                                               |
+| ------------------ | -------------------------------------------------------- |
+| `bitround`         | <https://codec.zarrs.dev/array_to_array/bitround>        |
+| `zfp`              | <https://codec.zarrs.dev/array_to_bytes/zfp>             |
+| `pcodec`           | <https://codec.zarrs.dev/array_to_bytes/pcodec>          |
+| `bz2`              | <https://codec.zarrs.dev/bytes_to_bytes/bz2>             |
+| `vlen`             | <https://codec.zarrs.dev/array_to_bytes/vlen>            |
+| `vlen_v2`          | <https://codec.zarrs.dev/array_to_bytes/vlen_v2>         |
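
For readers unfamiliar with the legacy Zarr V2 `vlen-*` codecs that `vlen_v2` targets, the following is a rough sketch of an interleaved length-prefixed layout. It assumes the layout commonly described for `vlen-utf8`: a little-endian `u32` item count, then for each item a little-endian `u32` length followed by the item's bytes. This diff does not confirm that layout, and the function names are hypothetical, so treat this as an illustration rather than the codec's implementation.

```rust
/// Encode strings as: item count (u32 LE), then per item a u32 LE length
/// followed by the item's UTF-8 bytes. Assumes counts and lengths fit in u32.
pub fn encode_interleaved(items: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(items.len() as u32).to_le_bytes());
    for item in items {
        out.extend_from_slice(&(item.len() as u32).to_le_bytes());
        out.extend_from_slice(item.as_bytes());
    }
    out
}

/// Decode the layout above, returning `None` on truncated input or invalid UTF-8.
pub fn decode_interleaved(bytes: &[u8]) -> Option<Vec<String>> {
    // Read one little-endian u32 at `pos` and advance `pos` past it.
    fn read_u32(bytes: &[u8], pos: &mut usize) -> Option<usize> {
        let end = pos.checked_add(4)?;
        let s = bytes.get(*pos..end)?;
        *pos = end;
        Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]) as usize)
    }

    let mut pos = 0;
    let count = read_u32(bytes, &mut pos)?;
    let mut items = Vec::new();
    for _ in 0..count {
        let len = read_u32(bytes, &mut pos)?;
        let end = pos.checked_add(len)?;
        items.push(String::from_utf8(bytes.get(pos..end)?.to_vec()).ok()?);
        pos = end;
    }
    Some(items)
}
```

Because lengths are interleaved with the data, locating element `i` requires walking all earlier elements. That is one plausible motivation for an encoding such as `vlen` that keeps the index separate from the data.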
diff --git a/doc/status/data_types.md b/doc/status/data_types.md
@@ -1,8 +1,12 @@
-| Data Type | ZEP | V3 | V2 | Feature Flag |
+| Data Type<sup>†</sup> | ZEP | V3 | V2 | Feature Flag |
 | --------- | --- | ----- | -- | ------------ |
 | [bool]<br>[int8] [int16] [int32] [int64] [uint8] [uint16] [uint32] [uint64]<br>[float16] [float32] [float64]<br>[complex64] [complex128] | [ZEP0001] | &check; | &check; | |
 | [r* (raw bits)] | [ZEP0001] | &check; | | |
 | [bfloat16] | [zarr-specs #130] | &check; | | |
+| [string] (experimental) | [ZEP0007 (draft)] | &check; | | |
+| [binary] (experimental) | [ZEP0007 (draft)] | &check; | | |
+
+<sup>† Experimental data types are recommended for evaluation only.</sup>
 
 [bool]: crate::array::data_type::DataType::Bool
 [int8]: crate::array::data_type::DataType::Int8
@@ -20,6 +24,9 @@
 [complex128]: crate::array::data_type::DataType::Complex128
 [bfloat16]: crate::array::data_type::DataType::BFloat16
 [r* (raw bits)]: crate::array::data_type::DataType::RawBits
+[string]: crate::array::data_type::DataType::String
+[binary]: crate::array::data_type::DataType::Binary
 
 [ZEP0001]: https://zarr.dev/zeps/accepted/ZEP0001.html
 [zarr-specs #130]: https://github.com/zarr-developers/zarr-specs/issues/130
+[ZEP0007 (draft)]: https://github.com/zarr-developers/zeps/pull/47
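
The new `string` and `binary` data types have no fixed element size, which is why the changelog describes `DataType::size()` returning a `DataTypeSize` and `DataType::fixed_size()` returning `Option<usize>`. A minimal sketch of that idea follows. The enums are reduced, hypothetical mirrors of the names in the changelog and `total_bytes` is an illustrative helper that does not come from the diff.

```rust
/// Hypothetical mirror of a data type size that is either fixed or variable.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataTypeSize {
    Fixed(usize),
    Variable,
}

/// Reduced, hypothetical data type enum with two fixed and two variable types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    UInt8,
    Float64,
    String,
    Binary,
}

impl DataType {
    /// Size of one element, or `Variable` if elements differ in length.
    pub fn size(self) -> DataTypeSize {
        match self {
            DataType::UInt8 => DataTypeSize::Fixed(1),
            DataType::Float64 => DataTypeSize::Fixed(8),
            DataType::String | DataType::Binary => DataTypeSize::Variable,
        }
    }

    /// `Some(bytes)` for fixed-size types, `None` for variable length types.
    pub fn fixed_size(self) -> Option<usize> {
        match self.size() {
            DataTypeSize::Fixed(size) => Some(size),
            DataTypeSize::Variable => None,
        }
    }
}

/// Illustrative helper: the byte size of `num_elements` elements is only
/// known up front for fixed-size data types.
pub fn total_bytes(data_type: DataType, num_elements: usize) -> Option<usize> {
    data_type.fixed_size()?.checked_mul(num_elements)
}
```

Callers that previously multiplied an element size by an element count now have to handle the `None` / `Variable` case explicitly, which is the practical effect of the breaking change to `DataType::size()`.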
diff --git a/examples/array_write_read_string.rs b/examples/array_write_read_string.rs
@@ -0,0 +1,119 @@
+use itertools::Itertools;
+use ndarray::{array, Array2, ArrayD};
+use zarrs::storage::{
+    storage_transformer::{StorageTransformerExtension, UsageLogStorageTransformer},
+    ReadableWritableListableStorage,
+};
+
+fn array_write_read() -> Result<(), Box<dyn std::error::Error>> {
+    use std::sync::Arc;
+    use zarrs::{
+        array::{DataType, FillValue},
+        array_subset::ArraySubset,
+        storage::store,
+    };
+
+    // Create a store
+    // let path = tempfile::TempDir::new()?;
+    // let mut store: ReadableWritableListableStorage = Arc::new(store::FilesystemStore::new(path.path())?);
+    // let mut store: ReadableWritableListableStorage = Arc::new(store::FilesystemStore::new(
+    //     "tests/data/array_write_read.zarr",
+    // )?);
+    let mut store: ReadableWritableListableStorage = Arc::new(store::MemoryStore::new());
+    if let Some(arg1) = std::env::args().collect::<Vec<_>>().get(1) {
+        if arg1 == "--usage-log" {
+            let log_writer = Arc::new(std::sync::Mutex::new(
+                // std::io::BufWriter::new(
+                std::io::stdout(),
+                //    )
+            ));
+            let usage_log = Arc::new(UsageLogStorageTransformer::new(log_writer, || {
+                chrono::Utc::now().format("[%T%.3f] ").to_string()
+            }));
+            store = usage_log
+                .clone()
+                .create_readable_writable_listable_transformer(store);
+        }
+    }
+
+    // Create a group
+    let group_path = "/group";
+    let mut group = zarrs::group::GroupBuilder::new().build(store.clone(), group_path)?;
+
+    // Update group metadata
+    group
+        .attributes_mut()
+        .insert("foo".into(), serde_json::Value::String("bar".into()));
+
+    // Write group metadata to store
+    group.store_metadata()?;
+
+    println!(
+        "The group metadata is:\n{}\n",
+        serde_json::to_string_pretty(&group.metadata()).unwrap()
+    );
+
+    // Create an array
+    let array_path = "/group/array";
+    let array = zarrs::array::ArrayBuilder::new(
+        vec![4, 4], // array shape
+        DataType::String,
+        vec![2, 2].try_into()?, // regular chunk shape
+        FillValue::from("_"),
+    )
+    // .bytes_to_bytes_codecs(vec![]) // uncompressed
+    .dimension_names(["y", "x"].into())
+    // .storage_transformers(vec![].into())
+    .build(store.clone(), array_path)?;
+
+    // Write array metadata to store
+    array.store_metadata()?;
+
+    println!(
+        "The array metadata is:\n{}\n",
+        serde_json::to_string_pretty(&array.metadata()).unwrap()
+    );
+
+    // Write some chunks
+    array.store_chunk_ndarray(
+        &[0, 0],
+        ArrayD::<&str>::from_shape_vec(vec![2, 2], vec!["a", "bb", "ccc", "dddd"]).unwrap(),
+    )?;
+    array.store_chunk_ndarray(
+        &[0, 1],
+        ArrayD::<&str>::from_shape_vec(vec![2, 2], vec!["4444", "333", "22", "1"]).unwrap(),
+    )?;
+    let subset_all = ArraySubset::new_with_shape(array.shape().to_vec());
+    let data_all = array.retrieve_array_subset_ndarray::<String>(&subset_all)?;
+    println!("store_chunk [0, 0] and [0, 1]:\n{data_all}\n");
+
+    // Write a subset spanning multiple chunks, including updating chunks already written
+    let ndarray_subset: Array2<&str> = array![["!", "@@"], ["###", "$$$$"]];
+    array.store_array_subset_ndarray(
+        ArraySubset::new_with_ranges(&[1..3, 1..3]).start(),
+        ndarray_subset,
+    )?;
+    let data_all = array.retrieve_array_subset_ndarray::<String>(&subset_all)?;
+    println!("store_array_subset [1..3, 1..3]:\nndarray::ArrayD<String>\n{data_all}");
+
+    // Retrieve bytes directly, convert into a single string allocation, create a &str ndarray
+    // TODO: Add a convenience function for this?
+    let data_all = array.retrieve_array_subset(&subset_all)?;
+    let (bytes, offsets) = data_all.into_variable()?;
+    let string = String::from_utf8(bytes.into_owned())?;
+    let elements = offsets
+        .iter()
+        .tuple_windows()
+        .map(|(&curr, &next)| &string[curr..next])
+        .collect::<Vec<&str>>();
+    let ndarray = ArrayD::<&str>::from_shape_vec(subset_all.shape_usize(), elements)?;
+    println!("ndarray::ArrayD<&str>:\n{ndarray}");
+
+    Ok(())
+}
+
+fn main() {
+    if let Err(err) = array_write_read() {
+        println!("{:?}", err);
+    }
+}
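
The last part of the example above splits one contiguous byte buffer into elements using an offsets array. The same idea can be sketched with the standard library only, including the kind of offsets validation that the commit message associates with `CodecError::InvalidVariableSizedArrayOffsets`. The function names and the exact validation rules here (start at 0, non-decreasing, end at the buffer length) are assumptions of this sketch, not the crate's definitions.

```rust
/// Check that offsets start at 0, never decrease, and end at `bytes.len()`.
/// An array of N elements is described by N + 1 offsets.
pub fn validate_offsets(bytes: &[u8], offsets: &[usize]) -> bool {
    offsets.first() == Some(&0)
        && offsets.last() == Some(&bytes.len())
        && offsets.windows(2).all(|w| w[0] <= w[1])
}

/// Borrow each element as a `&str` from one contiguous UTF-8 buffer.
/// Returns `None` for invalid offsets or offsets that split a character.
pub fn split_elements<'a>(text: &'a str, offsets: &[usize]) -> Option<Vec<&'a str>> {
    if !validate_offsets(text.as_bytes(), offsets) {
        return None;
    }
    offsets
        .windows(2)
        .map(|w| text.get(w[0]..w[1]))
        .collect()
}
```

For the buffer `"abbccc"` and offsets `[0, 1, 3, 6]` this yields `["a", "bb", "ccc"]` without copying any string data, which mirrors the single-allocation approach taken in the example.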
diff --git a/examples/sharded_array_write_read.rs b/examples/sharded_array_write_read.rs
@@ -137,15 +137,15 @@ fn sharded_array_write_read() -> Result<(), Box<dyn std::error::Error>> {
         ArraySubset::new_with_start_shape(vec![0, 4], inner_chunk_shape.clone())?,
     ];
     let decoded_inner_chunks_bytes = partial_decoder.partial_decode(&inner_chunks_to_decode)?;
-    let decoded_inner_chunks_ndarray = decoded_inner_chunks_bytes
-        .into_iter()
-        .map(|bytes| bytes_to_ndarray::<u16>(&inner_chunk_shape, bytes.to_vec()))
-        .collect::<Result<Vec<_>, _>>()?;
     println!("Decoded inner chunks:");
     for (inner_chunk_subset, decoded_inner_chunk) in
-        std::iter::zip(inner_chunks_to_decode, decoded_inner_chunks_ndarray)
+        std::iter::zip(inner_chunks_to_decode, decoded_inner_chunks_bytes)
     {
-        println!("{inner_chunk_subset}\n{decoded_inner_chunk}\n");
+        let ndarray = bytes_to_ndarray::<u16>(
+            &inner_chunk_shape,
+            decoded_inner_chunk.into_fixed()?.into_owned(),
+        )?;
+        println!("{inner_chunk_subset}\n{ndarray}\n");
     }
 
     // Show the hierarchy
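
The `into_fixed()?` call in the hunk above, like `into_variable()?` in the string example, suggests that decoded bytes are now a two-variant value rather than a plain byte slice. A simplified sketch of such a type follows. `ArrayBytesSketch` and `WrongVariant` are hypothetical names, and the real `ArrayBytes` type may differ in its fields and error type.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for array bytes that are either fixed length
/// (bytes only) or variable length (bytes plus element offsets).
#[derive(Debug, PartialEq)]
pub enum ArrayBytesSketch<'a> {
    Fixed(Cow<'a, [u8]>),
    Variable(Cow<'a, [u8]>, Cow<'a, [usize]>),
}

/// Hypothetical error returned when the other variant was expected.
#[derive(Debug, PartialEq, Eq)]
pub struct WrongVariant;

impl<'a> ArrayBytesSketch<'a> {
    /// Unwrap fixed length bytes, failing for variable length data.
    pub fn into_fixed(self) -> Result<Cow<'a, [u8]>, WrongVariant> {
        match self {
            ArrayBytesSketch::Fixed(bytes) => Ok(bytes),
            ArrayBytesSketch::Variable(..) => Err(WrongVariant),
        }
    }

    /// Unwrap variable length bytes and offsets, failing for fixed length data.
    pub fn into_variable(self) -> Result<(Cow<'a, [u8]>, Cow<'a, [usize]>), WrongVariant> {
        match self {
            ArrayBytesSketch::Variable(bytes, offsets) => Ok((bytes, offsets)),
            ArrayBytesSketch::Fixed(_) => Err(WrongVariant),
        }
    }
}
```

Returning a `Result` from the accessors forces call sites such as the sharded example to state which representation they expect, so a `u16` code path cannot silently reinterpret variable length data.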