Variable length data types (#40)
* Add support for variable length data types

Data type sizes are now represented by `DataTypeSize` instead of usize

Also adds `ArraySize`.
Both are `Fixed` only for now. Need to support `Variable` throughout the codebase.

Change codec API in prep for variable sized data types

Enable `{Array,DataType}Size::Variable`

Implement `CowArrayBytes::validate()` and add `CodecError::InvalidVariableSizedArrayOffsets`

Use `CowArrayBytes::validate()`

impl `From` for `CowArrayBytes` for various types

Array `_element` methods now use `T: Element`

Add `vlen` codec metadata

Fix codecs bench

Implement an experimental vlen codec

Use `impl Into<ArrayBytesCow<'a>>` in array methods

Use `RawBytesCow` consistently

Remove various vlen todo's

Cleanup `ArrayBytes`

Use `ArrayError::InvalidElementValue` for invalid string encodings

Add `ArraySubset::contains()`

Add `FillValue::new_empty()`

Add remaining vlen support to array `store_` methods and improve vlen validation

Add remaining vlen support to array `retrieve_` methods

Partial decoding in the vlen filter

Fix async vlen errors

Sharding codec vlen support

Add vlen support to sharding partial decoder

vlen support for sharded_readable_ext

`offsets_u64_to_usize` handle 32-bit system

Minor FillValue doc update

Remove unused ArraySubset methods and add related convenience functions

Add cities test

Add `Arrow32` vlen encoding

Add support for Interleave32 (Zarr V2) vlen encoding

fmt

clippy

Set minimum version for num-complex

Fix `ArrayBytes` from `&[u8; N]` for rust < 1.77

Add `binary` data type

Vlen improve docs and test various encodings.

Fix `cities.csv` encoding.

`vlen` change encoding names

Validate `vlen` codec `length32` encoding against `zarr-python` v2

Don't store `zarrs` metadata in cities test output

Split `vlen` into `vlen` and `vlen_interleaved`

Vlen supports separate index/data encoding with full codec chains.

Fix typesize in vlen `index_codecs` metadata

Add support for `String` fill value metadata

Add `FillValueMetadata::Unsupported`

`ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.

vlen cleanup

Change vlen codec identifiers given they are experimental

Move duplicate `extract_decoded_regions` fn into `array_bytes`

+ other minor changes

Minor vlen_partial_decoder cleanup

Add support for `zarr-python` nonconformant `|O` V2 data type

Support conversion of Zarr V2 arrays with `vlen-*` codecs to V3

Update root docs for new vlen related codecs/data types

Cleanup `get_vlen_bytes_and_offsets`

* Fix store value truncation

* Add `ArraySize::new`, fix `ArrayBytes::new_fill_value`, fix `FillValue::equals_all` with vlen data

* Add `array_write_read_string` example

* Rename `vlen_interleaved` to `vlen_v2`

* Fmt pass
LDeakin authored Jul 25, 2024
1 parent d54b89d commit 649abe1
Showing 183 changed files with 53,233 additions and 2,208 deletions.
36 changes: 35 additions & 1 deletion CHANGELOG.md
@@ -7,19 +7,53 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added
- Add `ArrayBytes`, `RawBytes`, `RawBytesOffsets`, and `ArrayBytesError`
- These can represent array data with fixed and variable length data types
- Add `array::Element[Owned]` traits representing array elements
- Supports conversion to and from `ArrayBytes`
- Add `array::ElementFixedLength` marker trait
- Add experimental `vlen` and `vlen_v2` codec for variable length data types
- `vlen_v2` is for legacy support of Zarr V2 `vlen-utf8`/`vlen-bytes`/`vlen-array` codecs
- Add `DataType::{String,Binary}` data types
- These are likely to become standardised in the future and are not feature gated
- Add `ArraySubset::contains()`
- Add `FillValueMetadata::{String,Unsupported}`
- `ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.
- Implement `From<{[u8; N],&[u8; N],String,&str}>` for `FillValue`
- Add `ArraySize` and `DataTypeSize`
- Add `DataType::fixed_size()` that returns `Option<usize>`. Returns `None` for variable length data types.
- Add `ArrayError::IncompatibleElementType` (replaces `ArrayError::IncompatibleElementSize`)
- Add `ArrayError::InvalidElementValue`
- Add `ChunkShape::num_elements_u64`
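
As a rough illustration of how variable length array data can be held as concatenated element bytes plus offsets, here is a std-only sketch. It is not the `zarrs` implementation: `VariableBytes`, `encode_strings`, and `validate_offsets` are hypothetical names, and the validation rules shown (a final offset equal to the byte length, non-decreasing offsets) are assumptions about what such a check would reasonably cover.

```rust
/// Hypothetical stand-in for a variable length payload:
/// concatenated element bytes plus `n + 1` offsets for `n` elements.
#[derive(Debug, PartialEq)]
pub struct VariableBytes {
    pub bytes: Vec<u8>,
    pub offsets: Vec<usize>,
}

/// Concatenate the elements and record where each one starts and ends.
pub fn encode_strings(elements: &[&str]) -> VariableBytes {
    let mut bytes = Vec::new();
    let mut offsets = Vec::with_capacity(elements.len() + 1);
    offsets.push(0);
    for element in elements {
        bytes.extend_from_slice(element.as_bytes());
        offsets.push(bytes.len());
    }
    VariableBytes { bytes, offsets }
}

/// Check that the offsets are non-empty, end at `bytes.len()`,
/// and never decrease (assumed rules, for illustration only).
pub fn validate_offsets(data: &VariableBytes) -> Result<(), String> {
    match data.offsets.last() {
        None => return Err("offsets must contain at least one entry".to_string()),
        Some(&last) if last != data.bytes.len() => {
            return Err(format!(
                "last offset {} does not match byte length {}",
                last,
                data.bytes.len()
            ));
        }
        Some(_) => {}
    }
    if data.offsets.windows(2).any(|pair| pair[0] > pair[1]) {
        return Err("offsets must be non-decreasing".to_string());
    }
    Ok(())
}
```

Element `i` occupies `bytes[offsets[i]..offsets[i + 1]]`, so four strings produce five offsets.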

### Changed
- Use `[async_]retrieve_array_subset_opt` internally in `Array::[async_]retrieve_chunks_opt`
- **Breaking**: Replace `[Async]ArrayPartialDecoderTraits::element_size()` with `data_type()`
- Array `_store` methods now use `impl Into<ArrayBytes<'a>>` instead of `&[u8]` for the input bytes
- **Breaking**: Array `_store_{elements,ndarray}` methods now use `T: Element` instead of `T: bytemuck::Pod`
- **Breaking**: Array `_retrieve_{elements,ndarray}` methods now use `T: ElementOwned` instead of `T: bytemuck::Pod`
- Optimised `Array::[async_]store_array_subset_opt` when the subset is a subset of a single chunk
- Make `transmute_to_bytes` public
- Relax `ndarray_into_vec` from `T: bytemuck::Pod` to `T: Clone`
- **Breaking**: `DataType::size()` now returns a `DataTypeSize` instead of `usize`
- **Breaking**: `ArrayCodecTraits::{encode/decode}` have been specialised into `ArrayTo{Array,Bytes}CodecTraits::{encode/decode}`
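
To show why a size type with a variable case changes calling code, here is a minimal std-only sketch. `SizeSketch` and `fixed_byte_len` are hypothetical names; only the `Fixed`/`Variable` split and the `Option<usize>` result of `fixed_size()` come from the changelog, and the `usize` payload of `Fixed` is an assumption.

```rust
/// Hypothetical mirror of a size type with fixed and variable cases.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SizeSketch {
    /// Every element occupies the same number of bytes.
    Fixed(usize),
    /// Element sizes differ, so no single size exists.
    Variable,
}

impl SizeSketch {
    /// `Some(size)` for fixed size types, `None` for variable length ones.
    pub fn fixed_size(self) -> Option<usize> {
        match self {
            SizeSketch::Fixed(size) => Some(size),
            SizeSketch::Variable => None,
        }
    }
}

/// Total byte length of `num_elements` elements when it is known up front.
/// Returns `None` for variable length types and on arithmetic overflow.
pub fn fixed_byte_len(size: SizeSketch, num_elements: usize) -> Option<usize> {
    size.fixed_size()?.checked_mul(num_elements)
}
```

Callers that previously multiplied a `usize` element size by an element count now have to handle the case where no such size exists.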

### Removed
- **Breaking**: Remove `into_array_view` array and codec API
- This was not fully utilised, not applicable to variable sized data types, and quite unsafe for a public API
- Remove internal `ChunksPerShardError` and just use `CodecError::Other`
- **Breaking**: Remove internal `ChunksPerShardError` and just use `CodecError::Other`
- **Breaking**: Remove `array_subset::{ArrayExtractBytesError,ArrayStoreBytesError}`
- **Breaking**: Remove `ArraySubset::{extract,store}_bytes[_unchecked]`, they are replaced by methods in `ArrayBytes`
- **Breaking**: Remove `array::validate_element_size` and `ArrayError::IncompatibleElementSize`
- The internal validation in array `_element` methods is now more strict than just matching the element size
- Example: `u16` must match `uint16` data type and will not match `int16` or `float16`
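
The difference between a size-only check and a type-aware check can be sketched as follows. This is an illustration only: `DataTypeSketch`, `ElementSketch`, `size_only_check`, and `strict_check` are hypothetical names and do not reflect how `zarrs` implements its `Element` trait.

```rust
/// Hypothetical subset of data types that all occupy two bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataTypeSketch {
    Int16,
    UInt16,
    Float16,
}

impl DataTypeSketch {
    /// Size in bytes of one element of this data type.
    pub fn size(self) -> usize {
        match self {
            DataTypeSketch::Int16 | DataTypeSketch::UInt16 | DataTypeSketch::Float16 => 2,
        }
    }
}

/// Hypothetical trait: each Rust type declares which data type it matches.
pub trait ElementSketch {
    fn matches(data_type: DataTypeSketch) -> bool;
}

impl ElementSketch for u16 {
    fn matches(data_type: DataTypeSketch) -> bool {
        data_type == DataTypeSketch::UInt16
    }
}

impl ElementSketch for i16 {
    fn matches(data_type: DataTypeSketch) -> bool {
        data_type == DataTypeSketch::Int16
    }
}

/// Older style check: only the element size has to agree.
pub fn size_only_check<T>(data_type: DataTypeSketch) -> bool {
    std::mem::size_of::<T>() == data_type.size()
}

/// Stricter check: the Rust type has to correspond to the data type itself.
pub fn strict_check<T: ElementSketch>(data_type: DataTypeSketch) -> bool {
    T::matches(data_type)
}
```

Under the size-only check, `u16` is accepted for all three data types. Under the strict check it is accepted only for `UInt16`.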

### Fixed
- Fix an unnecessary copy in `async_store_set_partial_values`
- Fix error when `bytes` metadata is encoded without a configuration, even if empty
- Fix an error in `ChunkGrid` docs
- Fixed `[async_]store_set_partial_values` and `MemoryStore::set` to correctly truncate the bytes of store value if they shrink

## [0.15.1] - 2024-07-11

11 changes: 10 additions & 1 deletion Cargo.toml
@@ -43,7 +43,7 @@ async-lock = { version = "3.2.0", optional = true }
async-recursion = { version = "1.0.5", optional = true }
async-trait = { version = "0.1.74", optional = true }
blosc-sys = { version = "0.3.4", package = "blosc-src", features = ["snappy", "lz4", "zlib", "zstd"], optional = true }
bytemuck = { version = "1.14.0", features = ["extern_crate_alloc", "must_cast"] }
bytemuck = { version = "1.14.0", features = ["extern_crate_alloc", "must_cast", "min_const_generics"] }
bytes = "1.6.0"
bzip2 = { version = "0.4.4", optional = true, features = ["static"] }
crc32c = { version = "0.6.5", optional = true }
@@ -75,6 +75,10 @@ zfp-sys = {version = "0.1.15", features = ["static"], optional = true }
zip = { version = "2.1.3", optional = true }
zstd = { version = "0.13.1", features = ["zstdmt"], optional = true }

[dependencies.num-complex]
version = "0.4.3"
features = ["bytemuck"]

[dev-dependencies]
chrono = "0.4"
criterion = "0.5.1"
@@ -93,6 +97,11 @@ name = "array_write_read_ndarray"
required-features = ["ndarray"]
doc-scrape-examples = true

[[example]]
name = "array_write_read_string"
required-features = ["ndarray"]
doc-scrape-examples = true

[[example]]
name = "async_array_write_read"
required-features = ["ndarray", "async", "object_store"]
8 changes: 5 additions & 3 deletions benches/codecs.rs
@@ -8,9 +8,10 @@ use zarrs::array::{
codec::{
array_to_bytes::bytes::Endianness,
bytes_to_bytes::blosc::{BloscCompressor, BloscShuffleMode},
ArrayCodecTraits, BloscCodec, BytesCodec, BytesToBytesCodecTraits, CodecOptions,
ArrayCodecTraits, ArrayToBytesCodecTraits, BloscCodec, BytesCodec, BytesToBytesCodecTraits,
CodecOptions,
},
BytesRepresentation, ChunkRepresentation, DataType,
BytesRepresentation, ChunkRepresentation, DataType, Element,
};

fn codec_bytes(c: &mut Criterion) {
@@ -35,12 +36,13 @@ fn codec_bytes(c: &mut Criterion) {
.unwrap();

let data = vec![0u8; size3.try_into().unwrap()];
let bytes = Element::into_array_bytes(&DataType::UInt8, &data).unwrap();
group.throughput(Throughput::Bytes(size3));
// encode and decode have the same implementation
group.bench_function(BenchmarkId::new("encode_decode", size3), |b| {
b.iter(|| {
codec
.encode(Cow::Borrowed(&data), &rep, &CodecOptions::default())
.encode(bytes.clone(), &rep, &CodecOptions::default())
.unwrap()
});
});
44 changes: 25 additions & 19 deletions doc/status/codecs.md
@@ -1,16 +1,18 @@
| Codec Type | Codec<sup>†</sup> | ZEP | V3 | V2 | Feature Flag* |
| -------------- | ------------------------- | ----------------- | ------- | ------- | ------------- |
| Array to Array | [transpose] | [ZEP0001] | &check; | | **transpose** |
| | [bitround] (experimental) | | &check; | | bitround |
| Array to Bytes | [bytes] | [ZEP0001] | &check; | | |
| | [sharding_indexed] | [ZEP0002] | &check; | | **sharding** |
| | [zfp] (experimental) | | &check; | | zfp |
| | [pcodec] (experimental) | | &check; | | pcodec |
| Bytes to Bytes | [blosc] | [ZEP0001] | &check; | &check; | **blosc** |
| | [gzip] | [ZEP0001] | &check; | &check; | **gzip** |
| | [crc32c] | [ZEP0002] | &check; | | **crc32c** |
| | [zstd] | [zarr-specs #256] | &check; | | zstd |
| | [bz2] (experimental) | | &check; | &check; | bz2 |
| Codec Type | Codec<sup>†</sup> | ZEP | V3 | V2 | Feature Flag* |
| -------------- | ------------------------------------------------- | ----------------- | ------- | ------- | ------------- |
| Array to Array | [transpose] | [ZEP0001] | &check; | | **transpose** |
| | [bitround] (experimental) | | &check; | | bitround |
| Array to Bytes | [bytes] | [ZEP0001] | &check; | | |
| | [sharding_indexed] | [ZEP0002] | &check; | | **sharding** |
| | [zfp] (experimental) | | &check; | | zfp |
| | [pcodec] (experimental) | | &check; | | pcodec |
| | [vlen] (experimental) | | &check; | | |
| | [vlen_v2] (experimental)<br>`vlen-*` in Zarr V2 | | &check; | &check; | |
| Bytes to Bytes | [blosc] | [ZEP0001] | &check; | &check; | **blosc** |
| | [gzip] | [ZEP0001] | &check; | &check; | **gzip** |
| | [crc32c] | [ZEP0002] | &check; | | **crc32c** |
| | [zstd] | [zarr-specs #256] | &check; | | zstd |
| | [bz2] (experimental) | | &check; | &check; | bz2 |

<sup>\* Bolded feature flags are part of the default set of features.</sup>
<br>
@@ -31,12 +33,16 @@
[crc32c]: crate::array::codec::bytes_to_bytes::crc32c
[zstd]: crate::array::codec::bytes_to_bytes::zstd
[bz2]: crate::array::codec::bytes_to_bytes::bz2
[vlen]: crate::array::codec::array_to_bytes::vlen
[vlen_v2]: crate::array::codec::array_to_bytes::vlen_v2

The `"name"` of experimental codecs in array metadata links to the codec documentation in this crate.

| Experimental Codec | Name / URI |
| ------------------ | ------------------------------------------------- |
| `bitround` | <https://codec.zarrs.dev/array_to_array/bitround> |
| `zfp` | <https://codec.zarrs.dev/array_to_bytes/zfp> |
| `pcodec` | <https://codec.zarrs.dev/array_to_bytes/pcodec> |
| `bz2` | <https://codec.zarrs.dev/bytes_to_bytes/bz2> |
| Experimental Codec | Name / URI |
| ------------------ | -------------------------------------------------------- |
| `bitround` | <https://codec.zarrs.dev/array_to_array/bitround> |
| `zfp` | <https://codec.zarrs.dev/array_to_bytes/zfp> |
| `pcodec` | <https://codec.zarrs.dev/array_to_bytes/pcodec> |
| `bz2` | <https://codec.zarrs.dev/bytes_to_bytes/bz2> |
| `vlen` | <https://codec.zarrs.dev/array_to_array/vlen> |
| `vlen_v2` | <https://codec.zarrs.dev/array_to_array/zfp_interleaved> |
9 changes: 8 additions & 1 deletion doc/status/data_types.md
@@ -1,8 +1,12 @@
| Data Type | ZEP | V3 | V2 | Feature Flag |
| Data Type<sup>†</sup> | ZEP | V3 | V2 | Feature Flag |
| --------- | --- | ----- | -- | ------------ |
| [bool]<br>[int8] [int16] [int32] [int64] [uint8] [uint16] [uint32] [uint64]<br>[float16] [float32] [float64]<br>[complex64] [complex128] | [ZEP0001] | &check; | &check; | |
[r* (raw bits)] | [ZEP0001] | &check; | | |
| [bfloat16] | [zarr-specs #130] | &check; | | |
| [string] (experimental) | [ZEP0007 (draft)] | &check; | | |
| [binary] (experimental) | [ZEP0007 (draft)] | &check; | | |

<sup>† Experimental data types are recommended for evaluation only.</sup>

[bool]: crate::array::data_type::DataType::Bool
[int8]: crate::array::data_type::DataType::Int8
@@ -20,6 +24,9 @@
[complex128]: crate::array::data_type::DataType::Complex128
[bfloat16]: crate::array::data_type::DataType::BFloat16
[r* (raw bits)]: crate::array::data_type::DataType::RawBits
[string]: crate::array::data_type::DataType::String
[binary]: crate::array::data_type::DataType::Binary

[ZEP0001]: https://zarr.dev/zeps/accepted/ZEP0001.html
[zarr-specs #130]: https://github.com/zarr-developers/zarr-specs/issues/130
[ZEP0007 (draft)]: https://github.com/zarr-developers/zeps/pull/47
119 changes: 119 additions & 0 deletions examples/array_write_read_string.rs
@@ -0,0 +1,119 @@
use itertools::Itertools;
use ndarray::{array, Array2, ArrayD};
use zarrs::storage::{
storage_transformer::{StorageTransformerExtension, UsageLogStorageTransformer},
ReadableWritableListableStorage,
};

fn array_write_read() -> Result<(), Box<dyn std::error::Error>> {
use std::sync::Arc;
use zarrs::{
array::{DataType, FillValue},
array_subset::ArraySubset,
storage::store,
};

// Create a store
// let path = tempfile::TempDir::new()?;
// let mut store: ReadableWritableListableStorage = Arc::new(store::FilesystemStore::new(path.path())?);
// let mut store: ReadableWritableListableStorage = Arc::new(store::FilesystemStore::new(
// "tests/data/array_write_read.zarr",
// )?);
let mut store: ReadableWritableListableStorage = Arc::new(store::MemoryStore::new());
if let Some(arg1) = std::env::args().collect::<Vec<_>>().get(1) {
if arg1 == "--usage-log" {
let log_writer = Arc::new(std::sync::Mutex::new(
// std::io::BufWriter::new(
std::io::stdout(),
// )
));
let usage_log = Arc::new(UsageLogStorageTransformer::new(log_writer, || {
chrono::Utc::now().format("[%T%.3f] ").to_string()
}));
store = usage_log
.clone()
.create_readable_writable_listable_transformer(store);
}
}

// Create a group
let group_path = "/group";
let mut group = zarrs::group::GroupBuilder::new().build(store.clone(), group_path)?;

// Update group metadata
group
.attributes_mut()
.insert("foo".into(), serde_json::Value::String("bar".into()));

// Write group metadata to store
group.store_metadata()?;

println!(
"The group metadata is:\n{}\n",
serde_json::to_string_pretty(&group.metadata()).unwrap()
);

// Create an array
let array_path = "/group/array";
let array = zarrs::array::ArrayBuilder::new(
vec![4, 4], // array shape
DataType::String,
vec![2, 2].try_into()?, // regular chunk shape
FillValue::from("_"),
)
// .bytes_to_bytes_codecs(vec![]) // uncompressed
.dimension_names(["y", "x"].into())
// .storage_transformers(vec![].into())
.build(store.clone(), array_path)?;

// Write array metadata to store
array.store_metadata()?;

println!(
"The array metadata is:\n{}\n",
serde_json::to_string_pretty(&array.metadata()).unwrap()
);

// Write some chunks
array.store_chunk_ndarray(
&[0, 0],
ArrayD::<&str>::from_shape_vec(vec![2, 2], vec!["a", "bb", "ccc", "dddd"]).unwrap(),
)?;
array.store_chunk_ndarray(
&[0, 1],
ArrayD::<&str>::from_shape_vec(vec![2, 2], vec!["4444", "333", "22", "1"]).unwrap(),
)?;
let subset_all = ArraySubset::new_with_shape(array.shape().to_vec());
let data_all = array.retrieve_array_subset_ndarray::<String>(&subset_all)?;
println!("store_chunk [0, 0] and [0, 1]:\n{data_all}\n");

// Write a subset spanning multiple chunks, including updating chunks already written
let ndarray_subset: Array2<&str> = array![["!", "@@"], ["###", "$$$$"]];
array.store_array_subset_ndarray(
ArraySubset::new_with_ranges(&[1..3, 1..3]).start(),
ndarray_subset,
)?;
let data_all = array.retrieve_array_subset_ndarray::<String>(&subset_all)?;
println!("store_array_subset [1..3, 1..3]:\nndarray::ArrayD<String>\n{data_all}");

// Retrieve bytes directly, convert into a single string allocation, create a &str ndarray
// TODO: Add a convenience function for this?
let data_all = array.retrieve_array_subset(&subset_all)?;
let (bytes, offsets) = data_all.into_variable()?;
let string = String::from_utf8(bytes.into_owned())?;
let elements = offsets
.iter()
.tuple_windows()
.map(|(&curr, &next)| &string[curr..next])
.collect::<Vec<&str>>();
let ndarray = ArrayD::<&str>::from_shape_vec(subset_all.shape_usize(), elements)?;
println!("ndarray::ArrayD<&str>:\n{ndarray}");

Ok(())
}

fn main() {
if let Err(err) = array_write_read() {
println!("{:?}", err);
}
}
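
The last step of the example above, turning one string allocation plus offsets into borrowed `&str` elements, can also be sketched with the standard library alone. This is a hedged illustration rather than a `zarrs` API: `split_by_offsets` is a hypothetical helper, and it uses `slice::windows` in place of `itertools::tuple_windows` and `str::get` in place of indexing so that bad offsets yield `None` instead of a panic.

```rust
/// Hypothetical helper: view each element of `buffer` as a `&str` using
/// `n + 1` offsets for `n` elements. Returns `None` if any range is out of
/// bounds, reversed, or does not fall on a UTF-8 character boundary.
pub fn split_by_offsets<'a>(buffer: &'a str, offsets: &[usize]) -> Option<Vec<&'a str>> {
    offsets
        .windows(2)
        // `str::get` performs the bounds and character boundary checks.
        .map(|pair| buffer.get(pair[0]..pair[1]))
        .collect()
}
```

Because the returned slices borrow from `buffer`, no per-element allocation is needed, which is the same property the example relies on when it builds an `ArrayD<&str>`.
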
12 changes: 6 additions & 6 deletions examples/sharded_array_write_read.rs
@@ -137,15 +137,15 @@ fn sharded_array_write_read() -> Result<(), Box<dyn std::error::Error>> {
ArraySubset::new_with_start_shape(vec![0, 4], inner_chunk_shape.clone())?,
];
let decoded_inner_chunks_bytes = partial_decoder.partial_decode(&inner_chunks_to_decode)?;
let decoded_inner_chunks_ndarray = decoded_inner_chunks_bytes
.into_iter()
.map(|bytes| bytes_to_ndarray::<u16>(&inner_chunk_shape, bytes.to_vec()))
.collect::<Result<Vec<_>, _>>()?;
println!("Decoded inner chunks:");
for (inner_chunk_subset, decoded_inner_chunk) in
std::iter::zip(inner_chunks_to_decode, decoded_inner_chunks_ndarray)
std::iter::zip(inner_chunks_to_decode, decoded_inner_chunks_bytes)
{
println!("{inner_chunk_subset}\n{decoded_inner_chunk}\n");
let ndarray = bytes_to_ndarray::<u16>(
&inner_chunk_shape,
decoded_inner_chunk.into_fixed()?.into_owned(),
)?;
println!("{inner_chunk_subset}\n{ndarray}\n");
}

// Show the hierarchy