Variable length data types #40

Merged 6 commits on Jul 25, 2024
Changes from all commits
36 changes: 35 additions & 1 deletion CHANGELOG.md
@@ -7,19 +7,53 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added
- Add `ArrayBytes`, `RawBytes`, `RawBytesOffsets`, and `ArrayBytesError`
- These can represent array data with fixed and variable length data types
- Add `array::Element[Owned]` traits representing array elements
- Supports conversion to and from `ArrayBytes`
- Add `array::ElementFixedLength` marker trait
- Add experimental `vlen` and `vlen_v2` codecs for variable length data types
- `vlen_v2` is for legacy support of Zarr V2 `vlen-utf8`/`vlen-bytes`/`vlen-array` codecs
- Add `DataType::{String,Binary}` data types
- These are likely to become standardised in the future and are not feature gated
- Add `ArraySubset::contains()`
- Add `FillValueMetadata::{String,Unsupported}`
- `ArrayMetadata` can be serialised and deserialised with an unsupported `fill_value`, but `Array` creation will fail.
- Implement `From<{[u8; N],&[u8; N],String,&str}>` for `FillValue`
- Add `ArraySize` and `DataTypeSize`
- Add `DataType::fixed_size()` that returns `Option<usize>`. Returns `None` for variable length data types.
- Add `ArrayError::IncompatibleElementType` (replaces `ArrayError::IncompatibleElementSize`)
- Add `ArrayError::InvalidElementValue`
- Add `ChunkShape::num_elements_u64`
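
The variable-length representation behind `ArrayBytes`/`RawBytesOffsets` can be sketched as follows. This is an illustrative stand-in only, not the zarrs API: the function name and layout are simplified assumptions based on the common bytes-plus-offsets encoding.

```rust
// Illustrative sketch (not the zarrs API): variable-length elements are
// stored as one contiguous byte buffer plus n+1 offsets, where element i
// spans bytes[offsets[i]..offsets[i + 1]].
fn element<'a>(bytes: &'a [u8], offsets: &[usize], i: usize) -> &'a [u8] {
    &bytes[offsets[i]..offsets[i + 1]]
}

fn main() {
    // Three strings "a", "bb", "ccc" packed into one buffer.
    let bytes = b"abbccc";
    let offsets = [0, 1, 3, 6];
    assert_eq!(element(bytes, &offsets, 1), b"bb");
    assert_eq!(element(bytes, &offsets, 2), b"ccc");
    println!("ok");
}
```

This layout is why `DataType::fixed_size()` returns `None` for variable length data types: the per-element size is only known from the offsets.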

### Changed
- Use `[async_]retrieve_array_subset_opt` internally in `Array::[async_]retrieve_chunks_opt`
- **Breaking**: Replace `[Async]ArrayPartialDecoderTraits::element_size()` with `data_type()`
- Array `_store` methods now use `impl Into<ArrayBytes<'a>>` instead of `&[u8]` for the input bytes
- **Breaking**: Array `_store_{elements,ndarray}` methods now use `T: Element` instead of `T: bytemuck::Pod`
- **Breaking**: Array `_retrieve_{elements,ndarray}` methods now use `T: ElementOwned` instead of `T: bytemuck::Pod`
- Optimised `Array::[async_]store_array_subset_opt` when the subset is a subset of a single chunk
- Make `transmute_to_bytes` public
- Relax `ndarray_into_vec` from `T: bytemuck::Pod` to `T: Clone`
- **Breaking**: `DataType::size()` now returns a `DataTypeSize` instead of `usize`
- **Breaking**: `ArrayCodecTraits::{encode/decode}` have been specialised into `ArrayTo{Array,Bytes}CodecTraits::{encode/decode}`

### Removed
- **Breaking**: Remove `into_array_view` array and codec API
- This was not fully utilised, not applicable to variable sized data types, and quite unsafe for a public API
- **Breaking**: Remove internal `ChunksPerShardError` and just use `CodecError::Other`
- **Breaking**: Remove `array_subset::{ArrayExtractBytesError,ArrayStoreBytesError}`
- **Breaking**: Remove `ArraySubset::{extract,store}_bytes[_unchecked]`, they are replaced by methods in `ArrayBytes`
- **Breaking**: Remove `array::validate_element_size` and `ArrayError::IncompatibleElementSize`
- The internal validation in array `_element` methods is now more strict than just matching the element size
- Example: `u16` must match `uint16` data type and will not match `int16` or `float16`
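
The stricter validation can be sketched like this. It is an illustrative stand-in, not the zarrs implementation; the type pairings shown are assumptions extrapolated from the `u16`/`uint16` example above.

```rust
// Illustrative sketch of strict element-type validation (not the zarrs
// code): each Rust element type matches exactly one data type, so
// matching on element size alone is no longer sufficient.
fn element_type_matches(rust_type: &str, data_type: &str) -> bool {
    matches!(
        (rust_type, data_type),
        ("u16", "uint16") | ("i16", "int16") | ("f32", "float32")
    )
}

fn main() {
    assert!(element_type_matches("u16", "uint16"));
    // Same size (2 bytes), but no longer accepted:
    assert!(!element_type_matches("u16", "int16"));
    assert!(!element_type_matches("u16", "float16"));
    println!("ok");
}
```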

### Fixed
- Fix an unnecessary copy in `async_store_set_partial_values`
- Fix error when `bytes` metadata is encoded without a configuration, even if empty
- Fix an error in `ChunkGrid` docs
- Fixed `[async_]store_set_partial_values` and `MemoryStore::set` to correctly truncate the bytes of a store value when it shrinks
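
The truncate-on-shrink behaviour this fixes can be illustrated with a minimal sketch (simplified, not the actual `MemoryStore` code):

```rust
// Sketch of the fixed set semantics: writing a shorter value must
// truncate the stored bytes rather than leave stale trailing bytes
// from a longer previous value in place.
fn set(stored: &mut Vec<u8>, value: &[u8]) {
    stored.clear(); // truncate before writing the new bytes
    stored.extend_from_slice(value);
}

fn main() {
    let mut stored = Vec::new();
    set(&mut stored, b"a longer value");
    set(&mut stored, b"short");
    assert_eq!(stored, b"short".to_vec()); // no stale trailing bytes
    println!("ok");
}
```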

## [0.15.1] - 2024-07-11

11 changes: 10 additions & 1 deletion Cargo.toml
@@ -43,7 +43,7 @@ async-lock = { version = "3.2.0", optional = true }
async-recursion = { version = "1.0.5", optional = true }
async-trait = { version = "0.1.74", optional = true }
blosc-sys = { version = "0.3.4", package = "blosc-src", features = ["snappy", "lz4", "zlib", "zstd"], optional = true }
bytemuck = { version = "1.14.0", features = ["extern_crate_alloc", "must_cast"] }
bytemuck = { version = "1.14.0", features = ["extern_crate_alloc", "must_cast", "min_const_generics"] }
bytes = "1.6.0"
bzip2 = { version = "0.4.4", optional = true, features = ["static"] }
crc32c = { version = "0.6.5", optional = true }
@@ -75,6 +75,10 @@ zfp-sys = {version = "0.1.15", features = ["static"], optional = true }
zip = { version = "2.1.3", optional = true }
zstd = { version = "0.13.1", features = ["zstdmt"], optional = true }

[dependencies.num-complex]
version = "0.4.3"
features = ["bytemuck"]

[dev-dependencies]
chrono = "0.4"
criterion = "0.5.1"
@@ -93,6 +97,11 @@ name = "array_write_read_ndarray"
required-features = ["ndarray"]
doc-scrape-examples = true

[[example]]
name = "array_write_read_string"
required-features = ["ndarray"]
doc-scrape-examples = true

[[example]]
name = "async_array_write_read"
required-features = ["ndarray", "async", "object_store"]
8 changes: 5 additions & 3 deletions benches/codecs.rs
@@ -8,9 +8,10 @@ use zarrs::array::{
codec::{
array_to_bytes::bytes::Endianness,
bytes_to_bytes::blosc::{BloscCompressor, BloscShuffleMode},
ArrayCodecTraits, BloscCodec, BytesCodec, BytesToBytesCodecTraits, CodecOptions,
ArrayCodecTraits, ArrayToBytesCodecTraits, BloscCodec, BytesCodec, BytesToBytesCodecTraits,
CodecOptions,
},
BytesRepresentation, ChunkRepresentation, DataType,
BytesRepresentation, ChunkRepresentation, DataType, Element,
};

fn codec_bytes(c: &mut Criterion) {
@@ -35,12 +36,13 @@ fn codec_bytes(c: &mut Criterion) {
.unwrap();

let data = vec![0u8; size3.try_into().unwrap()];
let bytes = Element::into_array_bytes(&DataType::UInt8, &data).unwrap();
group.throughput(Throughput::Bytes(size3));
// encode and decode have the same implementation
group.bench_function(BenchmarkId::new("encode_decode", size3), |b| {
b.iter(|| {
codec
.encode(Cow::Borrowed(&data), &rep, &CodecOptions::default())
.encode(bytes.clone(), &rep, &CodecOptions::default())
.unwrap()
});
});
44 changes: 25 additions & 19 deletions doc/status/codecs.md
@@ -1,16 +1,18 @@
| Codec Type | Codec<sup>†</sup> | ZEP | V3 | V2 | Feature Flag* |
| -------------- | ------------------------------------------------- | ----------------- | ------- | ------- | ------------- |
| Array to Array | [transpose] | [ZEP0001] | &check; | | **transpose** |
| | [bitround] (experimental) | | &check; | | bitround |
| Array to Bytes | [bytes] | [ZEP0001] | &check; | | |
| | [sharding_indexed] | [ZEP0002] | &check; | | **sharding** |
| | [zfp] (experimental) | | &check; | | zfp |
| | [pcodec] (experimental) | | &check; | | pcodec |
| | [vlen] (experimental) | | &check; | | |
| | [vlen_v2] (experimental)<br>`vlen-*` in Zarr V2 | | &check; | &check; | |
| Bytes to Bytes | [blosc] | [ZEP0001] | &check; | &check; | **blosc** |
| | [gzip] | [ZEP0001] | &check; | &check; | **gzip** |
| | [crc32c] | [ZEP0002] | &check; | | **crc32c** |
| | [zstd] | [zarr-specs #256] | &check; | | zstd |
| | [bz2] (experimental) | | &check; | &check; | bz2 |

<sup>\* Bolded feature flags are part of the default set of features.</sup>
<br>
@@ -31,12 +33,16 @@
[crc32c]: crate::array::codec::bytes_to_bytes::crc32c
[zstd]: crate::array::codec::bytes_to_bytes::zstd
[bz2]: crate::array::codec::bytes_to_bytes::bz2
[vlen]: crate::array::codec::array_to_bytes::vlen
[vlen_v2]: crate::array::codec::array_to_bytes::vlen_v2

The `"name"` of experimental codecs in array metadata links to the codec documentation in this crate.

| Experimental Codec | Name / URI |
| ------------------ | -------------------------------------------------------- |
| `bitround` | <https://codec.zarrs.dev/array_to_array/bitround> |
| `zfp` | <https://codec.zarrs.dev/array_to_bytes/zfp> |
| `pcodec` | <https://codec.zarrs.dev/array_to_bytes/pcodec> |
| `bz2` | <https://codec.zarrs.dev/bytes_to_bytes/bz2> |
| `vlen`             | <https://codec.zarrs.dev/array_to_bytes/vlen>            |
| `vlen_v2`          | <https://codec.zarrs.dev/array_to_bytes/vlen_v2>         |
9 changes: 8 additions & 1 deletion doc/status/data_types.md
@@ -1,8 +1,12 @@
| Data Type<sup>†</sup> | ZEP | V3 | V2 | Feature Flag |
| --------- | --- | ----- | -- | ------------ |
| [bool]<br>[int8] [int16] [int32] [int64] [uint8] [uint16] [uint32] [uint64]<br>[float16] [float32] [float64]<br>[complex64] [complex128] | [ZEP0001] | &check; | &check; | |
| [r* (raw bits)] | [ZEP0001] | &check; | | |
| [bfloat16] | [zarr-specs #130] | &check; | | |
| [string] (experimental) | [ZEP0007 (draft)] | &check; | | |
| [binary] (experimental) | [ZEP0007 (draft)] | &check; | | |

<sup>† Experimental data types are recommended for evaluation only.</sup>

[bool]: crate::array::data_type::DataType::Bool
[int8]: crate::array::data_type::DataType::Int8
@@ -20,6 +24,9 @@
[complex128]: crate::array::data_type::DataType::Complex128
[bfloat16]: crate::array::data_type::DataType::BFloat16
[r* (raw bits)]: crate::array::data_type::DataType::RawBits
[string]: crate::array::data_type::DataType::String
[binary]: crate::array::data_type::DataType::Binary

[ZEP0001]: https://zarr.dev/zeps/accepted/ZEP0001.html
[zarr-specs #130]: https://github.com/zarr-developers/zarr-specs/issues/130
[ZEP0007 (draft)]: https://github.com/zarr-developers/zeps/pull/47
119 changes: 119 additions & 0 deletions examples/array_write_read_string.rs
@@ -0,0 +1,119 @@
use itertools::Itertools;
use ndarray::{array, Array2, ArrayD};
use zarrs::storage::{
storage_transformer::{StorageTransformerExtension, UsageLogStorageTransformer},
ReadableWritableListableStorage,
};

fn array_write_read() -> Result<(), Box<dyn std::error::Error>> {
use std::sync::Arc;
use zarrs::{
array::{DataType, FillValue},
array_subset::ArraySubset,
storage::store,
};

// Create a store
// let path = tempfile::TempDir::new()?;
// let mut store: ReadableWritableListableStorage = Arc::new(store::FilesystemStore::new(path.path())?);
// let mut store: ReadableWritableListableStorage = Arc::new(store::FilesystemStore::new(
// "tests/data/array_write_read.zarr",
// )?);
let mut store: ReadableWritableListableStorage = Arc::new(store::MemoryStore::new());
if let Some(arg1) = std::env::args().collect::<Vec<_>>().get(1) {
if arg1 == "--usage-log" {
let log_writer = Arc::new(std::sync::Mutex::new(
// std::io::BufWriter::new(
std::io::stdout(),
// )
));
let usage_log = Arc::new(UsageLogStorageTransformer::new(log_writer, || {
chrono::Utc::now().format("[%T%.3f] ").to_string()
}));
store = usage_log
.clone()
.create_readable_writable_listable_transformer(store);
}
}

// Create a group
let group_path = "/group";
let mut group = zarrs::group::GroupBuilder::new().build(store.clone(), group_path)?;

// Update group metadata
group
.attributes_mut()
.insert("foo".into(), serde_json::Value::String("bar".into()));

// Write group metadata to store
group.store_metadata()?;

println!(
"The group metadata is:\n{}\n",
serde_json::to_string_pretty(&group.metadata()).unwrap()
);

// Create an array
let array_path = "/group/array";
let array = zarrs::array::ArrayBuilder::new(
vec![4, 4], // array shape
DataType::String,
vec![2, 2].try_into()?, // regular chunk shape
FillValue::from("_"),
)
// .bytes_to_bytes_codecs(vec![]) // uncompressed
.dimension_names(["y", "x"].into())
// .storage_transformers(vec![].into())
.build(store.clone(), array_path)?;

// Write array metadata to store
array.store_metadata()?;

println!(
"The array metadata is:\n{}\n",
serde_json::to_string_pretty(&array.metadata()).unwrap()
);

// Write some chunks
array.store_chunk_ndarray(
&[0, 0],
ArrayD::<&str>::from_shape_vec(vec![2, 2], vec!["a", "bb", "ccc", "dddd"]).unwrap(),
)?;
array.store_chunk_ndarray(
&[0, 1],
ArrayD::<&str>::from_shape_vec(vec![2, 2], vec!["4444", "333", "22", "1"]).unwrap(),
)?;
let subset_all = ArraySubset::new_with_shape(array.shape().to_vec());
let data_all = array.retrieve_array_subset_ndarray::<String>(&subset_all)?;
println!("store_chunk [0, 0] and [0, 1]:\n{data_all}\n");

// Write a subset spanning multiple chunks, including updating chunks already written
let ndarray_subset: Array2<&str> = array![["!", "@@"], ["###", "$$$$"]];
array.store_array_subset_ndarray(
ArraySubset::new_with_ranges(&[1..3, 1..3]).start(),
ndarray_subset,
)?;
let data_all = array.retrieve_array_subset_ndarray::<String>(&subset_all)?;
println!("store_array_subset [1..3, 1..3]:\nndarray::ArrayD<String>\n{data_all}");

// Retrieve bytes directly, convert into a single string allocation, create a &str ndarray
// TODO: Add a convenience function for this?
let data_all = array.retrieve_array_subset(&subset_all)?;
let (bytes, offsets) = data_all.into_variable()?;
let string = String::from_utf8(bytes.into_owned())?;
let elements = offsets
.iter()
.tuple_windows()
.map(|(&curr, &next)| &string[curr..next])
.collect::<Vec<&str>>();
let ndarray = ArrayD::<&str>::from_shape_vec(subset_all.shape_usize(), elements)?;
println!("ndarray::ArrayD<&str>:\n{ndarray}");

Ok(())
}

fn main() {
if let Err(err) = array_write_read() {
println!("{:?}", err);
}
}
12 changes: 6 additions & 6 deletions examples/sharded_array_write_read.rs
@@ -137,15 +137,15 @@ fn sharded_array_write_read() -> Result<(), Box<dyn std::error::Error>> {
ArraySubset::new_with_start_shape(vec![0, 4], inner_chunk_shape.clone())?,
];
let decoded_inner_chunks_bytes = partial_decoder.partial_decode(&inner_chunks_to_decode)?;
let decoded_inner_chunks_ndarray = decoded_inner_chunks_bytes
.into_iter()
.map(|bytes| bytes_to_ndarray::<u16>(&inner_chunk_shape, bytes.to_vec()))
.collect::<Result<Vec<_>, _>>()?;
println!("Decoded inner chunks:");
for (inner_chunk_subset, decoded_inner_chunk) in
std::iter::zip(inner_chunks_to_decode, decoded_inner_chunks_ndarray)
std::iter::zip(inner_chunks_to_decode, decoded_inner_chunks_bytes)
{
println!("{inner_chunk_subset}\n{decoded_inner_chunk}\n");
let ndarray = bytes_to_ndarray::<u16>(
&inner_chunk_shape,
decoded_inner_chunk.into_fixed()?.into_owned(),
)?;
println!("{inner_chunk_subset}\n{ndarray}\n");
}

// Show the hierarchy