-
-
Notifications
You must be signed in to change notification settings - Fork 314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor/rename v2 metadata fields #2301
Refactor/rename v2 metadata fields #2301
Conversation
src/zarr/core/array.py
Outdated
@@ -344,7 +347,9 @@ async def _create_v3( | |||
|
|||
array = cls(metadata=metadata, store_path=store_path) | |||
await array._save_metadata(metadata, ensure_parents=True) | |||
return array | |||
# type inference is inconsistent here and seems to conclude |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the issue might be from AsyncArray.metadata
being set to metadata_parsed = parse_array_metadata(metadata)
, which doesn't know that parse_array_metadata(ArrayV2Metadata)
should be ArrayV2Metadata
. I thought that adding an overload would fix that:
@overload
def parse_array_metadata(data: ArrayV2Metadata) -> ArrayV2Metadata: ...
@overload
def parse_array_metadata(data: ArrayV3Metadata) -> ArrayV3Metadata: ...
@overload
def parse_array_metadata(data: dict[str, JSON]) -> ArrayV2Metadata | ArrayV3Metadata: ...
def parse_array_metadata(
data: ArrayV2Metadata | ArrayV3Metadata | dict[str, JSON],
) -> ArrayV2Metadata | ArrayV3Metadata:
But no luck. I'll take a longer look later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a simple example that works:
from __future__ import annotations
import dataclasses
from typing import Generic, Literal, TypeVar, cast, overload, reveal_type
class A: ...
class B: ...
T = TypeVar("T", A, B)
@dataclasses.dataclass(frozen=True)
class Array(Generic[T]):
version: T
def __init__(self, version: A | B) -> None:
object.__setattr__(self, "version", version)
@overload
@classmethod
def create(cls, version: Literal[2]) -> Array[A]: ...
@overload
@classmethod
def create(cls, version: Literal[3]) -> Array[B]: ...
@overload
@classmethod
def create(cls, version: Literal[2, 3]) -> Array[A] | Array[B]: ...
@classmethod
def create(cls, version: Literal[2, 3]) -> Array[A] | Array[B]:
if version == 2:
return cls.create_2()
elif version == 3:
return cls.create_3()
else:
raise
@classmethod
def create_2(cls) -> Array[A]:
return Array(version=A())
@classmethod
def create_3(cls) -> Array[B]:
return Array(version=B())
reveal_type(Array(A()))
reveal_type(Array(B()))
reveal_type(Array.create_2())
reveal_type(Array.create_3())
reveal_type(Array.create(version=2))
reveal_type(Array.create(version=3))
That reveals the expected types everywhere:
❯ mypy check.py
check.py:49: note: Revealed type is "check.Array[check.A]"
check.py:50: note: Revealed type is "check.Array[check.A]"
check.py:52: note: Revealed type is "check.Array[check.A]"
check.py:53: note: Revealed type is "check.Array[check.B]"
check.py:55: note: Revealed type is "check.Array[check.A]"
check.py:56: note: Revealed type is "check.Array[check.B]"
Success: no issues found in 1 source file
I've tried making similar changes on your branch, but haven't gotten it working yet. I think the things that might matter are:
- We might need overloads for
AsyncArray.create
to ensure thatAsyncArray.create(..., zarr_format=3)
returns anAsyncArray[v3]
. likewise for 2 AsyncArray.__init__
also accepts adict
formetadata
. The actual value set is the output ofparse_arary_metadata
. We need to ensure that the specialized type isn't lost (which the overload should do, but maybe I messed that up).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what I messed up earlier, but these overloads on create
seemed to do the trick (the only differences in the signature are zarr_format
being Literal[2]
or Literal[3]
rather than Literal[2, 3]
, and the return type being specialized to v2 or v3):
diff --git a/src/zarr/core/array.py b/src/zarr/core/array.py
index 96431dec..5699935f 100644
--- a/src/zarr/core/array.py
+++ b/src/zarr/core/array.py
@@ -4,7 +4,7 @@ import json
from asyncio import gather
from dataclasses import dataclass, field, replace
from logging import getLogger
-from typing import TYPE_CHECKING, Any, Generic, Literal, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Generic, Literal, TypeVar, cast, overload
import numpy as np
import numpy.typing as npt
@@ -191,6 +191,72 @@ class AsyncArray(Generic[TArrayMeta]):
object.__setattr__(self, "order", order_parsed)
object.__setattr__(self, "codec_pipeline", create_codec_pipeline(metadata=metadata_parsed))
+ @overload
+ @classmethod
+ async def create(
+ cls,
+ store: StoreLike,
+ *,
+ # v2 and v3
+ shape: ShapeLike,
+ dtype: npt.DTypeLike,
+ zarr_format: Literal[2],
+ fill_value: Any | None = None,
+ attributes: dict[str, JSON] | None = None,
+ # v3 only
+ chunk_shape: ChunkCoords | None = None,
+ chunk_key_encoding: (
+ ChunkKeyEncoding
+ | tuple[Literal["default"], Literal[".", "/"]]
+ | tuple[Literal["v2"], Literal[".", "/"]]
+ | None
+ ) = None,
+ codecs: Iterable[Codec | dict[str, JSON]] | None = None,
+ dimension_names: Iterable[str] | None = None,
+ # v2 only
+ chunks: ShapeLike | None = None,
+ dimension_separator: Literal[".", "/"] | None = None,
+ order: Literal["C", "F"] | None = None,
+ filters: list[dict[str, JSON]] | None = None,
+ compressor: dict[str, JSON] | None = None,
+ # runtime
+ exists_ok: bool = False,
+ data: npt.ArrayLike | None = None,
+ ) -> AsyncArray[ArrayV2Metadata]:...
+
+ @overload
+ @classmethod
+ async def create(
+ cls,
+ store: StoreLike,
+ *,
+ # v2 and v3
+ shape: ShapeLike,
+ dtype: npt.DTypeLike,
+ zarr_format: Literal[3],
+ fill_value: Any | None = None,
+ attributes: dict[str, JSON] | None = None,
+ # v3 only
+ chunk_shape: ChunkCoords | None = None,
+ chunk_key_encoding: (
+ ChunkKeyEncoding
+ | tuple[Literal["default"], Literal[".", "/"]]
+ | tuple[Literal["v2"], Literal[".", "/"]]
+ | None
+ ) = None,
+ codecs: Iterable[Codec | dict[str, JSON]] | None = None,
+ dimension_names: Iterable[str] | None = None,
+ # v2 only
+ chunks: ShapeLike | None = None,
+ dimension_separator: Literal[".", "/"] | None = None,
+ order: Literal["C", "F"] | None = None,
+ filters: list[dict[str, JSON]] | None = None,
+ compressor: dict[str, JSON] | None = None,
+ # runtime
+ exists_ok: bool = False,
+ data: npt.ArrayLike | None = None,
+ ) -> AsyncArray[ArrayV3Metadata]:...
+
@classmethod
async def create(
cls,
@@ -349,7 +415,7 @@ class AsyncArray(Generic[TArrayMeta]):
await array._save_metadata(metadata, ensure_parents=True)
# type inference is inconsistent here and seems to conclude
# that array has type Array[ArrayV2Metadata]
- return array # type: ignore[return-value]
+ return array
@classmethod
async def _create_v2(
@@ -388,7 +454,7 @@ class AsyncArray(Generic[TArrayMeta]):
)
array = cls(metadata=metadata, store_path=store_path)
await array._save_metadata(metadata, ensure_parents=True)
- return array # type: ignore[return-value]
+ return array
@classmethod
def from_dict(
Checking that here:
file: check2.py
from typing import reveal_type
import numpy as np
from zarr.core.array import AsyncArray
from zarr.storage.common import StorePath
from zarr.storage.memory import MemoryStore
async def main() -> None:
store = StorePath(MemoryStore(), "/")
r2 = await AsyncArray._create_v2(store, shape=(0,), dtype=np.dtype("float64"), chunks=(0,))
reveal_type(r2) # v2
r3 = await AsyncArray._create_v3(store, shape=(0,), dtype=np.dtype("float64"), chunk_shape=(0,))
reveal_type(r3) # v3
rr2 = await AsyncArray.create(store, shape=(0,), dtype=np.dtype("float64"), chunks=(0,), zarr_format=2)
reveal_type(rr2) # v2
rr3 = await AsyncArray.create(store, shape=(0,), dtype=np.dtype("float64"), chunks=(0,), zarr_format=3)
reveal_type(rr3) # v3
gives
❯ mypy check2.py
check2.py:13: note: Revealed type is "zarr.core.array.AsyncArray[zarr.core.metadata.v2.ArrayV2Metadata]"
check2.py:16: note: Revealed type is "zarr.core.array.AsyncArray[zarr.core.metadata.v3.ArrayV3Metadata]"
check2.py:19: note: Revealed type is "zarr.core.array.AsyncArray[zarr.core.metadata.v2.ArrayV2Metadata]"
check2.py:22: note: Revealed type is "zarr.core.array.AsyncArray[zarr.core.metadata.v3.ArrayV3Metadata]"
Success: no issues found in 1 source file
That, or define
Probably. I think being able to know that a |
@TomAugspurger thanks for the suggestion to use This is blocked on some very confusing mypy errors. If anyone has ideas I would love to hear them! |
Looks like a mypy limitation at first glance: With this
mypy gives
pylance / pyright get this one correct: python/mypy#15295 looks potentially related. I'd recommend casting the result of |
Mypy errors notwithstanding, does this fix #2269? |
Not quite. One this branch that raises with ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/gh/zarr-developers/zarr-python/src/zarr/core/common.py:142, in parse_shapelike(data)
141 try:
--> 142 data_tuple = tuple(data)
143 except TypeError as e:
TypeError: 'RegularChunkGrid' object is not iterable
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
Cell In[4], line 16
7 data = np.arange(0, 8, dtype="uint16")
8 a = Array.create(
9 store / "simple_v2",
10 zarr_format=2,
(...)
14 fill_value=0,
15 )
---> 16 a.attrs.put({"key": 0})
File ~/gh/zarr-developers/zarr-python/src/zarr/core/attributes.py:53, in Attributes.put(self, d)
39 def put(self, d: dict[str, JSON]) -> None:
40 """
41 Overwrite all attributes with the values from `d`.
42
(...)
51 {'a': 3, 'c': 4}
52 """
---> 53 self._obj = self._obj.update_attributes(d)
File ~/gh/zarr-developers/zarr-python/src/zarr/core/array.py:2478, in Array.update_attributes(self, new_attributes)
2477 def update_attributes(self, new_attributes: dict[str, JSON]) -> Array:
-> 2478 return type(self)(sync(self._async_array.update_attributes(new_attributes)))
File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:91, in sync(coro, loop, timeout)
88 return_result = next(iter(finished)).result()
90 if isinstance(return_result, BaseException):
---> 91 raise return_result
92 else:
93 return return_result
File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:50, in _runner(coro)
45 """
46 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
47 exception, the exception will be returned.
48 """
49 try:
---> 50 return await coro
51 except Exception as ex:
52 return ex
File ~/gh/zarr-developers/zarr-python/src/zarr/core/array.py:913, in AsyncArray.update_attributes(self, new_attributes)
912 async def update_attributes(self, new_attributes: dict[str, JSON]) -> Self:
--> 913 new_metadata = self.metadata.update_attributes(new_attributes)
915 # Write new metadata
916 await self._save_metadata(new_metadata)
File ~/gh/zarr-developers/zarr-python/src/zarr/core/metadata/v2.py:208, in ArrayV2Metadata.update_attributes(self, attributes)
207 def update_attributes(self, attributes: dict[str, JSON]) -> Self:
--> 208 return replace(self, attributes=attributes)
File ~/mambaforge/envs/python=3.12/lib/python3.12/dataclasses.py:1588, in replace(obj, **changes)
1581 changes[f.name] = getattr(obj, f.name)
1583 # Create the new object, which calls __init__() and
1584 # __post_init__() (if defined), using all of the init fields we've
1585 # added and/or left in 'changes'. If there are values supplied in
1586 # changes that aren't fields, this will correctly raise a
1587 # TypeError.
-> 1588 return obj.__class__(**changes)
File ~/gh/zarr-developers/zarr-python/src/zarr/core/metadata/v2.py:72, in ArrayV2Metadata.__init__(self, shape, dtype, chunks, fill_value, order, dimension_separator, compressor, filters, attributes)
70 shape_parsed = parse_shapelike(shape)
71 dtype_parsed = parse_dtype(dtype)
---> 72 chunks_parsed = parse_shapelike(chunks)
73 compressor_parsed = parse_compressor(compressor)
74 order_parsed = parse_indexing_order(order)
File ~/gh/zarr-developers/zarr-python/src/zarr/core/common.py:145, in parse_shapelike(data)
143 except TypeError as e:
144 msg = f"Expected an integer or an iterable of integers. Got {data} instead."
--> 145 raise TypeError(msg) from e
147 if not all(isinstance(v, int) for v in data_tuple):
148 msg = f"Expected an iterable of integers. Got {data} instead."
TypeError: Expected an integer or an iterable of integers. Got RegularChunkGrid(chunk_shape=(4,)) instead. We'll need to update the various @d-v-b this is one of the reasons in our serialization discussion (#2144) I mentioned preferring plain |
…nto refactor/rename-v2-metadata-fields
@TomAugspurger that's a good point! I think there's a significant difference between "parse the inputs to ensure validity" and "parse the inputs to ensure validity, and return a new data type that's not part of the metadata model". By turning the So I altered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think this looks good overall.
else: | ||
metadata = ArrayV3Metadata.from_dict(metadata) | ||
raise ValueError(f"Invalid zarr_format: {zarr_format}. Expected 2 or 3") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#2310 (not yet merged) is adding some special exceptions for parsing invalid metadata, which we could use here if we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 1000% sold on some of the typing changes (particularly the AsyncArray generic) but this is a solid improvement so my questions are non-blocking.
@@ -198,7 +197,7 @@ async def open( | |||
path: str | None = None, | |||
storage_options: dict[str, Any] | None = None, | |||
**kwargs: Any, # TODO: type kwargs as valid args to open_array | |||
) -> AsyncArray | AsyncGroup: | |||
) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata] | AsyncGroup: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This question applies here and everywhere else you use the v2/v3 Metadata types.
Since the generic only includes two variants (ArrayV2Metadata
and ArrayV3Metadata
), what value do we get by using both together here (as apposed to simply using the AsyncArray
type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is purely to silence mypy complaining about a generic missing its type parameter. If there's a cleaner way to annotate it then I'll do it. Would AsyncArray[Any]
work there? (I can play around with this locally)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
playing around with AsyncArray[Any]
wasn't satisfactory. I think the only shortcut we can use here is assigning a type alias to the union AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]
, and I'm not sure that's worth it.
src/zarr/core/array.py
Outdated
TArrayMeta = TypeVar("TArrayMeta", ArrayV2Metadata, ArrayV3Metadata) | ||
|
||
|
||
@dataclass(frozen=True) | ||
class AsyncArray: | ||
metadata: ArrayMetadata | ||
class AsyncArray(Generic[TArrayMeta]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this foreclose the opportunity to use other generics here (namely over dtype)?
xref: #2137 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a good question, I have no idea! It seems like we would want to make the metadata class generic w.r.t dtype, but then have the array class be generic over dtype too... not sure how to solve that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a bit about this, and I don't see why we couldn't make AsyncArray
inherit from Generic[TArrayMeta, TDtype]
.
…/zarr-python into refactor/rename-v2-metadata-fields
The first change is to alters the structure of
ArrayV2Metadata
to ensure that it matches the zarr v2 metadata document. This is simple:data_type
attribute todtype
chunk_grid
attribute tochunks
These changes make it difficult for
ArrayV2Metadata
to inherit fromArrayMetadata
, because that class definesdtype
as a property, which I learned cannot be overridden by an attribute. This observation made me rather skeptical of the utility of theArrayMetadata
class. See #2300 for more on that. So this PR removes the inheritance relationship betweenArrayV2Metadata
/ArrayV3Metadata
andArrayMetadata
.I could then replace the annotation of the
metadata
attribute of theAsyncArray
class withArrayV2Metadata | ArrayV3Metadata
, but this type annotation is not quite accurate, because we know that some operations only result in anAsyncArray
withArrayV3Metadata
, and others only make anAsyncArray
withArrayV2Metadata
(namely,_create_v3
and_create_v2
). So I attempted to solve this by makingAsyncArray
generic with aTypeVar
with 2 bounds,ArrayV2Metadata
andArrayV3Metadata
. This lets us express the invariances I mentioned above. Mypy was unhappy in various places and I plugged those holes withtype: ignore
statements. If there's a cleaner solution I would appreciate it.This will remain a draft until we answer a few questions:
ArrayMetadata
entirely?type: ignores
? maybe I'm doing something wrong with type annotations.AsyncGroup
? To me, it's useful to represent the fact thatAsyncGroup[V3]
can only ever provide access toAsyncArray[V3] | AsyncGroup[V3]
, and similarly for v2. Using generic types seems like the simplest path to this. But maybe nobody else finds this useful, or the noisy type annotations are not worth it.TODO: