
Refactor/rename v2 metadata fields #2301

Merged
merged 15 commits into zarr-developers:v3 on Oct 10, 2024

Conversation


@d-v-b d-v-b commented Oct 4, 2024

The first change alters the structure of ArrayV2Metadata to ensure that it matches the zarr v2 metadata document. This is simple:

  • I changed the data_type attribute to dtype
  • I changed the chunk_grid attribute to chunks

These changes make it difficult for ArrayV2Metadata to inherit from ArrayMetadata, because that class defines dtype as a property, which I learned cannot be overridden by an attribute. This observation made me rather skeptical of the utility of the ArrayMetadata class. See #2300 for more on that. So this PR removes the inheritance relationship between ArrayV2Metadata / ArrayV3Metadata and ArrayMetadata.
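The property-vs-attribute conflict described above can be reproduced in a few lines. This is a minimal, self-contained sketch with stand-in class names (Base and Child are not zarr classes): an instance attribute cannot override a read-only property defined on a parent class, because the inherited property intercepts the assignment.

```python
class Base:
    @property
    def dtype(self) -> str:
        # read-only property on the parent, as in ArrayMetadata
        return "base"


class Child(Base):
    def __init__(self) -> None:
        # raises AttributeError: the inherited property has no setter,
        # so the assignment is rejected before an instance attribute exists
        self.dtype = "child"
```

Static checkers reject the override for the same reason at annotation time, which is why dropping the inheritance relationship sidesteps the problem entirely.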

I could then replace the annotation of the metadata attribute of the AsyncArray class with ArrayV2Metadata | ArrayV3Metadata, but this type annotation is not quite accurate, because we know that some operations only result in an AsyncArray with ArrayV3Metadata, and others only make an AsyncArray with ArrayV2Metadata (namely, _create_v3 and _create_v2). So I attempted to solve this by making AsyncArray generic over a TypeVar constrained to ArrayV2Metadata and ArrayV3Metadata. This lets us express the invariants I mentioned above. Mypy was unhappy in various places and I plugged those holes with type: ignore statements. If there's a cleaner solution I would appreciate it.

This will remain a draft until we answer a few questions:

  • should we remove ArrayMetadata entirely?
  • can we make mypy happy without the type: ignores? maybe I'm doing something wrong with type annotations.
  • should we apply the same generic approach to AsyncGroup? To me, it's useful to represent the fact that AsyncGroup[V3] can only ever provide access to AsyncArray[V3] | AsyncGroup[V3], and similarly for v2. Using generic types seems like the simplest path to this. But maybe nobody else finds this useful, or the noisy type annotations are not worth it.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b d-v-b requested a review from TomAugspurger October 4, 2024 02:57
@@ -344,7 +347,9 @@ async def _create_v3(

array = cls(metadata=metadata, store_path=store_path)
await array._save_metadata(metadata, ensure_parents=True)
return array
# type inference is inconsistent here and seems to conclude
Contributor


I thought the issue might be from AsyncArray.metadata being set to metadata_parsed = parse_array_metadata(metadata), which doesn't know that parse_array_metadata(ArrayV2Metadata) should be ArrayV2Metadata. I thought that adding an overload would fix that:

@overload
def parse_array_metadata(data: ArrayV2Metadata) -> ArrayV2Metadata: ...


@overload
def parse_array_metadata(data: ArrayV3Metadata) -> ArrayV3Metadata: ...


@overload
def parse_array_metadata(data: dict[str, JSON]) -> ArrayV2Metadata | ArrayV3Metadata: ...


def parse_array_metadata(
    data: ArrayV2Metadata | ArrayV3Metadata | dict[str, JSON],
) -> ArrayV2Metadata | ArrayV3Metadata:

But no luck. I'll take a longer look later.


@TomAugspurger TomAugspurger Oct 4, 2024


Here's a simple example that works:

from __future__ import annotations

import dataclasses
from typing import Generic, Literal, TypeVar, overload, reveal_type


class A: ...

class B: ...

T = TypeVar("T", A, B)

@dataclasses.dataclass(frozen=True)
class Array(Generic[T]):
    version: T

    def __init__(self, version: A | B) -> None:
        object.__setattr__(self, "version", version)

    @overload
    @classmethod
    def create(cls, version: Literal[2]) -> Array[A]: ...

    @overload
    @classmethod
    def create(cls, version: Literal[3]) -> Array[B]: ...
 
    @overload
    @classmethod
    def create(cls, version: Literal[2, 3]) -> Array[A] | Array[B]: ...

    @classmethod
    def create(cls, version: Literal[2, 3]) -> Array[A] | Array[B]:
        if version == 2:
            return cls.create_2()
        elif version == 3:
            return cls.create_3()
        else:
            raise ValueError(f"Invalid version: {version}")

    @classmethod
    def create_2(cls) -> Array[A]:
        return Array(version=A())

    @classmethod
    def create_3(cls) -> Array[B]:
        return Array(version=B())

reveal_type(Array(A()))
reveal_type(Array(B()))

reveal_type(Array.create_2())
reveal_type(Array.create_3())

reveal_type(Array.create(version=2))
reveal_type(Array.create(version=3))

That reveals the expected types everywhere:

❯ mypy check.py
check.py:49: note: Revealed type is "check.Array[check.A]"
check.py:50: note: Revealed type is "check.Array[check.A]"
check.py:52: note: Revealed type is "check.Array[check.A]"
check.py:53: note: Revealed type is "check.Array[check.B]"
check.py:55: note: Revealed type is "check.Array[check.A]"
check.py:56: note: Revealed type is "check.Array[check.B]"
Success: no issues found in 1 source file

I've tried making similar changes on your branch, but haven't gotten it working yet. I think the things that might matter are:

  1. We might need overloads for AsyncArray.create to ensure that AsyncArray.create(..., zarr_format=3) returns an AsyncArray[v3]. Likewise for 2.
  2. AsyncArray.__init__ also accepts a dict for metadata. The actual value set is the output of parse_array_metadata. We need to ensure that the specialized type isn't lost (which the overload should do, but maybe I messed that up).


@TomAugspurger TomAugspurger Oct 4, 2024


I'm not sure what I messed up earlier, but these overloads on create seemed to do the trick (the only differences in the signature are zarr_format being Literal[2] or Literal[3] rather than Literal[2, 3], and the return type being specialized to v2 or v3):

diff --git a/src/zarr/core/array.py b/src/zarr/core/array.py
index 96431dec..5699935f 100644
--- a/src/zarr/core/array.py
+++ b/src/zarr/core/array.py
@@ -4,7 +4,7 @@ import json
 from asyncio import gather
 from dataclasses import dataclass, field, replace
 from logging import getLogger
-from typing import TYPE_CHECKING, Any, Generic, Literal, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Generic, Literal, TypeVar, cast, overload
 
 import numpy as np
 import numpy.typing as npt
@@ -191,6 +191,72 @@ class AsyncArray(Generic[TArrayMeta]):
         object.__setattr__(self, "order", order_parsed)
         object.__setattr__(self, "codec_pipeline", create_codec_pipeline(metadata=metadata_parsed))
 
+    @overload
+    @classmethod
+    async def create(
+        cls,
+        store: StoreLike,
+        *,
+        # v2 and v3
+        shape: ShapeLike,
+        dtype: npt.DTypeLike,
+        zarr_format: Literal[2],
+        fill_value: Any | None = None,
+        attributes: dict[str, JSON] | None = None,
+        # v3 only
+        chunk_shape: ChunkCoords | None = None,
+        chunk_key_encoding: (
+            ChunkKeyEncoding
+            | tuple[Literal["default"], Literal[".", "/"]]
+            | tuple[Literal["v2"], Literal[".", "/"]]
+            | None
+        ) = None,
+        codecs: Iterable[Codec | dict[str, JSON]] | None = None,
+        dimension_names: Iterable[str] | None = None,
+        # v2 only
+        chunks: ShapeLike | None = None,
+        dimension_separator: Literal[".", "/"] | None = None,
+        order: Literal["C", "F"] | None = None,
+        filters: list[dict[str, JSON]] | None = None,
+        compressor: dict[str, JSON] | None = None,
+        # runtime
+        exists_ok: bool = False,
+        data: npt.ArrayLike | None = None,
+    ) -> AsyncArray[ArrayV2Metadata]:...
+
+    @overload
+    @classmethod
+    async def create(
+        cls,
+        store: StoreLike,
+        *,
+        # v2 and v3
+        shape: ShapeLike,
+        dtype: npt.DTypeLike,
+        zarr_format: Literal[3],
+        fill_value: Any | None = None,
+        attributes: dict[str, JSON] | None = None,
+        # v3 only
+        chunk_shape: ChunkCoords | None = None,
+        chunk_key_encoding: (
+            ChunkKeyEncoding
+            | tuple[Literal["default"], Literal[".", "/"]]
+            | tuple[Literal["v2"], Literal[".", "/"]]
+            | None
+        ) = None,
+        codecs: Iterable[Codec | dict[str, JSON]] | None = None,
+        dimension_names: Iterable[str] | None = None,
+        # v2 only
+        chunks: ShapeLike | None = None,
+        dimension_separator: Literal[".", "/"] | None = None,
+        order: Literal["C", "F"] | None = None,
+        filters: list[dict[str, JSON]] | None = None,
+        compressor: dict[str, JSON] | None = None,
+        # runtime
+        exists_ok: bool = False,
+        data: npt.ArrayLike | None = None,
+    ) -> AsyncArray[ArrayV3Metadata]:...
+
     @classmethod
     async def create(
         cls,
@@ -349,7 +415,7 @@ class AsyncArray(Generic[TArrayMeta]):
         await array._save_metadata(metadata, ensure_parents=True)
         # type inference is inconsistent here and seems to conclude
         # that array has type Array[ArrayV2Metadata]
-        return array  # type: ignore[return-value]
+        return array
 
     @classmethod
     async def _create_v2(
@@ -388,7 +454,7 @@ class AsyncArray(Generic[TArrayMeta]):
         )
         array = cls(metadata=metadata, store_path=store_path)
         await array._save_metadata(metadata, ensure_parents=True)
-        return array  # type: ignore[return-value]
+        return array
 
     @classmethod
     def from_dict(

Checking that here:

file: check2.py
from typing import reveal_type

import numpy as np

from zarr.core.array import AsyncArray
from zarr.storage.common import StorePath
from zarr.storage.memory import MemoryStore


async def main() -> None:
    store = StorePath(MemoryStore(), "/")
    r2 = await AsyncArray._create_v2(store, shape=(0,), dtype=np.dtype("float64"), chunks=(0,))
    reveal_type(r2)  # v2

    r3 = await AsyncArray._create_v3(store, shape=(0,), dtype=np.dtype("float64"), chunk_shape=(0,))
    reveal_type(r3)  # v3

    rr2 = await AsyncArray.create(store, shape=(0,), dtype=np.dtype("float64"), chunks=(0,), zarr_format=2)
    reveal_type(rr2)  # v2

    rr3 = await AsyncArray.create(store, shape=(0,), dtype=np.dtype("float64"), chunks=(0,), zarr_format=3)
    reveal_type(rr3)  # v3

gives

❯ mypy check2.py
check2.py:13: note: Revealed type is "zarr.core.array.AsyncArray[zarr.core.metadata.v2.ArrayV2Metadata]"
check2.py:16: note: Revealed type is "zarr.core.array.AsyncArray[zarr.core.metadata.v3.ArrayV3Metadata]"
check2.py:19: note: Revealed type is "zarr.core.array.AsyncArray[zarr.core.metadata.v2.ArrayV2Metadata]"
check2.py:22: note: Revealed type is "zarr.core.array.AsyncArray[zarr.core.metadata.v3.ArrayV3Metadata]"
Success: no issues found in 1 source file

@TomAugspurger

should we remove ArrayMetadata entirely?

That, or define ArrayMetadata = ArrayV2Metadata | ArrayV3Metadata for the few places where we write out that union.

should we apply the same generic approach to AsyncGroup? To me, it's useful to represent the fact that AsyncGroup[V3] can only ever provide access to AsyncArray[V3] | AsyncGroup[V3], and similarly for v2. Using generic types seems like the simplest path to this. But maybe nobody else finds this useful, or the noisy type annotations are not worth it.

Probably. I think being able to know that a Group[V3].getitem returns Group[V3] | Array[V3] and not Group[T] | Array[T] is useful.
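The generic-AsyncGroup idea can be sketched in miniature. Everything below is illustrative, not zarr's actual API (the classes are empty stand-ins and `getitem` is simplified to a dict lookup); the point is that parameterizing the group over the metadata type lets a checker conclude that members of a v3 group are themselves v3.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Generic, TypeVar


class ArrayV2Metadata: ...


class ArrayV3Metadata: ...


# constrained TypeVar, as in this PR: exactly the two metadata variants
T = TypeVar("T", ArrayV2Metadata, ArrayV3Metadata)


@dataclass
class AsyncArray(Generic[T]):
    metadata: T


@dataclass
class AsyncGroup(Generic[T]):
    members: dict[str, AsyncArray[T] | AsyncGroup[T]] = field(default_factory=dict)

    def getitem(self, key: str) -> AsyncArray[T] | AsyncGroup[T]:
        # the return type is tied to the group's own parameter,
        # so AsyncGroup[ArrayV3Metadata].getitem yields v3 members only
        return self.members[key]


group: AsyncGroup[ArrayV3Metadata] = AsyncGroup({"a": AsyncArray(ArrayV3Metadata())})
member = group.getitem("a")
```

Running reveal_type(member) under mypy on a sketch like this reports the union specialized to ArrayV3Metadata rather than Group[T] | Array[T].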

@d-v-b d-v-b marked this pull request as ready for review October 8, 2024 17:09

d-v-b commented Oct 8, 2024

@TomAugspurger thanks for the suggestion to use @overload to fix the type inference for AsyncArray.create. A few other changes were necessary, namely defining minimal TypedDict instances for the dict form of ArrayV2Metadata and ArrayV3Metadata. With these dict classes we can declare that creating an AsyncArray from dicts like {'zarr_format': 2, ...} will produce AsyncArray[ArrayV2Metadata]. These classes will be replaced by the result of #2099.
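The "minimal TypedDict" approach can be sketched as follows. The field sets here are hypothetical placeholders (the real dicts carry many more keys); what matters is that zarr_format is typed as a Literal, giving overloads a discriminant to dispatch on.

```python
from typing import Literal, TypedDict


class ArrayV2MetadataDict(TypedDict):
    # Literal[2] is the discriminant: a dict typed this way can only
    # produce AsyncArray[ArrayV2Metadata] via a matching overload
    zarr_format: Literal[2]
    # ... remaining v2 keys elided in this sketch


class ArrayV3MetadataDict(TypedDict):
    zarr_format: Literal[3]
    # ... remaining v3 keys elided in this sketch


d: ArrayV2MetadataDict = {"zarr_format": 2}
```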

This is blocked on some very confusing mypy errors. If anyone has ideas I would love to hear them!


TomAugspurger commented Oct 8, 2024

Looks like a mypy limitation at first glance:

With this

        resized = sync(self._async_array.resize(new_shape))
        reveal_type(resized)

mypy gives

src/zarr/core/array.py:2475: note: Revealed type is "builtins.object"

pylance / pyright get this one correct: Type of "resized" is "AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]".

python/mypy#15295 looks potentially related. I'd recommend casting the result of sync(self._async_array.resize(new_shape)) to AsyncArray to help mypy out.
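The recommended workaround looks like the sketch below. The sync() here is a trivial stand-in for zarr's, not its real implementation; the point is that typing.cast overrides mypy's collapsed `object` inference with no runtime cost.

```python
from typing import Any, cast


class AsyncArray: ...


def sync(value: Any) -> object:
    # stand-in: the real sync() runs a coroutine to completion, and mypy
    # infers `object` for its result in the case discussed above
    return value


# the cast tells mypy the concrete type; at runtime it returns its argument
resized = cast(AsyncArray, sync(AsyncArray()))
```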


rabernat commented Oct 8, 2024

Mypy errors notwithstanding, does this fix #2269?

@TomAugspurger

Not quite. On this branch, that raises with:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/gh/zarr-developers/zarr-python/src/zarr/core/common.py:142, in parse_shapelike(data)
    141 try:
--> 142     data_tuple = tuple(data)
    143 except TypeError as e:

TypeError: 'RegularChunkGrid' object is not iterable

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[4], line 16
      7 data = np.arange(0, 8, dtype="uint16")
      8 a = Array.create(
      9     store / "simple_v2",
     10     zarr_format=2,
   (...)
     14     fill_value=0,
     15 )
---> 16 a.attrs.put({"key": 0})

File ~/gh/zarr-developers/zarr-python/src/zarr/core/attributes.py:53, in Attributes.put(self, d)
     39 def put(self, d: dict[str, JSON]) -> None:
     40     """
     41     Overwrite all attributes with the values from `d`.
     42
   (...)
     51        {'a': 3, 'c': 4}
     52     """
---> 53     self._obj = self._obj.update_attributes(d)

File ~/gh/zarr-developers/zarr-python/src/zarr/core/array.py:2478, in Array.update_attributes(self, new_attributes)
   2477 def update_attributes(self, new_attributes: dict[str, JSON]) -> Array:
-> 2478     return type(self)(sync(self._async_array.update_attributes(new_attributes)))

File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:91, in sync(coro, loop, timeout)
     88 return_result = next(iter(finished)).result()
     90 if isinstance(return_result, BaseException):
---> 91     raise return_result
     92 else:
     93     return return_result

File ~/gh/zarr-developers/zarr-python/src/zarr/core/sync.py:50, in _runner(coro)
     45 """
     46 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
     47 exception, the exception will be returned.
     48 """
     49 try:
---> 50     return await coro
     51 except Exception as ex:
     52     return ex

File ~/gh/zarr-developers/zarr-python/src/zarr/core/array.py:913, in AsyncArray.update_attributes(self, new_attributes)
    912 async def update_attributes(self, new_attributes: dict[str, JSON]) -> Self:
--> 913     new_metadata = self.metadata.update_attributes(new_attributes)
    915     # Write new metadata
    916     await self._save_metadata(new_metadata)

File ~/gh/zarr-developers/zarr-python/src/zarr/core/metadata/v2.py:208, in ArrayV2Metadata.update_attributes(self, attributes)
    207 def update_attributes(self, attributes: dict[str, JSON]) -> Self:
--> 208     return replace(self, attributes=attributes)

File ~/mambaforge/envs/python=3.12/lib/python3.12/dataclasses.py:1588, in replace(obj, **changes)
   1581         changes[f.name] = getattr(obj, f.name)
   1583 # Create the new object, which calls __init__() and
   1584 # __post_init__() (if defined), using all of the init fields we've
   1585 # added and/or left in 'changes'.  If there are values supplied in
   1586 # changes that aren't fields, this will correctly raise a
   1587 # TypeError.
-> 1588 return obj.__class__(**changes)

File ~/gh/zarr-developers/zarr-python/src/zarr/core/metadata/v2.py:72, in ArrayV2Metadata.__init__(self, shape, dtype, chunks, fill_value, order, dimension_separator, compressor, filters, attributes)
     70 shape_parsed = parse_shapelike(shape)
     71 dtype_parsed = parse_dtype(dtype)
---> 72 chunks_parsed = parse_shapelike(chunks)
     73 compressor_parsed = parse_compressor(compressor)
     74 order_parsed = parse_indexing_order(order)

File ~/gh/zarr-developers/zarr-python/src/zarr/core/common.py:145, in parse_shapelike(data)
    143 except TypeError as e:
    144     msg = f"Expected an integer or an iterable of integers. Got {data} instead."
--> 145     raise TypeError(msg) from e
    147 if not all(isinstance(v, int) for v in data_tuple):
    148     msg = f"Expected an iterable of integers. Got {data} instead."

TypeError: Expected an integer or an iterable of integers. Got RegularChunkGrid(chunk_shape=(4,)) instead.

We'll need to update the various parse_* functions to gracefully handle the already parsed versions.

@d-v-b this is one of the reasons, in our serialization discussion (#2144), that I mentioned preferring plain __init__ methods that simply set the attributes (plus a tiny bit of validation that type checkers can't catch, like ensuring that the lengths of chunks and shape match). Because __init__ always runs whenever you create an instance, I like to keep it simple; "complicated" stuff, like parsing tuples of ints into their concrete form, can be done in alternative constructors, on the boundary. But it does open up the risk of people using the class constructor incorrectly and later hitting an error or incorrect behavior (maybe, if they have code that relies on that, but that's just what you get with duck typing).
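The constructor split described above can be sketched like this (names are illustrative, not zarr's): __post_init__ does only the cheap cross-field validation a type checker cannot express, while an alternative constructor on the boundary does the coercion.

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Meta:
    shape: tuple[int, ...]
    chunks: tuple[int, ...]

    def __post_init__(self) -> None:
        # the tiny bit of validation type checkers can't catch
        if len(self.shape) != len(self.chunks):
            raise ValueError("shape and chunks must have the same length")

    @classmethod
    def from_raw(cls, shape: Iterable[int], chunks: Iterable[int]) -> "Meta":
        # "complicated" parsing happens here, on the boundary, not in __init__
        return cls(shape=tuple(shape), chunks=tuple(chunks))


m = Meta.from_raw(shape=[8, 8], chunks=[4, 4])
```

Because __init__ never transforms its inputs, dataclasses.replace (which re-invokes __init__ with the stored field values) round-trips cleanly, avoiding the RegularChunkGrid failure shown in the traceback above.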


d-v-b commented Oct 9, 2024

@TomAugspurger that's a good point! I think there's a significant difference between "parse the inputs to ensure validity" and "parse the inputs to ensure validity, and return a new data type that's not part of the metadata model". By turning the chunks attribute from a JSON array of numbers to a RegularChunkGrid object we were doing the second one, when I think we should always do the first one.

So I altered ArrayV2Metadata such that ArrayV2Metadata.chunks is a plain tuple of ints, and ArrayV2Metadata.chunk_grid is a cached property that returns a RegularChunkGrid. I also added some casts to deal with the type inference issues. I checked the test case from #2269 and it passed with these changes.
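A minimal sketch of that change, with a stand-in RegularChunkGrid: chunks stays a plain tuple of ints on the metadata, and chunk_grid becomes a lazily computed, cached view over it.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass(frozen=True)
class RegularChunkGrid:
    chunk_shape: tuple[int, ...]


@dataclass(frozen=True)
class ArrayV2Metadata:
    # stored exactly as it appears in the v2 metadata document
    chunks: tuple[int, ...]

    @cached_property
    def chunk_grid(self) -> RegularChunkGrid:
        # cached_property writes to the instance __dict__ directly, so it
        # works on a frozen dataclass (as long as __slots__ is not used)
        return RegularChunkGrid(chunk_shape=self.chunks)


meta = ArrayV2Metadata(chunks=(4,))
```

With this shape, dataclasses.replace(meta, attributes=...) reconstructs the instance from the plain tuple rather than a derived object, which is what fixes the #2269 failure mode.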

@d-v-b d-v-b requested a review from TomAugspurger October 9, 2024 14:45
@TomAugspurger TomAugspurger left a comment


Thanks, I think this looks good overall.

else:
metadata = ArrayV3Metadata.from_dict(metadata)
raise ValueError(f"Invalid zarr_format: {zarr_format}. Expected 2 or 3")

#2310 (not yet merged) is adding some special exceptions for parsing invalid metadata, which we could use here if we want.

@TomAugspurger TomAugspurger added the downstream Downstream libraries using zarr label Oct 9, 2024

@jhamman jhamman left a comment


I'm not 1000% sold on some of the typing changes (particularly the AsyncArray generic) but this is a solid improvement so my questions are non-blocking.

@@ -198,7 +197,7 @@ async def open(
path: str | None = None,
storage_options: dict[str, Any] | None = None,
**kwargs: Any, # TODO: type kwargs as valid args to open_array
) -> AsyncArray | AsyncGroup:
) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata] | AsyncGroup:


This question applies here and everywhere else you use the v2/v3 Metadata types.

Since the generic only includes two variants (ArrayV2Metadata and ArrayV3Metadata), what value do we get by using both together here (as opposed to simply using the AsyncArray type)?

Contributor Author


that is purely to silence mypy complaining about a generic missing its type parameter. If there's a cleaner way to annotate it then I'll do it. Would AsyncArray[Any] work there? (I can play around with this locally)

Contributor Author


playing around with AsyncArray[Any] wasn't satisfactory. I think the only shortcut we can use here is assigning a type alias to the union AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata], and I'm not sure that's worth it.

Comment on lines 169 to 173
TArrayMeta = TypeVar("TArrayMeta", ArrayV2Metadata, ArrayV3Metadata)


@dataclass(frozen=True)
class AsyncArray:
metadata: ArrayMetadata
class AsyncArray(Generic[TArrayMeta]):
Member


Does this foreclose the opportunity to use other generics here (namely over dtype)?

xref: #2137 (comment)

Contributor Author


that's a good question, I have no idea! It seems like we would want to make the metadata class generic w.r.t dtype, but then have the array class be generic over dtype too... not sure how to solve that

Contributor Author


I thought a bit about this, and I don't see why we couldn't make AsyncArray inherit from Generic[TArrayMeta, TDtype].
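A hypothetical sketch of that two-parameter generic, combining the constrained metadata TypeVar from this PR with a second, unconstrained TypeVar for dtype (all names here are illustrative):

```python
from dataclasses import dataclass
from typing import Generic, TypeVar


class ArrayV2Metadata: ...


class ArrayV3Metadata: ...


TArrayMeta = TypeVar("TArrayMeta", ArrayV2Metadata, ArrayV3Metadata)
TDtype = TypeVar("TDtype")


@dataclass(frozen=True)
class AsyncArray(Generic[TArrayMeta, TDtype]):
    # both parameters coexist; specializing one does not foreclose the other
    metadata: TArrayMeta
    dtype: type[TDtype]


arr: "AsyncArray[ArrayV2Metadata, float]" = AsyncArray(ArrayV2Metadata(), float)
```

The cost is that every annotation now carries two parameters, which is the "noisy type annotations" trade-off raised earlier in the thread.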

@jhamman jhamman added this to the 3.0.0.beta milestone Oct 9, 2024
@jhamman jhamman added the V3 label Oct 9, 2024
@jhamman jhamman merged commit 6b11bb8 into zarr-developers:v3 Oct 10, 2024
20 checks passed
@d-v-b d-v-b deleted the refactor/rename-v2-metadata-fields branch November 6, 2024 11:28