This repository has been archived by the owner on Nov 12, 2024. It is now read-only.
Write virtual references to Icechunk #1
Closed
Commits (43 total; changes shown from 26 commits):
7b00e41 (TomNicholas): move vds_with_manifest_arrays fixture up
c82221c (TomNicholas): sketch implementation
d29362b (TomNicholas): test that we can create an icechunk store
2aa3cb5 (TomNicholas): fixture to create icechunk filestore in temporary directory
f2c095c (TomNicholas): get the async fixture working properly
6abe32d (TomNicholas): split into more functions
93080b3 (TomNicholas): change mode
bebf370 (TomNicholas): try creating zarr group and arrays explicitly
833e5f0 (TomNicholas): create root group from store
9853140 (TomNicholas): todos
030a96e (TomNicholas): do away with the async pytest fixtures/functions
90ca5cf (TomNicholas): successfully writes root group attrs
b138dde (TomNicholas): check array metadata is correct
6631102 (TomNicholas): try to write array attributes
e92b56c (TomNicholas): sketch test for checking virtual references have been set correctly
2c8c0ee (TomNicholas): test setting single virtual ref
a2ce1ed (TomNicholas): use async properly
9393995 (TomNicholas): better separation of handling of loadable variables
956e324 (TomNicholas): fix chunk key format
2d7d5f6 (TomNicholas): use require_array
8726e23 (TomNicholas): check that store supports writes
387f345 (TomNicholas): removed outdated note about awaiting
b2a0700 (TomNicholas): fix incorrect chunk key in test
4ffb55e (TomNicholas): absolute path
f929fcb (TomNicholas): convert to file URI before handing to icechunk
e9c1287 (TomNicholas): test that without encoding we can definitely read one chunk
2fe3012 (mpiannucci): Work on encoding test
33d8ce8 (mpiannucci): Merge remote-tracking branch 'origin/icechunk' into matt/icechunk-enc…
8aa6034 (mpiannucci): Update test to match
aa2d415 (mpiannucci): Quick comment
7e4e2ce (mpiannucci): more comprehensive
9a03245 (mpiannucci): add attrtirbute encoding
9676485 (TomNicholas): Merge pull request #2 from earth-mover/matt/icechunk-encoding
bbaf3ba (mpiannucci): Fix array dimensions
31945cd (mpiannucci): Merge pull request #3 from earth-mover/matt/array-dims
49daa7e (mpiannucci): Fix v3 codec pipeline
756ff92 (mpiannucci): Put xarray dep back
8c7242e (mpiannucci): Handle codecs, but get bad results
666b676 (mpiannucci): Gzip an d zlib are not directly working
9076ad7 (mpiannucci): Get up working with numcodecs zarr 3 codecs
7a160fd (mpiannucci): Update codec pipeline
286a9b5 (mpiannucci): Merge pull request #4 from earth-mover/matt/v3-codecs
8f1f96e (mpiannucci): oUdpate to latest icechunk using sync api
@@ -0,0 +1,27 @@

```python
import numpy as np
import pytest
from xarray import Dataset
from xarray.core.variable import Variable

from virtualizarr.manifests import ChunkManifest, ManifestArray


@pytest.fixture
def vds_with_manifest_arrays() -> Dataset:
    arr = ManifestArray(
        chunkmanifest=ChunkManifest(
            entries={"0.0": dict(path="/test.nc", offset=6144, length=48)}
        ),
        zarray=dict(
            shape=(2, 3),
            dtype=np.dtype("<i8"),
            chunks=(2, 3),
            compressor={"id": "zlib", "level": 1},
            filters=None,
            fill_value=0,
            order="C",
            zarr_format=3,
        ),
    )
    var = Variable(dims=["x", "y"], data=arr, attrs={"units": "km"})
    return Dataset({"a": var}, attrs={"something": 0})
```
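Each manifest entry above is just a (path, offset, length) triple pointing at a byte range in an existing file; note that length=48 is exactly the 2x3 array of int64 the zarray describes. The following stdlib-only sketch illustrates what a reader ultimately does with such an entry: seek to the offset and read that many raw bytes. The file layout here is fabricated for illustration, not taken from test.nc.

```python
import os
import struct
import tempfile


def read_chunk(path: str, offset: int, length: int) -> bytes:
    """Read the raw byte range a manifest entry points at."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# A tiny stand-in file: six little-endian int64s (48 bytes) at offset 0.
# This is a hypothetical layout, chosen only to mirror the 2x3 <i8 chunk above.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("<6q", *range(6)))
    tmp = f.name

raw = read_chunk(tmp, offset=0, length=48)
print(struct.unpack("<6q", raw))  # (0, 1, 2, 3, 4, 5)
os.unlink(tmp)
```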
@@ -0,0 +1,171 @@

```python
import asyncio
from pathlib import Path
from typing import TYPE_CHECKING

import pytest

pytest.importorskip("icechunk")

import numpy as np
import numpy.testing as npt
from xarray import Dataset, open_dataset
from xarray.core.variable import Variable
from zarr import Array, Group, group

from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.writers.icechunk import dataset_to_icechunk
from virtualizarr.zarr import ZArray

if TYPE_CHECKING:
    from icechunk import IcechunkStore


@pytest.fixture
def icechunk_filestore(tmpdir) -> "IcechunkStore":
    from icechunk import IcechunkStore, StorageConfig

    storage = StorageConfig.filesystem(str(tmpdir))

    # TODO if icechunk exposed a synchronous version of .open then we wouldn't need to use asyncio.run here
    # TODO is this the correct mode to use?
    store = asyncio.run(IcechunkStore.open(storage=storage, mode="r+"))

    # TODO instead yield store then store.close() ??
    return store
```
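The `asyncio.run(...)` call in the fixture is a general pattern for driving an async-only API from synchronous code. A minimal stdlib-only sketch of that pattern, with a stand-in coroutine in place of the real `IcechunkStore.open` (the stand-in's name and return value are hypothetical):

```python
import asyncio


async def open_store_async(path: str) -> dict:
    # Stand-in for an async-only constructor such as IcechunkStore.open;
    # a real implementation would perform async I/O here.
    await asyncio.sleep(0)
    return {"path": path, "mode": "r+"}


def open_store(path: str) -> dict:
    # Synchronous facade: asyncio.run spins up a fresh event loop,
    # drives the coroutine to completion, and returns its result.
    return asyncio.run(open_store_async(path))


store = open_store("/tmp/store")
print(store["mode"])  # r+
```

Note that `asyncio.run` cannot be called from inside an already-running event loop, which is one reason the fixture's TODO asks for a native synchronous `open`.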
```python
class TestWriteVirtualRefs:
    def test_write_new_virtual_variable(
        self, icechunk_filestore: "IcechunkStore", vds_with_manifest_arrays: Dataset
    ):
        vds = vds_with_manifest_arrays

        dataset_to_icechunk(vds, icechunk_filestore)

        # check attrs
        root_group = group(store=icechunk_filestore)
        assert isinstance(root_group, Group)
        assert root_group.attrs == {"something": 0}

        # TODO check against vds, then perhaps parametrize?

        # check array exists
        assert "a" in root_group
        arr = root_group["a"]
        assert isinstance(arr, Array)

        # check array metadata
        # TODO why doesn't a .zarr_format or .version attribute exist on zarr.Array?
        # assert arr.zarr_format == 3
        assert arr.shape == (2, 3)
        assert arr.chunks == (2, 3)
        assert arr.dtype == np.dtype("<i8")
        assert arr.order == "C"
        assert arr.fill_value == 0
        # TODO check compressor, filters?

        # check array attrs
        # TODO somehow this is broken by setting the dimension names???
        # assert dict(arr.attrs) == {"units": "km"}

        # check dimensions
        assert arr.attrs["DIMENSION_NAMES"] == ["x", "y"]

    def test_set_single_virtual_ref_without_encoding(
        self, icechunk_filestore: "IcechunkStore", simple_netcdf4: Path
    ):
        # TODO kerchunk doesn't work with zarr-python v3 yet so we can't use open_virtual_dataset and icechunk together!
        # vds = open_virtual_dataset(netcdf4_file, indexes={})

        # instead for now just write out byte ranges explicitly
        manifest = ChunkManifest(
            {"0.0": {"path": simple_netcdf4, "offset": 6144, "length": 48}}
        )
        zarray = ZArray(
            shape=(3, 4),
            chunks=(3, 4),
            dtype=np.dtype("int32"),
            compressor=None,
            filters=None,
            fill_value=None,
        )
        ma = ManifestArray(
            chunkmanifest=manifest,
            zarray=zarray,
        )
        foo = Variable(data=ma, dims=["x", "y"])
        vds = Dataset(
            {"foo": foo},
        )

        dataset_to_icechunk(vds, icechunk_filestore)

        root_group = group(store=icechunk_filestore)
        array = root_group["foo"]

        # check chunk references
        # TODO we can't explicitly check that the path/offset/length is correct because icechunk doesn't yet expose any get_virtual_refs method

        expected_ds = open_dataset(simple_netcdf4)
        expected_array = expected_ds["foo"].to_numpy()
        npt.assert_equal(array, expected_array)

        # note: we don't need to test that committing works, because now we have confirmed
        # the refs are in the store (even uncommitted) it's icechunk's problem to manage them now.
```
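The commit history above includes "absolute path" and "convert to file URI before handing to icechunk", suggesting local paths like the `simple_netcdf4` fixture must become `file://` URIs at some point. A stdlib-only sketch of that conversion (the helper name `to_file_uri` is my own, not from the PR):

```python
from pathlib import PurePosixPath


def to_file_uri(path: str) -> str:
    """Convert an absolute POSIX path to a file:// URI.

    Relative paths are rejected because a URI must be absolute;
    this mirrors why the PR first resolves to an absolute path.
    """
    p = PurePosixPath(path)
    if not p.is_absolute():
        raise ValueError(f"expected an absolute path, got {path!r}")
    return p.as_uri()


print(to_file_uri("/test.nc"))  # file:///test.nc
```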
```python
    @pytest.mark.xfail(reason="Test doesn't account for scale factor encoding yet")
    def test_set_single_virtual_ref_with_encoding(
        self, icechunk_filestore: "IcechunkStore", netcdf4_file: Path
    ):
        # TODO kerchunk doesn't work with zarr-python v3 yet so we can't use open_virtual_dataset and icechunk together!
        # vds = open_virtual_dataset(netcdf4_file, indexes={})

        # instead for now just write out byte ranges explicitly
        manifest = ChunkManifest(
            {"0.0.0": {"path": netcdf4_file, "offset": 15419, "length": 7738000}}
        )
        zarray = ZArray(
            shape=(2920, 25, 53),
            chunks=(2920, 25, 53),
            dtype=np.dtype("int16"),
            compressor=None,
            filters=None,
            fill_value=None,
        )
        ma = ManifestArray(
            chunkmanifest=manifest,
            zarray=zarray,
        )
        air = Variable(data=ma, dims=["time", "lat", "lon"])
        vds = Dataset(
            {"air": air},
        )

        dataset_to_icechunk(vds, icechunk_filestore)

        root_group = group(store=icechunk_filestore)
        air_array = root_group["air"]
        print(air_array)

        # check chunk references
        # TODO we can't explicitly check that the path/offset/length is correct because icechunk doesn't yet expose any get_virtual_refs method

        expected_ds = open_dataset(netcdf4_file)
        expected_air_array = expected_ds["air"].to_numpy()
        npt.assert_equal(air_array, expected_air_array)

        # note: we don't need to test that committing works, because now we have confirmed
        # the refs are in the store (even uncommitted) it's icechunk's problem to manage them now.
```
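The xfail reason above is that the raw int16 bytes in the file are packed with a scale factor, so comparing them to `open_dataset`'s decoded output fails. Under CF conventions that decoding is `decoded = raw * scale_factor + add_offset`. A minimal sketch of the unpacking step; the numeric values are illustrative and not taken from the test file:

```python
def decode_cf(raw_values, scale_factor=1.0, add_offset=0.0):
    """Apply CF-style unpacking: decoded = raw * scale_factor + add_offset."""
    return [v * scale_factor + add_offset for v in raw_values]


# Hypothetical int16 raw values packed with scale 0.5 and offset 300.0
print(decode_cf([2941, 2942], scale_factor=0.5, add_offset=300.0))  # [1770.5, 1771.0]
```

A passing version of the test would either apply this decoding to the expected values or compare against the raw (undecoded) bytes on both sides.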
```python
# TODO get test with encoding working

# TODO test writing grids of multiple chunks

# TODO test writing to a group that isn't the root group

# TODO test writing loadable variables

# TODO roundtripping tests - requires icechunk compatibility with xarray

# TODO test with S3 / minio
```
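For the "grids of multiple chunks" TODO, the manifest needs one dot-separated key per chunk-grid position (the "0.0" / "0.0.0" keys used above, with the grid size given by ceiling division of shape by chunk shape). A stdlib-only sketch of generating such entries; the helper name and the assumption that equally-sized chunks are stored contiguously are mine, for illustration only, since real files need not be laid out that way:

```python
from itertools import product
from math import ceil


def grid_manifest_entries(path, shape, chunks, chunk_nbytes, base_offset=0):
    """Build ChunkManifest-style entries for every chunk in the grid.

    Assumes all chunks are the same size and stored contiguously
    (an illustrative assumption, not a property of real netCDF files).
    """
    grid = [ceil(s / c) for s, c in zip(shape, chunks)]
    entries = {}
    for i, idx in enumerate(product(*(range(n) for n in grid))):
        key = ".".join(map(str, idx))  # e.g. (1, 0) -> "1.0"
        entries[key] = {
            "path": path,
            "offset": base_offset + i * chunk_nbytes,
            "length": chunk_nbytes,
        }
    return entries


entries = grid_manifest_entries("/test.nc", shape=(4, 6), chunks=(2, 3), chunk_nbytes=48)
print(sorted(entries))  # ['0.0', '0.1', '1.0', '1.1']
```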
Review comment: FYI @mpiannucci, this is what's left to do here outside of performance concerns.
Reply: Perfect. I will probably make a PR today with encoding tests; I have a first pass working.