Add tutorial notebook for open_virtual_dataset #903

Open
ayushnag opened this issue Dec 17, 2024 · 25 comments · May be fixed by #924

Comments

@ayushnag
Collaborator

The newly added open_virtual_dataset and open_virtual_mfdataset functions need a tutorial notebook to show example usage. The current sections I have planned are:

  1. Opening a dataset
  2. Opening a virtual dataset, saving the reference file, and loading the dataset with xarray
  3. Opening a virtual dataset with groups
  4. Opening a virtual dataset with preprocessing

cc @betolink @TomNicholas @danielfromearth

@betolink
Member

Thanks for opening this @ayushnag!

@danielfromearth
Collaborator

danielfromearth commented Dec 17, 2024

I just tried running this new functionality in an Openscapes JupyterHub instance. I used !pip install . --upgrade to install earthaccess directly from the main branch (after running git pull).

A ModuleNotFoundError: No module named 'virtualizarr' error was raised when executing the earthaccess.open_virtual_mfdataset() function. Does virtualizarr need to be added to the pyproject.toml?

@danielfromearth
Collaborator

Oh! I see now that it is declared in the optional-dependencies section of the pyproject.toml. So, I should be able to get past that error now. Do we want to move it into the main dependencies though?

@ayushnag
Collaborator Author

I think doing pip install "earthaccess[virtualizarr]" will work once this code is in the release version. For now I believe the method is pip install 'git+https://github.com/nsidc/earthaccess.git@main#egg=earthaccess[virtualizarr]'
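
A quick way to confirm that the optional extra is actually present before calling the virtual-dataset functions is to probe for the modules. This is only an illustrative sketch: the helper name missing_optional_deps and the module list are assumptions, not part of earthaccess.

```python
from importlib.util import find_spec


def missing_optional_deps(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module list: virtualizarr is the one named in this thread.
    missing = missing_optional_deps(["virtualizarr"])
    if missing:
        print(f"Missing {missing}; try: pip install 'earthaccess[virtualizarr]'")
    else:
        print("Optional virtualizarr dependency found.")
```

The check only shows that a module can be located. It does not prove the module imports cleanly in the current environment.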

@chuckwondo
Collaborator

You should be able to do pip install .[virtualizarr]

@danielfromearth
Collaborator

danielfromearth commented Dec 17, 2024

After installing via pip install .[virtualizarr], I get a couple of new messages. Thoughts on the following?

1) A numpy version warning upon import of xarray (though perhaps this is simply a result of pip installing earthaccess into the base environment for Openscapes' Jupyterhub?):

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

2) A ValueError during an HDF import statement in virtualizarr:

granules = earthaccess.search_data(
    short_name="TEMPO_NO2_L2",
    version="V03",
    count=3
)

result = earthaccess.open_virtual_mfdataset(
    granules=granules, 
    access="indirect", 
    concat_dim="time",  
    parallel=False, 
    preprocess=None
)

...raises the following (note: I truncated the beginning of the traceback, which starts with File ~/earthaccess/earthaccess/dmrpp_zarr.py:91, in open_virtual_mfdataset(granules, group, access, load, preprocess, parallel, **xr_combine_nested_kwargs))...

File /srv/conda/envs/notebook/lib/python3.10/site-packages/virtualizarr/readers/hdf/hdf.py:20
     14 from virtualizarr.manifests.manifest import validate_and_normalize_path_to_uri
     15 from virtualizarr.readers.common import (
     16     VirtualBackend,
     17     construct_virtual_dataset,
     18     open_loadable_vars_and_indexes,
     19 )
---> 20 from virtualizarr.readers.hdf.filters import cfcodec_from_dataset, codecs_from_dataset
     21 from virtualizarr.types import ChunkKey
     22 from virtualizarr.utils import _FsspecFSFromFilepath, check_for_collisions, soft_import

File /srv/conda/envs/notebook/lib/python3.10/site-packages/virtualizarr/readers/hdf/filters.py:16
     13     import h5py  # type: ignore
     14     from h5py import Dataset, Group  # type: ignore
---> 16 h5py = soft_import("h5py", "For reading hdf files", strict=False)
     17 if h5py:
     18     Dataset = h5py.Dataset

File /srv/conda/envs/notebook/lib/python3.10/site-packages/virtualizarr/utils.py:94, in soft_import(name, reason, strict)
     92 def soft_import(name: str, reason: str, strict: Optional[bool] = True):
     93     try:
---> 94         return importlib.import_module(name)
     95     except (ImportError, ModuleNotFoundError):
     96         if strict:

File /srv/conda/envs/notebook/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    124             break
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/h5py/__init__.py:25
     19 # --- Library setup -----------------------------------------------------------
     20 
     21 # When importing from the root of the unpacked tarball or git checkout,
     22 # Python sees the "h5py" source directory and tries to load it, which fails.
     23 # We tried working around this by using "package_dir" but that breaks Cython.
     24 try:
---> 25     from . import _errors
     26 except ImportError:
     27     import os.path as _op

File h5py/_errors.pyx:1, in init h5py._errors()

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

@chuckwondo
Collaborator

though perhaps this is simply a result of pip installing earthaccess into the base environment for Openscapes' Jupyterhub?

Definitely could be problematic, so try creating a new env instead, and see if you get the same problems.

@TomNicholas

@danielfromearth (1) and (2) are related - VirtualiZarr has a hard requirement to use numpy 2.0.0 or later because it makes a lot of use of the new variable-length string style internally. I think something in your environment is not compiled with numpy 2.
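
As a rough first diagnostic for the NumPy 2 requirement described above, the installed NumPy version can be read from package metadata without importing NumPy itself. The helper names below are illustrative assumptions, not part of VirtualiZarr or earthaccess.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Extract the leading integer from a version string such as '2.2.0'."""
    return int(version_string.split(".")[0])


def numpy_is_v2_or_later():
    """Return True or False for the installed NumPy, or None if it is absent."""
    try:
        return major_version(version("numpy")) >= 2
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("NumPy >= 2 installed:", numpy_is_v2_or_later())
```

This only inspects NumPy. It cannot tell whether a compiled extension such as h5py was built against a matching NumPy, which is what the ValueError above points to, so a fresh environment may still be the simplest fix.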

@battistowx
Collaborator

battistowx commented Dec 19, 2024

@danielfromearth I'm currently using hatch and the pyproject.toml to experiment on my local desktop. I would avoid using the 2i2c JupyterHub for now since it has dependency locks.

@danielfromearth
Collaborator

danielfromearth commented Dec 20, 2024

@battistowx, I've also tried it locally without issue, but part of my goal with testing was to try it in-region on AWS West 2. I'm still not sure how to modify the environment in 2i2c Jupyterhub, and if there are dependency locks, is there a way around that for testing purposes?

@betolink
Member

@danielfromearth great timing!! I was testing the latest image for the hub. Try restarting your instance in the admin console and use openscapes/python:07980b9. We still need to reinstall earthaccess from source because the virtualizarr work has not been released. We should release it today or Monday!

@danielfromearth
Collaborator

danielfromearth commented Dec 20, 2024

Okay, just tried this in an instance of openscapes/python:07980b9. When running this same code as in my previous comment:

result = earthaccess.open_virtual_mfdataset(
    granules=results, 
    access="indirect", 
    concat_dim="time",  
    parallel=False, 
    preprocess=None
)

The following new error is being raised by xarray:

TypeError: Could not find a Chunk Manager which recognises type <class 'virtualizarr.manifests.array.ManifestArray'>

This is out of my depth. Is it missing some optional dependency, something that provides a "Chunk Manager" that will know how to handle ManifestArrays?

@ayushnag
Collaborator Author

@danielfromearth You also need to pass the arguments coords='minimal', compat='override' to open_virtual_mfdataset to stop xarray from trying to load indexes on this dataset. Note that combining references will become much easier when virtualizarr natively supports open_virtual_mfdataset. The current earthaccess.open_virtual_mfdataset implements some logic that will eventually move upstream into virtualizarr; then we can call that function from earthaccess.
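
Put together with the earlier call from this thread, the suggestion might look like the sketch below. The wrapper name open_tempo_virtual and the COMBINE_KWARGS constant are made up for illustration. The search parameters and keyword arguments are the ones quoted above, and the sketch assumes Earthdata credentials are already configured.

```python
# Keyword arguments suggested in this thread; they are forwarded to xarray's
# combine step so that it does not try to load indexes.
COMBINE_KWARGS = {"coords": "minimal", "compat": "override"}


def open_tempo_virtual(count=3):
    """Search for TEMPO granules and open them as one combined dataset."""
    # Imported lazily: needs earthaccess installed with the virtualizarr extra.
    import earthaccess

    granules = earthaccess.search_data(
        short_name="TEMPO_NO2_L2",
        version="V03",
        count=count,
    )
    return earthaccess.open_virtual_mfdataset(
        granules=granules,
        access="indirect",
        concat_dim="time",
        parallel=False,
        preprocess=None,
        **COMBINE_KWARGS,
    )
```

Whether a group argument is also needed depends on the collection; that is not settled in this thread, so it is left out here.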

@danielfromearth

This comment was marked as outdated.

@danielfromearth
Collaborator

Please disregard my previous (now-hidden) comment! I discovered a typo that prevented the code from working (I was opening the same group twice and then trying to merge it with itself).

@TomNicholas

TomNicholas commented Dec 20, 2024

Sorry @danielfromearth - VirtualiZarr's xarray-at-the-top design means that obscure errors are sometimes thrown from deep inside xarray, and VirtualiZarr has no way to re-raise them with clearer messages. I have issues to track ways to make them clearer, but changing xarray is a more involved process than changing VirtualiZarr. (xref zarr-developers/VirtualiZarr#114 and pydata/xarray#8778)

For xr.merge you may also need to pass the same compat='override' again.
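
A minimal sketch of that merge call follows. The function name merge_virtual and its merge parameter are hypothetical; the parameter exists only so the sketch can be exercised without xarray installed.

```python
def merge_virtual(datasets, merge=None):
    """Merge datasets with compat='override' so variables are not compared."""
    if merge is None:
        # Imported lazily so the sketch can be loaded without xarray.
        import xarray as xr

        merge = xr.merge
    return merge(datasets, compat="override")
```

In practice the list would hold the virtual datasets opened from different groups of the same granules, for example merge_virtual([ds_a, ds_b]).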

I've tried to document this here but if you think any of this could be clearer in the VirtualiZarr docs please raise an issue there :)

@betolink
Member

Before opening an issue in VirtualiZarr: I was trying to open 1000 MUR granules and ran into this:

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/indexing.py:369, in IndexCallable.__getitem__(self, key)
    368 def __getitem__(self, key: Any) -> Any:
--> 369     return self.getter(key)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/core/indexing.py:1508, in NumpyIndexingAdapter._oindex_get(self, indexer)
   1506 def _oindex_get(self, indexer: OuterIndexer):
   1507     key = _outer_to_numpy_indexer(indexer, self.array.shape)
-> 1508     return self.array[key]

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/manifests/array.py:214, in ManifestArray.__getitem__(self, key)
    212     return self
    213 else:
--> 214     raise NotImplementedError(f"Doesn't support slicing with {indexer}")

Does it ring a bell? It worked for a few granules 🤔

@ayushnag
Collaborator Author

@betolink Could you share the code you used to get the error?

@TomNicholas

NotImplementedError(f"Doesn't support slicing with {indexer}")

Could be zarr-developers/VirtualiZarr#51, but I would need more context.

Also please raise new issues on VirtualiZarr!! That way other people will see them and have the opportunity to jump in and help.

@betolink
Member

Opened zarr-developers/VirtualiZarr#360

@battistowx
Collaborator

I've run into a few interesting errors as well when certain dependencies are not installed. I'll document those in VirtualiZarr too! Also, I'd love to be a part of this tutorial project and help where I can!

@ayushnag
Collaborator Author

@battistowx What specific errors have you run into? Perhaps the problem is in the earthaccess dependencies rather than in virtualizarr, in which case we need to add to the earthaccess[virtualizarr] optional dependency group. One I know of for sure is zarr: since we have the load=True param, we need zarr for that portion.

@battistowx
Collaborator

@ayushnag Yes, adding Zarr was an easy fix and that error appeared when importing earthaccess. I was also getting TypeError: Union[arg, ...]: each arg must be a type. in _extract_attrs when h5py was not installed, and a KeyError when using the access='direct' argument outside of us-west-2. These only require minor fixes, such as clearer exception statements and additions to the earthaccess dependency group.

@ayushnag
Collaborator Author

@battistowx We can fix most of the dependency problems by adding an integration test that checks that load=True works. The access='direct' one is challenging, since the only way we can catch it is by knowing whether the user is in us-west-2 or not. However, I think this is an ongoing issue since we don't have a clear way of determining if a user is "in-region".
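
One partial heuristic, sketched here under loud assumptions, is to look at the region environment variables that AWS tooling commonly reads (AWS_REGION and AWS_DEFAULT_REGION). Many hubs do not set them, so this is not a reliable in-region test, and the function name guess_access is hypothetical rather than anything in earthaccess.

```python
import os

IN_REGION = "us-west-2"


def guess_access(env=None):
    """Return 'direct' only when the environment advertises us-west-2."""
    env = os.environ if env is None else env
    # Assumption: the region, if known at all, is exposed through these variables.
    region = env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")
    return "direct" if region == IN_REGION else "indirect"
```

Falling back to "indirect" whenever the region is unknown keeps the default safe, at the cost of missing direct access on in-region machines that do not export either variable.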

@battistowx
Collaborator

@ayushnag No worries, we already put large warning banners in our in-region notebooks to warn the user of this, and so far that seems to be an effective way of doing it.

Currently, I'm working on testing MERRA-2, and some other collections that are currently Cloud OPeNDAP-enabled at GES DISC. I'll record my testing experience in a different Issues thread so that we can keep this tutorial-relevant.

@danielfromearth danielfromearth linked a pull request Jan 8, 2025 that will close this issue
9 tasks