
Draft of eNATL60 recipe #24

Open · roxyboy wants to merge 21 commits into master

Conversation

@roxyboy commented Mar 20, 2021

Added edits to pipeline.py to ingest eNATL60 data.

@roxyboy changed the title from "First commit for eNATL60 recipe" to "Draft of eNATL60 recipe" on Mar 20, 2021
@rabernat (Contributor)

Hi @roxyboy - sorry for the confusion here. The recipe should use the new syntax specified in the docs; see #20 for an example.

@rabernat (Contributor)

And I should mention that these instructions are not yet clear or up to date. Mostly we just need the Python code to generate a Recipe object. This tutorial explains it in detail.
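
For readers following along, a minimal sketch of such a recipe file, assuming the NetCDFtoZarrSequentialRecipe class that appears in the import traceback later in this thread; the URLs and sequence dimension here are hypothetical placeholders, not this PR's actual inputs:

from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe

# Hypothetical monthly source files; substitute the real eNATL60 URLs.
input_urls = [
    f"https://example.com/eNATL60-surface-2010-{month:02d}.nc"
    for month in range(2, 5)
]

# A single Recipe object describing how the NetCDF inputs map onto one Zarr store.
recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls,
    sequence_dim="time_counter",  # inputs are concatenated along this dimension
)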

@roxyboy (Author) commented Mar 22, 2021

> And I should mention that these instructions are not yet clear or up to date. Mostly we just need the Python code to generate a Recipe object. This tutorial explains it in detail.

Based on the tutorial, the recipe should no longer be a python script but rather be executed via a Jupyter notebook...?

@rabernat (Contributor)

No, it's a Python file. See #20.

@roxyboy (Author) commented Mar 22, 2021

@rabernat I pushed a commit where I basically copy-and-pasted PR #20 and edited it to suit eNATL60.

@rabernat (Contributor) left a comment

Thanks Takaya!

Have you actually tried running this recipe locally? I can't imagine it would have worked as-is, due to the different sizes of the regions.

Review threads (outdated, resolved): recipes/example/eNATL60/recipe.py; recipes/example/eNATL60/meta.yml (×3)
@roxyboy (Author) commented Mar 22, 2021

> Have you actually tried running this recipe locally? I can't imagine it would have worked as-is, due to the different sizes of the regions.

By running this locally, and assuming I make it work, will it push the data to OSN...?

@rabernat (Contributor)

> By running this locally, and assuming I make it work, will it push the data to OSN...?

Locally, you would assign other (local) storage targets and step through the recipe one bit at a time, as in the tutorial. You don't need to run the whole thing, but this lets you debug your own recipe rather than just guess; a sketch of what that looks like is below.
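
A sketch of that local debugging loop, with storage classes as in the pangeo-forge tutorial of the time and step methods matching the snippets later in this thread; exact module paths shifted between early versions, so treat the names as assumptions:

import tempfile

from fsspec.implementations.local import LocalFileSystem
from pangeo_forge.storage import CacheFSSpecTarget, FSSpecTarget

# Point the recipe at scratch space on local disk instead of cloud storage.
fs_local = LocalFileSystem()
cache_dir = tempfile.TemporaryDirectory()
recipe.input_cache = CacheFSSpecTarget(fs_local, cache_dir.name)
target_dir = tempfile.TemporaryDirectory()
recipe.target = FSSpecTarget(fs_local, target_dir.name)

# Step through just the first chunk rather than executing the whole recipe.
all_chunks = list(recipe.iter_chunks())
for input_file in recipe.inputs_for_chunk(all_chunks[0]):
    recipe.cache_input(input_file)
recipe.prepare_target()
recipe.store_chunk(all_chunks[0])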

@roxyboy (Author) commented Mar 22, 2021

I'm getting the following import error from rechunker when trying to run the recipe locally, but is this version-specific...?

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-166b08dae6ec> in <module>
      1 import pandas as pd
----> 2 from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe
      3 get_ipython().run_line_magic('pinfo', 'NetCDFtoZarrSequentialRecipe')

/mnt/meom/workdir/uchidat/pangeo-forge/pangeo_forge/recipe.py in <module>
     13 import xarray as xr
     14 import zarr
---> 15 from rechunker.types import MultiStagePipeline, ParallelPipelines, Stage
     16 
     17 from .patterns import ExplicitURLSequence, VariableSequencePattern

ImportError: cannot import name 'MultiStagePipeline' from 'rechunker.types' (/mnt/meom/workdir/uchidat/miniconda3/envs/pangeo/lib/python3.8/site-packages/rechunker-0.3.3-py3.8.egg/rechunker/types.py)

@rabernat (Contributor)

Ah, you have to install both pangeo-forge and rechunker from GitHub master. Thanks for your patience experimenting with bleeding-edge software!

@rabernat (Contributor)

This is useful. We can stop here for now. Once we have this recipe working, we can add others. Many recipes can live in the same .py file.

@roxyboy (Author) commented Mar 24, 2021

Is there a flexible way to prescribe the chunk size? It seems that the flag inputs_per_chunk is meant to do this, but can I prescribe the chunk size per dimension...?
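
The answer that emerges later in this thread is the target_chunks argument of XarrayZarrRecipe, which takes a per-dimension dict; a minimal sketch, with hypothetical input URLs and the pattern helper used in the full recipe below:

from pangeo_forge.patterns import pattern_from_file_sequence
from pangeo_forge.recipes import XarrayZarrRecipe

# Hypothetical two-file sequence concatenated along time_counter.
input_urls = ["https://example.com/f1.nc", "https://example.com/f2.nc"]
file_pattern = pattern_from_file_sequence(input_urls, "time_counter")

recipe = XarrayZarrRecipe(
    file_pattern,
    target_chunks={"time_counter": 72},  # chunk sizes per dimension in the target Zarr
)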

@roxyboy (Author) commented Mar 24, 2021

I'm getting the following error when running:

for input_file in recipe.inputs_for_chunk(all_chunks[0]):
    recipe.cache_input(input_file)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-18-8c67504fb058> in <module>
      1 for input_file in recipe.inputs_for_chunk(all_chunks[0]):
----> 2     recipe.cache_input(input_file)

/mnt/meom/workdir/uchidat/pangeo-forge/pangeo_forge/recipe.py in cache_func(input_key)
    290                         if not data:
    291                             break
--> 292                         target.write(data)
    293 
    294             if self._cache_metadata:

OSError: [Errno 28] No space left on device

I'm assuming this is due to the chunk sizes being too large... but is this understanding correct?

@rabernat (Contributor)

The recipe will first cache the entire dataset by downloading the files. It looks like you've filled up your hard disk.

You probably don't want to run the entire recipe locally. Instead, try to just run a few steps, as in this example: https://pangeo-forge.readthedocs.io/en/latest/tutorials/netcdf_zarr_sequential.html#step-5-create-storage-targets

I'm also going to play with this recipe today to see if I can get it working.

@roxyboy (Author) commented Mar 26, 2021

> I'm also going to play with this recipe today to see if I can get it working.

Could you let me know if you found further fixes I'd need for the recipe? Also, Yuvi asked us whether there's any data on OSN he could test with for the SWOT-AdAC JupyterHub deployment.

@rabernat (Contributor)

The region 1 dataset is now on OSN! 🎉

import xarray as xr
osn_url = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/swot_adac/eNATL60_surface_region_1'
ds = xr.open_zarr(osn_url, consolidated=True)
print(ds)
<xarray.Dataset>
Dimensions:        (time_counter: 2136, x: 611, y: 763)
Coordinates: (12/15)
    depth          (y, x) float32 dask.array<chunksize=(763, 611), meta=np.ndarray>
    e1f            (y, x) float64 dask.array<chunksize=(763, 611), meta=np.ndarray>
    e1t            (y, x) float64 dask.array<chunksize=(763, 611), meta=np.ndarray>
    e1u            (y, x) float64 dask.array<chunksize=(763, 611), meta=np.ndarray>
    e1v            (y, x) float64 dask.array<chunksize=(763, 611), meta=np.ndarray>
    e2f            (y, x) float64 dask.array<chunksize=(763, 611), meta=np.ndarray>
    ...             ...
    lat            (y, x) float32 dask.array<chunksize=(763, 611), meta=np.ndarray>
    lon            (y, x) float32 dask.array<chunksize=(763, 611), meta=np.ndarray>
    nav_lat        (y, x) float32 dask.array<chunksize=(763, 611), meta=np.ndarray>
    nav_lon        (y, x) float32 dask.array<chunksize=(763, 611), meta=np.ndarray>
    time_centered  (time_counter) datetime64[ns] dask.array<chunksize=(72,), meta=np.ndarray>
  * time_counter   (time_counter) datetime64[ns] 2010-02-01T00:30:00 ... 2010...
Dimensions without coordinates: x, y
Data variables: (12/13)
    fmask          (y, x) int8 dask.array<chunksize=(763, 611), meta=np.ndarray>
    qt_oce         (time_counter, y, x) float32 dask.array<chunksize=(72, 763, 611), meta=np.ndarray>
    somecrty       (time_counter, y, x) float32 dask.array<chunksize=(72, 763, 611), meta=np.ndarray>
    sometauy       (time_counter, y, x) float32 dask.array<chunksize=(72, 763, 611), meta=np.ndarray>
    sosaline       (time_counter, y, x) float32 dask.array<chunksize=(72, 763, 611), meta=np.ndarray>
    sossheig       (time_counter, y, x) float32 dask.array<chunksize=(72, 763, 611), meta=np.ndarray>
    ...             ...
    sowaflup       (time_counter, y, x) float32 dask.array<chunksize=(72, 763, 611), meta=np.ndarray>
    sozocrtx       (time_counter, y, x) float32 dask.array<chunksize=(72, 763, 611), meta=np.ndarray>
    sozotaux       (time_counter, y, x) float32 dask.array<chunksize=(72, 763, 611), meta=np.ndarray>
    tmask          (y, x) int8 dask.array<chunksize=(763, 611), meta=np.ndarray>
    umask          (y, x) int8 dask.array<chunksize=(763, 611), meta=np.ndarray>
    vmask          (y, x) int8 dask.array<chunksize=(763, 611), meta=np.ndarray>

I have to say, pangeo forge worked like a charm!

@rabernat (Contributor)

The latest version of the full recipe I am using is:

from itertools import product

from pangeo_forge.patterns import pattern_from_file_sequence
from pangeo_forge.recipes import XarrayZarrRecipe
import pandas as pd


regions = [1, 2, 3]
season_months = {
    'fma': pd.date_range("2010-02", "2010-05", freq="M"),
    'aso': pd.date_range("2009-08", "2009-10", freq="M")
}
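# NB: freq="M" yields month-end dates within [start, end], so the 'aso' range
# above produces only 2009-08-31 and 2009-09-30; October is missed, which
# matches the 61-step interior datasets reported (and rebuilt) further down.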

url_base = (
    "https://ige-meom-opendap.univ-grenoble-alpes.fr"
    "/thredds/fileServer/meomopendap/extract/SWOT-Adac"
)


def make_recipe_surface(region, season):
    input_url_pattern = url_base + "/Surface/eNATL60/Region{reg:02d}-surface-hourly_{yymm}.nc"
    months = season_months[season]
    input_urls = [input_url_pattern.format(reg=region, yymm=date.strftime("%Y-%m"))
                  for date in months]
    file_pattern = pattern_from_file_sequence(input_urls, "time_counter")

    recipe = XarrayZarrRecipe(
        file_pattern,
        target_chunks={'time_counter': 72}
    )
    
    return recipe


def make_recipe_interior(region, season):
    input_url_pattern = url_base + "/Interior/eNATL60/Region{reg:02d}-interior-daily_{yymm}.nc"    
    months = season_months[season]
    input_urls = [input_url_pattern.format(reg=region, yymm=date.strftime("%Y-%m"))
                  for date in months]
    file_pattern = pattern_from_file_sequence(input_urls, "time_counter")

    recipe = XarrayZarrRecipe(
        file_pattern,
        target_chunks={'time_counter': "50MB"}
    )
    return recipe


recipes = {f'eNATL60/Region{reg:02d}/surface_hourly/{season}': make_recipe_surface(reg, season)
           for reg, season in product(regions, season_months)}

recipes.update(
    {f'eNATL60/Region{reg:02d}/interior_daily/{season}': make_recipe_interior(reg, season)
     for reg, season in product(regions, season_months)}
)
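
As a quick sanity check, the two dict comprehensions above yield twelve recipes in total (3 regions × 2 seasons × 2 dataset types), keyed by the f-string templates:

for key in sorted(recipes):
    print(key)
# eNATL60/Region01/interior_daily/aso
# eNATL60/Region01/interior_daily/fma
# eNATL60/Region01/surface_hourly/aso
# ... (12 keys in total)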

cc @cisaacstern

@cisaacstern (Member)

Noting that pangeo-forge/pangeo-forge-recipes#164 is a blocker for running the interior recipes. Working on that issue now and hope to have some solutions soon.

@cisaacstern (Member) commented Jul 21, 2021

eNATL60 interior data is now on OSN, thanks to @rabernat's efforts on pangeo-forge/pangeo-forge-recipes#171 and pangeo-forge/pangeo-forge-recipes#166

It can be accessed via the swot_adac_ogcms catalog, xref pangeo-data/swot_adac_ogcms#3

@roxyboy (Author) commented Aug 2, 2021

I'm getting the following error for the summer interior data:

enatl01s = cat.eNATL60(region='1',depth='interior_daily', season='aso').to_dask()
---------------------------------------------------------------------------
NoSuchKey                                 Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.8/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
    233             try:
--> 234                 return await method(**additional_kwargs)
    235             except S3_RETRYABLE_ERRORS as e:

/srv/conda/envs/notebook/lib/python3.8/site-packages/aiobotocore/client.py in _make_api_call(self, operation_name, api_params)
    153             error_class = self.exceptions.from_code(error_code)
--> 154             raise error_class(parsed_response, operation_name)
    155         else:

NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: Unknown

The above exception was the direct cause of the following exception:

FileNotFoundError                         Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
    131         try:
--> 132             result = self.fs.cat(k)
    133         except self.missing_exceptions:

/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py in cat(self, path, recursive, on_error, **kwargs)
    240             if ex:
--> 241                 raise ex
    242         if (

/srv/conda/envs/notebook/lib/python3.8/site-packages/s3fs/core.py in _cat_file(self, path, version_id, start, end)
    736             head = {}
--> 737         resp = await self._call_s3(
    738             self.s3.get_object,

/srv/conda/envs/notebook/lib/python3.8/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
    251                 err = e
--> 252         raise translate_boto_error(err) from err
    253 

FileNotFoundError: An error occurred (NoSuchKey) when calling the GetObject operation: Unknown

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-11-301b31c96283> in <module>
      1 enatl01w = cat.eNATL60(region='1',depth='interior_daily', season='fma').to_dask()
----> 2 enatl01s = cat.eNATL60(region='1',depth='interior_daily', season='aso').to_dask()

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_xarray/base.py in to_dask(self)
     67     def to_dask(self):
     68         """Return xarray object where variables are dask arrays"""
---> 69         return self.read_chunked()
     70 
     71     def close(self):

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_xarray/base.py in read_chunked(self)
     42     def read_chunked(self):
     43         """Return xarray object (which will have chunks)"""
---> 44         self._load_metadata()
     45         return self._ds
     46 

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake/source/base.py in _load_metadata(self)
    234         """load metadata only if needed"""
    235         if self._schema is None:
--> 236             self._schema = self._get_schema()
    237             self.dtype = self._schema.dtype
    238             self.shape = self._schema.shape

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_xarray/base.py in _get_schema(self)
     16 
     17         if self._ds is None:
---> 18             self._open_dataset()
     19 
     20             metadata = {

/srv/conda/envs/notebook/lib/python3.8/site-packages/intake_xarray/xzarr.py in _open_dataset(self)
     29 
     30         self._mapper = get_mapper(self.urlpath, **self.storage_options)
---> 31         self._ds = xr.open_zarr(self._mapper, **self.kwargs)
     32 
     33     def close(self):

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/zarr.py in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, decode_timedelta, use_cftime, **kwargs)
    673     }
    674 
--> 675     ds = open_dataset(
    676         filename_or_obj=store,
    677         group=group,

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime, decode_timedelta)
    570 
    571         opener = _get_backend_cls(engine)
--> 572         store = opener(filename_or_obj, **extra_kwargs, **backend_kwargs)
    573 
    574     with close_on_error(store):

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, append_dim, write_region)
    292         if consolidated:
    293             # TODO: an option to pass the metadata_key keyword
--> 294             zarr_group = zarr.open_consolidated(store, **open_kwargs)
    295         else:
    296             zarr_group = zarr.open_group(store, **open_kwargs)

/srv/conda/envs/notebook/lib/python3.8/site-packages/zarr/convenience.py in open_consolidated(store, metadata_key, mode, **kwargs)
   1176 
   1177     # setup metadata store
-> 1178     meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
   1179 
   1180     # pass through

/srv/conda/envs/notebook/lib/python3.8/site-packages/zarr/storage.py in __init__(self, store, metadata_key)
   2678 
   2679         # retrieve consolidated metadata
-> 2680         meta = json_loads(store[metadata_key])
   2681 
   2682         # check format of consolidated metadata

/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
    134             if default is not None:
    135                 return default
--> 136             raise KeyError(key)
    137         return result
    138 

KeyError: '.zmetadata'

Winter data seems to be fine :)

@cisaacstern (Member)

@roxyboy, oops! Looks like I'd forgotten to consolidate zarr metadata for that dataset. It should be corrected now.
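
For reference, the fix for a missing .zmetadata key is zarr's consolidate_metadata; a sketch assuming write access to the OSN bucket, with a hypothetical store path:

import fsspec
import zarr

# Hypothetical path; credentials for the OSN endpoint are assumed.
store = fsspec.get_mapper("s3://Pangeo/pangeo-forge/swot_adac/eNATL60_interior_region_1")
zarr.consolidate_metadata(store)  # writes the .zmetadata key the traceback above could not find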

@roxyboy (Author) commented Sep 28, 2021

@cisaacstern I've noticed that the summer interior data only has 61 time steps, which is weird for three months of daily data... I'll check if I made a mistake in extracting the data but could you also check if the zarr metadata consolidation was correct?

@cisaacstern (Member)

@roxyboy I've rebuilt all six of the eNATL60 summer datasets (both interior and surface, for all three regions) to include the previously missing October data. The length of time_counter is now 92 for interior and 2208 for surface. Of course let me know if anything seems out of place.
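
A quick check of the rebuilt stores, assuming the same intake catalog object (cat) as in the traceback above; 92 = 31 + 30 + 31 days of daily data, and 2208 = 92 × 24 hourly steps:

ds = cat.eNATL60(region='1', depth='interior_daily', season='aso').to_dask()
assert ds.dims['time_counter'] == 92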

@andersy005 (Member)

Hi folks, I'm curious... is this recipe related to, or a duplicate of, https://github.com/pangeo-forge/eNATL60-feedstock (maintained by @auraoupa)?

@auraoupa (Contributor)

Not a duplicate; it is the same simulation but different extractions: here we have extracted some sub-regions in 3D (300 levels), while in https://github.com/pangeo-forge/eNATL60-feedstock it is the whole North Atlantic at some depths.

Labels: swot-adac (SWOT Adopt-a-Crossover Dataset)