Problems with distributed and opendap netCDF endpoint #2503

Closed

rabernat opened this issue Oct 23, 2018 · 26 comments

Comments

@rabernat
Contributor

Code Sample

I am trying to load a dataset from an OPeNDAP endpoint using xarray, netCDF4, and distributed. I am having a problem only with non-local distributed schedulers (KubeCluster specifically). This could plausibly be an xarray, dask, or pangeo issue, but I have decided to post it here.

import xarray as xr
import dask

# create dataset from Unidata's test opendap endpoint, chunked in time
url = 'http://remotetest.unidata.ucar.edu/thredds/dodsC/testdods/coads_climatology.nc'
ds = xr.open_dataset(url, decode_times=False, chunks={'TIME': 1})

# all these work
with dask.config.set(scheduler='synchronous'):
    ds.SST.compute()
with dask.config.set(scheduler='processes'):
    ds.SST.compute()
with dask.config.set(scheduler='threads'):
    ds.SST.compute()

# this works too
from dask.distributed import Client
local_client = Client()
with dask.config.set(get=local_client):
    ds.SST.compute()

# but this does not
from dask_kubernetes import KubeCluster  # KubeCluster is provided by the dask-kubernetes package
cluster = KubeCluster(n_workers=2)
kube_client = Client(cluster)
with dask.config.set(get=kube_client):
    ds.SST.compute()

In the worker logs, I see errors like the following.

distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 5, 0, 0)
distributed.worker - INFO - Dependent not found: open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf 0 . Asking scheduler
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 3, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 0, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 1, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 7, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 6, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 2, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 9, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 8, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 11, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 10, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 4, 0, 0)
distributed.worker - WARNING - Compute Failed Function: getter args: (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=_ElementwiseFunctionArray(LazilyOuterIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x7f45d6fcbb38>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))), func=functools.partial(<function _apply_mask at 0x7f45d70507b8>, encoded_fill_values={-1e+34}, decoded_fill_value=nan, dtype=dtype('float32')), dtype=dtype('float32')), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))))), (slice(3, 4, None), slice(0, 90, None), slice(0, 180, None))) kwargs: {} Exception: RuntimeError('NetCDF: Not a valid ID',)

Ultimately, the error comes from the netCDF library: RuntimeError('NetCDF: Not a valid ID',)

This seems like something to do with serialization of the netCDF store. The worker images have an identical netCDF version (and identical versions of all other packages). I am at a loss for how to debug further.
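
One way to probe the serialization hypothesis without a cluster is to round-trip the lazy variable through cloudpickle, which is what distributed uses to ship tasks; a minimal sketch, reusing the url and ds from the code sample above:

import cloudpickle

# serialize and reconstitute the lazy, chunked variable in-process,
# then force a read so the reconstituted netCDF store actually gets used
restored = cloudpickle.loads(cloudpickle.dumps(ds.SST))
print(restored.isel(TIME=0).mean().compute())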

Output of xr.show_versions()

xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.111+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.2
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: None
Nio: None
zarr: 2.2.0
bottleneck: None
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: 18.0
conda: 4.5.4
pytest: 3.8.0
IPython: 6.4.0
sphinx: None

kube_client.get_versions(check=True)

{'scheduler': {'host': (('python', '3.6.3.final.0'),
   ('python-bits', 64),
   ('OS', 'Linux'),
   ('OS-release', '4.4.111+'),
   ('machine', 'x86_64'),
   ('processor', 'x86_64'),
   ('byteorder', 'little'),
   ('LC_ALL', 'en_US.UTF-8'),
   ('LANG', 'en_US.UTF-8'),
   ('LOCALE', 'en_US.UTF-8')),
  'packages': {'required': (('dask', '0.18.2'),
    ('distributed', '1.22.1'),
    ('msgpack', '0.5.6'),
    ('cloudpickle', '0.5.5'),
    ('tornado', '5.0.2'),
    ('toolz', '0.9.0')),
   'optional': (('numpy', '1.15.1'),
    ('pandas', '0.23.2'),
    ('bokeh', '0.12.16'),
    ('lz4', '1.1.0'),
    ('blosc', '1.5.1'))}},
 'workers': {'tcp://10.20.8.4:36940': {'host': (('python', '3.6.3.final.0'),
    ('python-bits', 64),
    ('OS', 'Linux'),
    ('OS-release', '4.4.111+'),
    ('machine', 'x86_64'),
    ('processor', 'x86_64'),
    ('byteorder', 'little'),
    ('LC_ALL', 'en_US.UTF-8'),
    ('LANG', 'en_US.UTF-8'),
    ('LOCALE', 'en_US.UTF-8')),
   'packages': {'required': (('dask', '0.18.2'),
     ('distributed', '1.22.1'),
     ('msgpack', '0.5.6'),
     ('cloudpickle', '0.5.5'),
     ('tornado', '5.0.2'),
     ('toolz', '0.9.0')),
    'optional': (('numpy', '1.15.1'),
     ('pandas', '0.23.2'),
     ('bokeh', '0.12.16'),
     ('lz4', '1.1.0'),
     ('blosc', '1.5.1'))}},
  'tcp://10.21.177.254:42939': {'host': (('python', '3.6.3.final.0'),
    ('python-bits', 64),
    ('OS', 'Linux'),
    ('OS-release', '4.4.111+'),
    ('machine', 'x86_64'),
    ('processor', 'x86_64'),
    ('byteorder', 'little'),
    ('LC_ALL', 'en_US.UTF-8'),
    ('LANG', 'en_US.UTF-8'),
    ('LOCALE', 'en_US.UTF-8')),
   'packages': {'required': (('dask', '0.18.2'),
     ('distributed', '1.22.1'),
     ('msgpack', '0.5.6'),
     ('cloudpickle', '0.5.5'),
     ('tornado', '5.0.2'),
     ('toolz', '0.9.0')),
    'optional': (('numpy', '1.15.1'),
     ('pandas', '0.23.2'),
     ('bokeh', '0.12.16'),
     ('lz4', '1.1.0'),
     ('blosc', '1.5.1'))}}},
 'client': {'host': [('python', '3.6.3.final.0'),
   ('python-bits', 64),
   ('OS', 'Linux'),
   ('OS-release', '4.4.111+'),
   ('machine', 'x86_64'),
   ('processor', 'x86_64'),
   ('byteorder', 'little'),
   ('LC_ALL', 'en_US.UTF-8'),
   ('LANG', 'en_US.UTF-8'),
   ('LOCALE', 'en_US.UTF-8')],
  'packages': {'required': [('dask', '0.18.2'),
    ('distributed', '1.22.1'),
    ('msgpack', '0.5.6'),
    ('cloudpickle', '0.5.5'),
    ('tornado', '5.0.2'),
    ('toolz', '0.9.0')],
   'optional': [('numpy', '1.15.1'),
    ('pandas', '0.23.2'),
    ('bokeh', '0.12.16'),
    ('lz4', '1.1.0'),
    ('blosc', '1.5.1')]}}}
@jhamman
Member

jhamman commented Oct 23, 2018

I'm wondering whether there is some authentication state that is not being properly passed along to the workers. I'm not actually sure how OPeNDAP works in this case. Perhaps @dopplershift has some ideas here? Has anyone, maybe @rsignell-usgs, used a Kubernetes cluster with OPeNDAP endpoints?
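
A minimal sketch of one way to test that, assuming the kube_client from the original report: open the endpoint directly on every worker with netCDF4, bypassing the serialized xarray store. If this fails, worker-side access (or authentication) is the problem; if it succeeds, serialization of the store is the more likely culprit.

import netCDF4

def probe(url='http://remotetest.unidata.ucar.edu/thredds/dodsC/testdods/coads_climatology.nc'):
    # open a fresh OPeNDAP connection on the worker and read only metadata
    with netCDF4.Dataset(url) as nc:
        return nc.variables['SST'].shape

# runs the function on every worker and returns {worker address: result}
kube_client.run(probe)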

@rabernat
Contributor Author

This is fairly high priority for me, as it relates to the ongoing project to access CMIP6 data from an ESGF node running in Google Cloud (pangeo-data/pangeo#420).

@rsignell-usgs

rsignell-usgs commented Oct 23, 2018

I tried a similar workflow last week with an AWS Kubernetes cluster against OPeNDAP endpoints, and it also failed: https://nbviewer.jupyter.org/gist/rsignell-usgs/8583ea8f8b5e1c926b0409bd536095a9

I thought it was likely some intermittent problem that wasn't handled well. In my case, after a while I get:

distributed.worker - WARNING - Compute Failed Function: getter args: (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=_ElementwiseFunctionArray(LazilyOuterIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x7ff93cbbd828>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None), slice(None, None, None)))), func=functools.partial(<function _apply_mask at 0x7ff945421378>, encoded_fill_values={1e+37}, decoded_fill_value=nan, dtype=dtype('float64')), dtype=dtype('float64')), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None), slice(None, None, None)))))), (slice(375, 400, None), slice(0, 7, None), slice(0, 670, None), slice(0, 300, None))) kwargs: {} Exception: OSError(-72, 'NetCDF: Malformed or inaccessible DAP DDS')

@dopplershift
Contributor

Just so I'm clear on how the workflow looks:

  1. Open dataset with NetCDF/OPeNDAP
  2. Serialize NetCDFDataStore (pickle? netcdf file?)
  3. Ship to Dask workers
  4. Reconstitute NetCDFDataStore

It certainly does seem like there's something stale in what the remote workers are getting. I'm confused why it works for the other schedulers, though.

I can prioritize this a bit and dig in to see what I can figure out, though I'm teaching through tomorrow. I may be able to dig into this while at ECMWF.
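
A rough stand-in for steps 2 through 4 above, assuming the ds from the original report and assuming the backend pickles the open arguments rather than a live netCDF handle: serialize the lazy variable, hand the bytes to a separate process, and reconstitute and compute there, which is roughly what distributed does with its workers.

import pickle
from concurrent.futures import ProcessPoolExecutor

def reconstitute_and_compute(payload):
    sst = pickle.loads(payload)            # step 4: rebuild the store on the "worker"
    return float(sst.isel(TIME=0).mean())  # forces a fresh OPeNDAP read

payload = pickle.dumps(ds.SST)             # step 2: serialize (distributed uses cloudpickle)
with ProcessPoolExecutor(max_workers=1) as pool:
    print(pool.submit(reconstitute_and_compute, payload).result())  # step 3: ship it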

@rsignell-usgs

rsignell-usgs commented Oct 23, 2018

FWIW, in my workflow there was nothing fundamentally wrong, meaning that the requests worked for a while but would eventually die with the NetCDF: Malformed or inaccessible DAP DDS message.

So for just a short time period (in this case 50 time steps, 2 chunks in time), it would usually work:
https://nbviewer.jupyter.org/gist/rsignell-usgs/1155c76ed3440858ced8132e4cd81df4
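
For comparison against the testdods dataset from the original report (the dataset behind the notebook above isn't shown in this thread), a sketch of the same idea: keeping the request to a couple of time chunks so each DAP transfer stays short.

# two chunks along TIME, mirroring the "short time period" case that worked
small = ds.SST.isel(TIME=slice(0, 2))
with dask.config.set(get=kube_client):
    print(small.mean().compute())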

@rsignell-usgs

Perhaps it's also worth mentioning that I don't see any errors on the THREDDS server side, in either the Tomcat catalina or THREDDS threddsServlet logs. @lesserwhirls, any ideas?

@jhamman
Member

jhamman commented Oct 23, 2018

@rsignell-usgs - are you able to tell whether multiple processes (workers) have authenticated on the server side? I think this detail would really help us isolate the problem.

@lesserwhirls

So for just a short time period (in this case 50 time steps, 2 chunks in time), it would usually work:

The "short time period" makes me wonder...@dopplershift - could this be due to the netCDF-C / curl timeout issue you mentioned today?

@rsignell-usgs

@jhamman, doesn't this dask status plot tell us that multiple workers are connecting and getting data?
[screenshot: Dask status dashboard, 2018-10-23, showing multiple workers processing tasks and receiving data]

@rsignell-usgs

@lesserwhirls , is this the issue you are referring to? Unidata/netcdf4-python#836

@shoyer
Member

shoyer commented Oct 23, 2018

@rabernat have you tried using the development version of xarray? I think we fixed a few serialization/netCDF4 bugs with the backends refactor.
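
If trying the development version, it's worth confirming that the client and the workers actually end up on the same xarray; a minimal sketch, assuming the kube_client from the original report:

def xarray_version():
    import xarray
    return xarray.__version__

print(xarray_version())                  # client environment
print(kube_client.run(xarray_version))   # {worker address: version} for each worker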

@dopplershift
Contributor

dopplershift commented Oct 23, 2018

@lesserwhirls That's an interesting idea. (@rsignell-usgs That's the one.)

@rabernat What version of the conda-forge libnetcdf package is deployed wherever you're running?

@rabernat
Contributor Author

$ conda list libnetcdf
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
libnetcdf                 4.6.1                h10edf3e_1    defaults
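
conda list shows what is installed in the image; a complementary check is which libnetcdf the netCDF4 module actually loads at runtime, on the client and on every worker. A minimal sketch, assuming the kube_client from above:

import netCDF4

print(netCDF4.getlibversion())                 # e.g. "4.6.1 of ..." on the client
print(kube_client.run(netCDF4.getlibversion))  # the same, reported by each worker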

@rabernat
Contributor Author

I created a binderized version of this issue with the latest dev xarray and fresh installs of all other packages:
https://github.com/rabernat/pangeo_esgf_demo

It appears to work fine!

@rsignell-usgs

I fired up my notebook on @rabernat's binder env and it worked fine also:
https://nbviewer.jupyter.org/gist/rsignell-usgs/aebdac44a1d773b99673cb132c2ef5eb

@shoyer mentioned this issue Oct 24, 2018
@dopplershift
Contributor

The original version of libnetcdf in @rabernat's environment definitely had the OPeNDAP timeout issue. I'm not sure whether that's the root cause of the problem, but it's suspect.

@rsignell-usgs

The version that is working in @rabernat's esgf binder env is:

libnetcdf                 4.6.1               h9cd6fdc_11    conda-forge

@dopplershift
Contributor

That version has the fix for the issue.

@ocefpaf
Contributor

ocefpaf commented Oct 24, 2018

That version has the fix for the issue.

I know that @jjhelmus ported the fix to defaults, but I'm not sure which build number has it, and/or whether the previous one was removed, because defaults builds are not as transparent as conda-forge's 😄

He can probably say more about that.

@dopplershift
Contributor

Oh, I didn't even catch that the original was on defaults.

@jjhelmus
Contributor

In defaults, libnetcdf 4.6.1 build 1 and above contain the timeout fix; build 0 has the original timeout bug.

@ocefpaf
Contributor

ocefpaf commented Oct 24, 2018

In defaults, libnetcdf 4.6.1 build 1 and above contain the timeout fix; build 0 has the original timeout bug.

Thanks @jjhelmus! I guess that info and #2503 (comment) eliminate the timeout issue from the equation.

@jjhelmus
Contributor

h10edf3e_1 contains the timeout fix and is built against hdf5 1.10.2. The conda-forge h9cd6fdc_11 build is against hdf5 1.10.3; perhaps that makes a difference?

@ocefpaf
Contributor

ocefpaf commented Oct 24, 2018

h10edf3e_1 contains the timeout fix and is built against hdf5 1.10.2. The conda-forge h9cd6fdc_11 build is against hdf5 1.10.3; perhaps that makes a difference?

There are many variables at play here. The env that solved it in #2503 (comment) seems quite different from the env where the problem happened, including an xarray dev version. I'm not sure hdf5 is a good candidate to blame 😄

@jhamman
Member

jhamman commented Jan 13, 2019

@rabernat - do you think this was resolved? If I'm understanding the thread correctly, it seems this was a libnetcdf version issue. Feel free to reopen if I've got that wrong.

@rabernat
Contributor Author

rabernat commented Apr 9, 2019

This works with the latest libraries.

@rabernat closed this as completed on Apr 9, 2019