Problems with distributed and opendap netCDF endpoint #2503

Closed

rabernat opened this issue Oct 23, 2018 · 26 comments

Comments

@rabernat
Contributor

Code Sample

I am trying to load a dataset from an OPeNDAP endpoint using xarray, netCDF4, and distributed. I am having a problem only with non-local distributed schedulers (KubeCluster specifically). This could plausibly be an xarray, dask, or pangeo issue, but I have decided to post it here.

import xarray as xr
import dask

# create dataset from Unidata's test opendap endpoint, chunked in time
url = 'http://remotetest.unidata.ucar.edu/thredds/dodsC/testdods/coads_climatology.nc'
ds = xr.open_dataset(url, decode_times=False, chunks={'TIME': 1})

# all these work
with dask.config.set(scheduler='synchronous'):
    ds.SST.compute()
with dask.config.set(scheduler='processes'):
    ds.SST.compute()
with dask.config.set(scheduler='threads'):
    ds.SST.compute()

# this works too
from dask.distributed import Client
local_client = Client()
with dask.config.set(get=local_client):
    ds.SST.compute()

# but this does not
from dask_kubernetes import KubeCluster  # KubeCluster is provided by the dask-kubernetes package
cluster = KubeCluster(n_workers=2)
kube_client = Client(cluster)
with dask.config.set(get=kube_client):
    ds.SST.compute()

In the worker logs, I see errors like the following.

distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 5, 0, 0)
distributed.worker - INFO - Dependent not found: open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf 0 . Asking scheduler
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 3, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 0, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 1, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 7, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 6, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 2, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 9, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 8, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 11, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 10, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('open_dataset-4a0403564ad0e45788e42887b9bc0997SST-9fd3e5906a2a54cb28f48a7f2d46e4bf', 4, 0, 0)
distributed.worker - WARNING - Compute Failed Function: getter args: (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=_ElementwiseFunctionArray(LazilyOuterIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x7f45d6fcbb38>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))), func=functools.partial(<function _apply_mask at 0x7f45d70507b8>, encoded_fill_values={-1e+34}, decoded_fill_value=nan, dtype=dtype('float32')), dtype=dtype('float32')), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))))), (slice(3, 4, None), slice(0, 90, None), slice(0, 180, None))) kwargs: {} Exception: RuntimeError('NetCDF: Not a valid ID',)

Ultimately, the error comes from the netCDF library: RuntimeError('NetCDF: Not a valid ID',)

This seems like something to do with serialization of the netCDF store. The worker images have an identical netCDF version (and identical versions of all other packages). I am at a loss for how to debug further.
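
One way to probe the serialization hypothesis without a cluster is to round-trip the lazy variable through cloudpickle, which is what distributed uses to ship tasks; a minimal sketch, reusing the url and ds from the code sample above:

import cloudpickle

# serialize and reconstitute the lazy, chunked variable in-process,
# then force a read so the reconstituted netCDF store actually gets used
restored = cloudpickle.loads(cloudpickle.dumps(ds.SST))
print(restored.isel(TIME=0).mean().compute())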

Output of xr.show_versions()

xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.111+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.2
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: None
Nio: None
zarr: 2.2.0
bottleneck: None
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: 18.0
conda: 4.5.4
pytest: 3.8.0
IPython: 6.4.0
sphinx: None

kube_client.get_versions(check=True)

{'scheduler': {'host': (('python', '3.6.3.final.0'),
   ('python-bits', 64),
   ('OS', 'Linux'),
   ('OS-release', '4.4.111+'),
   ('machine', 'x86_64'),
   ('processor', 'x86_64'),
   ('byteorder', 'little'),
   ('LC_ALL', 'en_US.UTF-8'),
   ('LANG', 'en_US.UTF-8'),
   ('LOCALE', 'en_US.UTF-8')),
  'packages': {'required': (('dask', '0.18.2'),
    ('distributed', '1.22.1'),
    ('msgpack', '0.5.6'),
    ('cloudpickle', '0.5.5'),
    ('tornado', '5.0.2'),
    ('toolz', '0.9.0')),
   'optional': (('numpy', '1.15.1'),
    ('pandas', '0.23.2'),
    ('bokeh', '0.12.16'),
    ('lz4', '1.1.0'),
    ('blosc', '1.5.1'))}},
 'workers': {'tcp://10.20.8.4:36940': {'host': (('python', '3.6.3.final.0'),
    ('python-bits', 64),
    ('OS', 'Linux'),
    ('OS-release', '4.4.111+'),
    ('machine', 'x86_64'),
    ('processor', 'x86_64'),
    ('byteorder', 'little'),
    ('LC_ALL', 'en_US.UTF-8'),
    ('LANG', 'en_US.UTF-8'),
    ('LOCALE', 'en_US.UTF-8')),
   'packages': {'required': (('dask', '0.18.2'),
     ('distributed', '1.22.1'),
     ('msgpack', '0.5.6'),
     ('cloudpickle', '0.5.5'),
     ('tornado', '5.0.2'),
     ('toolz', '0.9.0')),
    'optional': (('numpy', '1.15.1'),
     ('pandas', '0.23.2'),
     ('bokeh', '0.12.16'),
     ('lz4', '1.1.0'),
     ('blosc', '1.5.1'))}},
  'tcp://10.21.177.254:42939': {'host': (('python', '3.6.3.final.0'),
    ('python-bits', 64),
    ('OS', 'Linux'),
    ('OS-release', '4.4.111+'),
    ('machine', 'x86_64'),
    ('processor', 'x86_64'),
    ('byteorder', 'little'),
    ('LC_ALL', 'en_US.UTF-8'),
    ('LANG', 'en_US.UTF-8'),
    ('LOCALE', 'en_US.UTF-8')),
   'packages': {'required': (('dask', '0.18.2'),
     ('distributed', '1.22.1'),
     ('msgpack', '0.5.6'),
     ('cloudpickle', '0.5.5'),
     ('tornado', '5.0.2'),
     ('toolz', '0.9.0')),
    'optional': (('numpy', '1.15.1'),
     ('pandas', '0.23.2'),
     ('bokeh', '0.12.16'),
     ('lz4', '1.1.0'),
     ('blosc', '1.5.1'))}}},
 'client': {'host': [('python', '3.6.3.final.0'),
   ('python-bits', 64),
   ('OS', 'Linux'),
   ('OS-release', '4.4.111+'),
   ('machine', 'x86_64'),
   ('processor', 'x86_64'),
   ('byteorder', 'little'),
   ('LC_ALL', 'en_US.UTF-8'),
   ('LANG', 'en_US.UTF-8'),
   ('LOCALE', 'en_US.UTF-8')],
  'packages': {'required': [('dask', '0.18.2'),
    ('distributed', '1.22.1'),
    ('msgpack', '0.5.6'),
    ('cloudpickle', '0.5.5'),
    ('tornado', '5.0.2'),
    ('toolz', '0.9.0')],
   'optional': [('numpy', '1.15.1'),
    ('pandas', '0.23.2'),
    ('bokeh', '0.12.16'),
    ('lz4', '1.1.0'),
    ('blosc', '1.5.1')]}}}
@jhamman
Member

jhamman commented Oct 23, 2018

I'm wondering whether there is some authentication state that is not being properly passed along to the workers. I'm not actually sure how OPeNDAP works in this case. Perhaps @dopplershift has some ideas here? Has anyone, maybe @rsignell-usgs, used a Kubernetes cluster with OPeNDAP endpoints?
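
A minimal sketch of one way to test that, assuming the kube_client from the original report: open the endpoint directly on every worker with netCDF4, bypassing the serialized xarray store. If this fails, worker-side access (or authentication) is the problem; if it succeeds, serialization of the store is the more likely culprit.

import netCDF4

def probe(url='http://remotetest.unidata.ucar.edu/thredds/dodsC/testdods/coads_climatology.nc'):
    # open a fresh OPeNDAP connection on the worker and read only metadata
    with netCDF4.Dataset(url) as nc:
        return nc.variables['SST'].shape

# runs the function on every worker and returns {worker address: result}
kube_client.run(probe)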

@rabernat
Contributor Author

This is fairly high priority for me, as it relates to the ongoing project to access CMIP6 data from an ESGF node running in Google Cloud (pangeo-data/pangeo#420).

@rsignell-usgs

rsignell-usgs commented Oct 23, 2018

I tried a similar workflow last week with an AWS Kubernetes cluster against OPeNDAP endpoints, and it also failed: https://nbviewer.jupyter.org/gist/rsignell-usgs/8583ea8f8b5e1c926b0409bd536095a9

I thought it was likely some intermittent problem that wasn't handled well. In my case, after a while I get:

distributed.worker - WARNING - Compute Failed Function: getter args: (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=_ElementwiseFunctionArray(LazilyOuterIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x7ff93cbbd828>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None), slice(None, None, None)))), func=functools.partial(<function _apply_mask at 0x7ff945421378>, encoded_fill_values={1e+37}, decoded_fill_value=nan, dtype=dtype('float64')), dtype=dtype('float64')), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None), slice(None, None, None)))))), (slice(375, 400, None), slice(0, 7, None), slice(0, 670, None), slice(0, 300, None))) kwargs: {} Exception: OSError(-72, 'NetCDF: Malformed or inaccessible DAP DDS')

@dopplershift
Contributor

Just so I'm clear on how the workflow looks:

  1. Open dataset with NetCDF/OPeNDAP
  2. Serialize NetCDFDataStore (pickle? netcdf file?)
  3. Ship to Dask workers
  4. Reconstitute NetCDFDataStore

It certainly does seem like there's something stale in what the remote workers are getting. I'm confused why it works for the other schedulers, though.

I can prioritize this a bit and dig in to see what I can figure out, though I'm teaching through tomorrow. I may be able to dig into this while at ECMWF.
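
A rough stand-in for steps 2 through 4 above, assuming the ds from the original report and assuming the backend pickles the open arguments rather than a live netCDF handle: serialize the lazy variable, hand the bytes to a separate process, and reconstitute and compute there, which is roughly what distributed does with its workers.

import pickle
from concurrent.futures import ProcessPoolExecutor

def reconstitute_and_compute(payload):
    sst = pickle.loads(payload)            # step 4: rebuild the store on the "worker"
    return float(sst.isel(TIME=0).mean())  # forces a fresh OPeNDAP read

payload = pickle.dumps(ds.SST)             # step 2: serialize (distributed uses cloudpickle)
with ProcessPoolExecutor(max_workers=1) as pool:
    print(pool.submit(reconstitute_and_compute, payload).result())  # step 3: ship it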

@rsignell-usgs

rsignell-usgs commented Oct 23, 2018

FWIW, in my workflow there was nothing fundamentally wrong, meaning that the requests worked for a while but would eventually die with the NetCDF: Malformed or inaccessible DAP DDS message.

So for just a short time period (in this case 50 time steps, 2 chunks in time), it would usually work:
https://nbviewer.jupyter.org/gist/rsignell-usgs/1155c76ed3440858ced8132e4cd81df4
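
For comparison against the testdods dataset from the original report (the dataset behind the notebook above isn't shown in this thread), a sketch of the same idea: keeping the request to a couple of time chunks so each DAP transfer stays short.

# two chunks along TIME, mirroring the "short time period" case that worked
small = ds.SST.isel(TIME=slice(0, 2))
with dask.config.set(get=kube_client):
    print(small.mean().compute())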

@rsignell-usgs

Perhaps it's also worth mentioning that I don't see any errors on the THREDDS server side, in either the Tomcat catalina or THREDDS threddsServlet logs. @lesserwhirls, any ideas?

@jhamman
Member

jhamman commented Oct 23, 2018

@rsignell-usgs - are you able to tell whether multiple processes (workers) have authenticated on the server side? I think this detail would really help us isolate the problem.

@lesserwhirls

So for just a short time period (in this case 50 time steps, 2 chunks in time), it would usually work:

The "short time period" makes me wonder...@dopplershift - could this be due to the netCDF-C / curl timeout issue you mentioned today?

@rsignell-usgs

@jhamman, doesn't this dask status plot tell us that multiple workers are connecting and getting data?
[screenshot: Dask status dashboard, 2018-10-23, showing multiple workers processing tasks and receiving data]

@rsignell-usgs

@lesserwhirls , is this the issue you are referring to? Unidata/netcdf4-python#836

@shoyer
Member

shoyer commented Oct 23, 2018

@rabernat have you tried using the development version of xarray? I think we fixed a few serialization/netCDF4 bugs with the backends refactor.
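
If trying the development version, it's worth confirming that the client and the workers actually end up on the same xarray; a minimal sketch, assuming the kube_client from the original report:

def xarray_version():
    import xarray
    return xarray.__version__

print(xarray_version())                  # client environment
print(kube_client.run(xarray_version))   # {worker address: version} for each worker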

@dopplershift
Contributor

dopplershift commented Oct 23, 2018

@lesserwhirls That's an interesting idea. (@rsignell-usgs That's the one.)

@rabernat What version of the conda-forge libnetcdf package is deployed wherever you're running?

@rabernat
Contributor Author

$ conda list libnetcdf
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
libnetcdf                 4.6.1                h10edf3e_1    defaults
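
conda list shows what is installed in the image; a complementary check is which libnetcdf the netCDF4 module actually loads at runtime, on the client and on every worker. A minimal sketch, assuming the kube_client from above:

import netCDF4

print(netCDF4.getlibversion())                 # e.g. "4.6.1 of ..." on the client
print(kube_client.run(netCDF4.getlibversion))  # the same, reported by each worker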

@rabernat
Contributor Author

I created a binderized version of this issue with the latest dev xarray and fresh installs of all other packages:
https://github.com/rabernat/pangeo_esgf_demo

It appears to work fine!

@rsignell-usgs

I fired up my notebook on @rabernat's binder env and it worked fine also:
https://nbviewer.jupyter.org/gist/rsignell-usgs/aebdac44a1d773b99673cb132c2ef5eb

@shoyer mentioned this issue Oct 24, 2018
@dopplershift
Contributor

The original version of libnetcdf in @rabernat's environment definitely had the OPeNDAP timeout issue. I'm not sure whether that's the root cause of the problem, but it's suspect.

@rsignell-usgs

The version that is working in @rabernat's esgf binder env is:

libnetcdf                 4.6.1               h9cd6fdc_11    conda-forge

@dopplershift
Contributor

That version has the fix for the issue.

@ocefpaf
Contributor

ocefpaf commented Oct 24, 2018

That version has the fix for the issue.

I know that @jjhelmus ported the fix to defaults, but I'm not sure which build number has it, and/or whether the previous one was removed, because defaults builds are not as transparent as conda-forge's 😄

He can probably say more about that.

@dopplershift
Contributor

Oh, I didn't even catch that the original was on defaults.

@jjhelmus
Contributor

In defaults, libnetcdf 4.6.1 build 1 and above contain the timeout fix; build 0 has the original timeout bug.

@ocefpaf
Contributor

ocefpaf commented Oct 24, 2018

In defaults, libnetcdf 4.6.1 build 1 and above contain the timeout fix; build 0 has the original timeout bug.

Thanks @jjhelmus! I guess that info and #2503 (comment) eliminate the timeout issue from the equation.

@jjhelmus
Contributor

h10edf3e_1 contains the timeout fix and is built against hdf5 1.10.2. The conda-forge h9cd6fdc_11 build is against hdf5 1.10.3; perhaps that makes a difference?

@ocefpaf
Contributor

ocefpaf commented Oct 24, 2018

h10edf3e_1 contains the timeout fix and is built against hdf5 1.10.2. The conda-forge h9cd6fdc_11 build is against hdf5 1.10.3; perhaps that makes a difference?

There are many variables at play here. The env that solved it in #2503 (comment) seems quite different from the env where the problem happened, including an xarray dev version. I'm not sure hdf5 is a good candidate to blame 😄

@jhamman
Member

jhamman commented Jan 13, 2019

@rabernat - do you think this was resolved? If I'm understanding the thread correctly, it seems this was a libnetcdf version issue. Feel free to reopen if I've got that wrong.

@rabernat
Contributor Author

rabernat commented Apr 9, 2019

This works with the latest libraries.

@rabernat closed this as completed on Apr 9, 2019