Problems with distributed and opendap netCDF endpoint #2503
Comments
I'm wondering if there is some authentication that is not being properly distributed to the workers. I'm not actually sure how opendap works in this case. Perhaps @dopplershift has some ideas here? Has anyone, maybe @rsignell-usgs, used the kubernetes cluster with opendap endpoints?
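As a hypothetical sanity check (the client name and URL below are placeholders, not taken from this thread), one could ask each worker to open the endpoint directly, which separates "workers can reach and authenticate to the server" from any serialization issue:

```python
from dask.distributed import Client

client = Client("<scheduler-address>")  # placeholder: the client attached to the KubeCluster

def try_open(url):
    # Open the opendap endpoint directly on the worker, bypassing any
    # serialization of an already-open netCDF handle.
    import netCDF4
    try:
        netCDF4.Dataset(url).close()
        return "ok"
    except Exception as e:
        return repr(e)

# Run try_open once on every worker; returns a dict keyed by worker address.
client.run(try_open, "https://esgf.example.org/thredds/dodsC/some/dataset")  # placeholder URL
```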
This is fairly high priority for me, as it relates to the ongoing project to access CMIP6 data from an ESGF node running in Google Cloud (pangeo-data/pangeo#420).
I tried a similar workflow last week with an AWS kubernetes cluster and opendap endpoints, and it also failed: https://nbviewer.jupyter.org/gist/rsignell-usgs/8583ea8f8b5e1c926b0409bd536095a9 I thought it was likely an intermittent problem that wasn't handled well. In my case, after a while I get:
Just so I'm clear on how the workflow looks:
It certainly does seem like there's something stale in what the remote workers are getting. I'm confused why it works for the others, though. I can prioritize this a bit and dig in to see what I can figure out, though I'm teaching through tomorrow. I may be able to look at this while at ECMWF.
FWIW, in my workflow there was nothing fundamentally wrong, meaning that the requests worked for a while but would eventually die with the error above. So for just a short time period (in this case 50 time steps, 2 chunks in time), it would usually work:
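For reference, a minimal sketch of that kind of time subsetting (the URL, variable name, and chunk size are assumptions, not taken from the notebook):

```python
import xarray as xr

url = "https://example.com/thredds/dodsC/some/dataset"  # placeholder opendap URL

# Chunk along time so each dask task issues a separate opendap request;
# 25 steps per chunk gives 2 chunks for a 50-step subset.
ds = xr.open_dataset(url, chunks={"time": 25})

subset = ds["some_var"].isel(time=slice(0, 50))  # placeholder variable name
subset.mean().compute()  # short requests like this usually succeeded
```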
Perhaps it's also worth mentioning that I don't see any errors on the THREDDS server side, in either the tomcat catalina or thredds threddsServlet logs. @lesserwhirls, any ideas?
@rsignell-usgs - are you able to tell if multiple processes (workers) have authenticated on the server side? I think this detail would really help us isolate the problem.
The "short time period" makes me wonder...@dopplershift - could this be due to the netCDF-C / curl timeout issue you mentioned today? |
@jhamman, doesn't this dask status plot tell us that multiple workers are connecting and getting data?
@lesserwhirls, is this the issue you are referring to? Unidata/netcdf4-python#836
@rabernat, have you tried using the development version of xarray? I think we fixed a few serialization / netCDF4 bugs with the backends refactor.
@lesserwhirls That's an interesting idea. (@rsignell-usgs That's the one.) @rabernat What version of the conda-forge libnetcdf package is deployed wherever you're running?
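One generic way to check which libnetcdf the Python stack is actually linked against (not the exact commands used in the thread):

```python
import netCDF4

# Version of the netCDF-C library (libnetcdf) that netCDF4-python was built against.
print(netCDF4.__netcdf4libversion__)

# Version of the netCDF4-python bindings themselves.
print(netCDF4.__version__)
```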
I created a binderized version of this issue with the latest dev xarray and fresh installs of all other packages. It appears to work fine!
I fired up my notebook on @rabernat's binder env and it worked fine also.
The original version of libnetcdf in @rabernat's environment definitely had the opendap timeout issue. I'm not sure whether that's the root cause of the problem, but it's suspect.
The version that is working in @rabernat's esgf binder env is:
That version has the fix for the issue.
I know that @jjhelmus ported the fix to defaults; he can probably say more about that.
Oh, I didn't even catch that the original was on defaults.
Thanks @jjhelmus! I guess that info and #2503 (comment) eliminate the timeout issue from the equation.
h10edf3e_1 contains the timeout fix and is built against hdf5 1.10.2.
There are many variables at play here. The env that solved it in #2503 (comment) seems quite different from the env where the problem happened.
@rabernat - do you think this was resolved? If I'm understanding the thread correctly, it seems this was a libnetcdf version issue. Feel free to reopen if I've got that wrong.
This works with the latest libraries.
Code Sample
I am trying to load a dataset from an opendap endpoint using xarray, netCDF4, and distributed. I am having a problem only with non-local distributed schedulers (KubeCluster specifically). This could plausibly be an xarray, dask, or pangeo issue, but I have decided to post it here.
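The original code sample is not reproduced here; the sketch below shows the general shape of the workflow described, with a placeholder URL, variable name, and cluster configuration:

```python
import xarray as xr
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# The problem only appears with non-local schedulers such as KubeCluster.
cluster = KubeCluster()  # placeholder: assumes a configured worker pod spec
cluster.scale(4)
client = Client(cluster)

# Open the opendap endpoint lazily with dask-backed chunks.
url = "https://esgf.example.org/thredds/dodsC/cmip6/some/dataset"  # placeholder URL
ds = xr.open_dataset(url, chunks={"time": 10})

# Any computation that pushes reads onto the remote workers triggers the failure.
ds["some_var"].mean().compute()  # placeholder variable name
```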
In the worker log, I see the following sort of errors.
Ultimately, the error comes from the netCDF library:
RuntimeError('NetCDF: Not a valid ID',)
This seems like something to do with serialization of the netCDF store. The worker images have identical netCDF versions (and identical versions of all other packages). I am at a loss for how to debug further.
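A sketch of the version checks referenced below (cube_client is assumed to be the distributed Client attached to the cluster):

```python
import xarray as xr
from dask.distributed import Client

cube_client = Client("<scheduler-address>")  # placeholder: the Client attached to the KubeCluster

# Environment report for the client process.
xr.show_versions()

# Compare package versions across client, scheduler, and workers;
# check=True raises if they disagree.
cube_client.get_versions(check=True)
```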
Output of xr.show_versions()
cube_client.get_versions(check=True)