Incomplete Data Reading from URL #836
Comments
So I ran your code just now and got a complete image with no missing data. It's possible the data file on the server just wasn't complete. Was this the most recent file when you tried?
Originally, it was the most recent file, but I just reran the code and got 44037412 missing pixels. When I try the code on a different machine (with 1.3.1), I see the same as you: no missing data. I tried creating a new barebones conda environment (brings in netCDF4 1.4.0 by default) and I still got missing pixels. Then I tried an environment with netCDF4 1.3.1, since that worked on the other machine, and, lo and behold, there are no missing data! Which version of netCDF4 are you running? I could try 1.4.1, which I see has been released, I guess through a manual install, if that would be useful.
My results were with 1.4.1 from conda-forge on my Mac.
Works for me with 1.4.0 on Python 3.6 |
OK, I just tried a conda-forge-based environment with 1.4.1, and I am getting missing pixels. Here is the output of
The environment that works on my machine:
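(The conda list output did not survive the copy here.) As an alternative to comparing conda environments, a quick way to confirm from inside Python which module and C-library versions an environment is actually using is a check along these lines:

```python
# Report the netCDF4-python version and the C libraries it is linked against.
import netCDF4

print("netCDF4-python:", netCDF4.__version__)
print("libnetcdf:     ", netCDF4.__netcdf4libversion__)
print("libhdf5:       ", netCDF4.__hdf5libversion__)
```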
I don't have a Mac handy, but trying a Windows 10 machine (the other tests were on CentOS 7 machines) makes matters even more confusing: I have missing pixels with both 1.4.1 and 1.3.1.
As a final test for now, I took the same Windows 10 machine home, so it's on my home network rather than the campus network, and there was no change in the results.
So we have all this data... and I have no clue what conclusion to draw from it. 😆
My conclusion: Use a Mac! Interestingly, I found a different conda environment on CentOS 7 that has missing pixels even with 1.3.1, but it has libnetcdf 4.6.1 (in common with my other broken environment). So my current working hypothesis is that the Python module isn't the issue; libnetcdf is. 4.5.0 is OK (on CentOS anyway), but 4.6.1 is not.
The combination of netCDF4 1.3.1 and libnetcdf 4.5.0 also works on Windows 10. It appears it is libnetcdf 4.6.1 that is at fault.
Unfortunately for that theory (which might not be completely wrong), 4.6.1 is working fine for me here. What happens if instead of
OK, I made that tweak. Running the revised program on CentOS with 1.4.1/4.6.1 results in a hang. Even Control-C didn't give me a prompt back, but using
I ran the program a second time, and after 14 minutes, it finally crashed with:
However, I see the server is indeed down, so the above may be a red herring.
Now that the server is back up, I can report the program hangs on Windows 10 with 1.4.1/4.6.1 as well.
@sgdecker I just read this issue quickly and it seems that the issues are all on Windows, right? I'll try to get a Windows machine and test it. Note that we had issues building it for Python 2.7, backporting patches that fixed OPeNDAP problems, etc.
@ocefpaf no, I am also having the issue on CentOS, which is exhibiting the same behavior for me as Windows 10: 1.3.1/4.5.0 is fine, but 1.3.1/4.6.1 and 1.4.1/4.6.1 are not.
I think this might help us narrow this problem down. I have two different Linux machines, each with identical versions of
One works fine; the other displays the problem! @DennisHeimbigner does netCDF4 access of OPeNDAP URLs depend on any other packages, or could there be some difference caused by internet connectivity/timeout? I tried setting
I just got my hands on a Mac, and, contrary to @dopplershift's experience, I am seeing the same behavior (with the original test program) as with the other machines I've tried: 1.4.1/4.6.1 has missing data, but 1.3.1/4.5.0 is fine.
This makes me think some kind of network issue is involved; I'm cheating because I'm sitting next to the data. 😁 I will say that I do see a problem with the slicing I suggested. It's not technically a hang; it just took > 45 minutes to error out.
Addendum: Oh W.T.F. 😡 (cc @lesserwhirls @jrleeman )
The client is requesting individual grid points, so it's making a round-trip request to download 3 bytes. 😱 That seems... like something that should be improved.
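If it helps anyone reproduce the strided-access behavior, the difference boils down to something like the sketch below; the URL and variable name are placeholders, not the actual dataset from this report:

```python
import netCDF4

# Placeholder OPeNDAP endpoint and variable name; substitute a real dataset.
url = "https://thredds.example.edu/thredds/dodsC/some/dataset"

with netCDF4.Dataset(url) as nc:
    var = nc.variables["Sectorized_CMI"]  # hypothetical 2-D field

    # Strided read: as noted above, the client can end up issuing one tiny
    # DAP request per grid point, which is what makes it so slow.
    sub_slow = var[::10, ::10]

    # Workaround: fetch the full block in one request, then subsample locally.
    sub_fast = var[:][::10, ::10]
```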
I just tested many versions of the
Is someone in a position to run a packet sniffer on this? In particular, to look
If @sgdecker (or anyone seeing the missing data) can send me their IP address (here or email), I can look at the server logs and see if I see anything.
I should be able to run
My CentOS machine is 165.230.171.64.
Sorry, I have not used a packet sniffer in a long time, so I can't help.
Opened #838 for the issue about the strided access taking forever.
So, I've been able to reproduce this on my system without changing my environment, by running from my home (even though I've got a 1 Gb/s connection). So here's a log of my OPeNDAP download attempts:
The interesting part to glean is that the request should return 240000223 bytes. For netcdf 4.5.0, this amount of data is always returned. For 4.6.1, this amount only seems to be returned if the time of the connection is less than, it seems, 10000 (ms?). This time effect is even more pronounced in @sgdecker's downloads:
Here we see that with 4.6.1, his connections always stop right around 10000, whereas with 4.5.0 they last until 20k and the correct amount of data is returned. On my machine, I can confirm 4.6.1 is problematic both from conda-forge and Homebrew. 4.5.1 seems to have no problem. This is in a conda environment where the only thing changing is libnetcdf (no changes to netcdf4-python or libcurl). @DennisHeimbigner did something about OPeNDAP/curl change in 4.6.1? (cc @WardF )
I do not think so, but I will have to review the pull requests.
@DennisHeimbigner This is all using the same running TDS 5.0 instance on http://thredds-test.unidata.ucar.edu.
always use nc_get_vars for strided access over http (issue #836)
Add the ability to set some additional curlopt values via .daprc (aka .dodsrc). This affects both DAP2 and DAP4 protocols. Related issues: [1] re: esupport: KOZ-821332; [2] re: github issue Unidata/netcdf4-python#836; [3] re: github issue #1074.
1. CURLOPT_BUFFERSIZE: Relevant to [1]. Allow the user to set the read/write buffer size used by curl. This is done by adding the following to .daprc (aka .dodsrc): HTTP.READ.BUFFERSIZE=n, where n is the buffer size in bytes. There is a built-in (to curl) limit of 512k for this value.
2. CURLOPT_TCP_KEEPALIVE (and CURLOPT_TCP_KEEPIDLE and CURLOPT_TCP_KEEPINTVL): Relevant (maybe) to [2] and [3]. Allow the user to turn on KEEPALIVE. This is done by adding the following to .daprc (aka .dodsrc): HTTP.KEEPALIVE=on|n/m. If the value is "on", then simply enable default KEEPALIVE. If the value is n/m, then enable KEEPALIVE and set KEEPIDLE to n and KEEPINTVL to m.
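Based only on the settings named in that commit message, a user-side .daprc (aka .dodsrc) might look roughly like this; the numeric values are illustrative, not recommendations:

```
# ~/.daprc (aka ~/.dodsrc)
# curl read/write buffer size in bytes (curl caps this at 512k)
HTTP.READ.BUFFERSIZE=524288
# enable TCP keepalive with KEEPIDLE=60 and KEEPINTVL=30
HTTP.KEEPALIVE=60/30
```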
add 'master_file' kwarg to MFDataset.__init__ (issue #836)
Maybe a related problem, maybe different, but the following (as a script):
consistently fails for me with conda-forge libnetcdf 4.6.0 and 4.6.1, but not with 4.5.0-3. Output is:
This is with a Mac and with Linux. With the URL above, the request works for some time slices, but not for others. With a related URL, I can't even get a single time slice; the following fails:
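The script and its output did not survive the copy here, but the failure mode described can be illustrated with a sketch along these lines (the URL and variable layout are placeholders, not @efiring's actual dataset):

```python
import numpy as np
import netCDF4

# Hypothetical DODS/OPeNDAP endpoint; substitute the real dataset URL.
url = "https://dods.example.edu/opendap/some/ocean/dataset"

with netCDF4.Dataset(url) as nc:
    uwnd = nc.variables["uwnd"]      # assumed (time, lat, lon) layout
    slab = uwnd[0, :, :]             # a single time slice
    # With the affected libnetcdf versions the read either errors out after a
    # long wait or silently comes back as all zeros / all masked.
    print(slab.shape, np.ma.count_masked(slab))
```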
@dopplershift at this point I'm inclined to revert all of conda-forge's builds to an earlier libnetcdf. It will be quite painful to re-build everything, but at least users will get a working environment again. (Note that wheels are still building with 4.4.x, due to Python 2.7 on Windows.)
Can someone try this experiment?
There may be two problems here.
Let me suggest another experiment. Meanwhile, I will see if I am the one setting the timeout to 10 seconds. |
OK, now it looks like I am waiting for 1000 s to elapse. |
That is odd. It implies that the server is taking a very long time to respond. In any case, I do set the default timeout to 10 seconds. Should I change it?
On the other hand, the uwnd variable is 152692623360 bytes (~152 gig),
As an aside, I have a fix in to allow the setting of the curl download buffer size
But I was downloading only one time slice, so it should have been quick. It took over 500 s. It looks like at least part of the problem here is server-side. I'm trying this URL now with the earlier libnetcdf, and it is still taking a long time. I'm in contact with the people running the server, so we can look into that.
As you can see above, the download was only 628x1440 floats, so unless they had to be collected from a huge number of files, it should have been very fast. I would be surprised if a single time slice is not in a single file.
I'm beginning to think that the problem I originally reported might be an interaction between the timeout value and a badly-behaving (or at least oddly-behaving) server, since the test above took slightly longer (611 s) just now with libnetcdf 4.4.1.1.
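For anyone wanting to quantify the server-side slowness separately from the timeout question, a rough timing sketch (placeholder URL again, same assumed dataset layout as above):

```python
import time
import netCDF4

url = "https://dods.example.edu/opendap/some/ocean/dataset"  # placeholder

with netCDF4.Dataset(url) as nc:
    uwnd = nc.variables["uwnd"]
    t0 = time.perf_counter()
    slab = uwnd[0, :, :]   # one 628x1440 time slice, roughly 3.6 MB of float32
    elapsed = time.perf_counter() - t0
    print(f"read {slab.size} values in {elapsed:.1f} s")
```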
@DennisHeimbigner I'm not sure about @efiring's server, since it looks like it's a straight DODS server, but the original problem was from our own TDS running on http://thredds.ucar.edu. I would say that 10 seconds is way too quick. If that's the time the connection takes to finish, that's going to have to allow for download time. Speaking as someone on hotel wifi right now, we cannot assume that users have a performant network connection. So whether that means setting it to something large like 1800 (half an hour) or just disabling it altogether, I leave that up to the netCDF team or other interested parties to decide. But I can say that with netCDF 4.5, the connections were finishing in 15-25 seconds, so the new 10 seconds is (relative to the real world) ridiculously short.
I am going to set the default timeout to a much larger value; I dislike setting it to no timeout at all. The slow response time from the server is baffling. Some possible causes (all unlikely):
@DennisHeimbigner while you fix this upstream, I'm looking for a short-term solution for the users of the packages we are building on conda-forge.
Both 1 and 2 are a lot of work and will take some time. So my question is: is there a way to do 3 with a flag or something? Or, if you point me to a PR when you address it, I can try to backport it.
You can also use .daprc to set the timeout, as I indicated in a previous comment.
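For reference, the relevant entry would be something along these lines (the key name is taken from the netCDF-C DAP documentation; the value is just an example):

```
# ~/.daprc (aka ~/.dodsrc)
# time in seconds the DAP client waits before giving up on a request
HTTP.TIMEOUT=1800
```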
My guess is that a
I'll give that a try. Thanks.
Yes, each user would have to set this up.
The dataset mentioned in #836 (comment) seems to be gone, but I tested the longer-timeout PR with @efiring's example and, after a long wait, I get the correct data. Here are the results for the same test under other versions; sometimes I get all zeroes, sometimes I get the proper data:
I am looking for a script to extract the variable T3P1_IntegralProtonFlux from files and create a CSV or SQLite file from them. Ultimately, I want to combine data from files from 1/1/2019 to today.
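That request is tangential to this issue, but a minimal sketch of such an extraction might look like the following; the file pattern, the presence of a time variable, and the data layout are all assumptions about files I have not seen:

```python
import csv
import glob
import netCDF4

# Hypothetical file naming; adjust the glob pattern to the real files.
files = sorted(glob.glob("sci_*.nc"))

with open("proton_flux.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["time", "T3P1_IntegralProtonFlux"])
    for path in files:
        with netCDF4.Dataset(path) as nc:
            times = nc.variables["time"][:]                    # assumed time coordinate
            flux = nc.variables["T3P1_IntegralProtonFlux"][:]
            for t, f in zip(times, flux):
                writer.writerow([t, f])
```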
Example Code
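The original example code did not survive the copy into this archive; its gist, reconstructed from the discussion above, was something like the sketch below (the THREDDS URL and variable name are placeholders, not the script as filed):

```python
import matplotlib.pyplot as plt
import numpy as np
import netCDF4

# Placeholder for the GOES sector dataset served from the Unidata TDS.
url = "https://thredds.ucar.edu/thredds/dodsC/some/goes/dataset"

nc = netCDF4.Dataset(url)
data = nc.variables["Sectorized_CMI"][:]   # hypothetical 2-D imagery field

# Count pixels that came back masked (the "missing pixels" figure reported below).
print(np.ma.count_masked(data))

plt.imshow(data)
plt.show()
nc.close()
```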
Problem
I am not sure if this is an issue with netCDF4, the server, or my machine, but I am getting a value on the order of 40 million printing out, and the plot shows incomplete data:
Each time I run the example code, I get slightly different results. The last three runs indicated 42159004, 38724464, and 44907954 missing pixels, respectively.
I would expect either a value of 0 printing out (and no missing data), or some sort of error message indicating there was trouble retrieving the data.
I am using netCDF4 1.4.0.