Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

missing_value is a string for numeric NetCDF array (getindex fail with v0.12.0) #166

Closed
gaelforget opened this issue Mar 11, 2022 · 15 comments

Comments

@gaelforget
Copy link
Contributor

Describe the bug

Indexing a Dataset variable by name, in a case that used to work until at least v0.11.18, now returns an error with v0.12.0. Could have to do with the introduction of SymbolOrString but unclear.

To Reproduce

run(`wget ftp://ftp.aoml.noaa.gov/pub/phod/lumpkin/hourly/v1.04/netcdf/gps/drifter_10050130.nc`)
ds=Dataset("drifter_10050130.nc")
haskey(ds,"longitude")
ds["longitude"]

Expected behavior

ds["longitude"] should work. The example quoted here has been part of the test suite for OceanRobots.jl for a while.

Environment

Full output

ds["longitude"]
ERROR: MethodError: no method matching Float32(::String)
Closest candidates are:
  (::Type{T})(::AbstractChar) where T<:Union{AbstractChar, Number} at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/char.jl:50
  (::Type{T})(::Base.TwicePrecision) where T<:Number at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/twiceprecision.jl:255
  (::Type{T})(::Complex) where T<:Real at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/complex.jl:44
  ...
Stacktrace:
 [1] _broadcast_getindex_evalf
   @ ./broadcast.jl:670 [inlined]
 [2] _broadcast_getindex
   @ ./broadcast.jl:643 [inlined]
 [3] getindex
   @ ./broadcast.jl:597 [inlined]
 [4] copy
   @ ./broadcast.jl:875 [inlined]
 [5] materialize(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, Type{Float32}, Tuple{Base.RefValue{String}}})
   @ Base.Broadcast ./broadcast.jl:860
 [6] getindex(ds::NCDataset{Nothing}, varname::String)
   @ NCDatasets ~/.julia/packages/NCDatasets/Iaww4/src/cfvariable.jl:373
 [7] top-level scope
   @ REPL[8]:1
@visr
Copy link
Member

visr commented Mar 11, 2022

Huh that's odd, there is a missing_value with a string value in this dataset.

	float longitude(traj, obs) ;
		longitude:long_name = "Longitude" ;
		longitude:units = "degrees_east" ;
		longitude:axis = "X" ;
		longitude:valid_min = "-180.f" ;
		longitude:valid_max = "180.f" ;
		longitude:standard_name = "longitude" ;
		longitude:missing_value = "-1.e+34f" ;

It looks like that case is not handled by this commit: 083f23f#diff-014bcddfc04a56e29ce6e91a3f2ab177d429989fd4709895908b2937ada9feafR372

Where it tries to convert the value to the eltype of the array.

@gaelforget
Copy link
Contributor Author

Huh that's odd, there is a missing_value with a string value in this dataset.

Good point!

Not sure why they did that. Also I don't understand why "-1.e+34f" rather than "-1.e+34", why valid_min = "-180.f" ;, and so on. I guess there is no guarantee that everyone deals with these attributes perfectly well when creating files.

I sent a PR with a rudimentary fix in #167

@Alexander-Barth
Copy link
Member

NCDatasets 0.12 is indeed the first release which parses missing_value (and replace them as Missing).
I guess a similar issue can also arrise if the NetCDF variable is of type Int and the missing value is say 0.5 (which cannot be converted to a Int). I am wondering if we need any convertion at all as 123 == "some string" is not an error.

@Alexander-Barth
Copy link
Member

@Gael can you try the branch "not_convert_missing_values" ?

@Alexander-Barth
Copy link
Member

By the way, this is what I get in Python's netCDF4:

In [1]: import netCDF4
In [2]: netCDF4.Dataset("drifter_10050130.nc")["longitude"][:]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [2], in <module>
----> 1 netCDF4.Dataset("drifter_10050130.nc")["longitude"][:]

File src/netCDF4/_netCDF4.pyx:4445, in netCDF4._netCDF4.Variable.__getitem__()

File src/netCDF4/_netCDF4.pyx:4514, in netCDF4._netCDF4.Variable._toma()

File src/netCDF4/_netCDF4.pyx:4801, in netCDF4._netCDF4.Variable._check_safecast()

ValueError: could not convert string to float: '-1.e+34f'

In [4]: netCDF4.__version__
Out[4]: '1.5.8'

Maybe it would be good to contact the author of the dataset.

@Alexander-Barth Alexander-Barth changed the title getindex fail with v0.12.0 missing_value is a string for numeric NetCDF array (getindex fail with v0.12.0) Mar 14, 2022
@Alexander-Barth
Copy link
Member

The branch not_convert_missing_values is now merged into master with a test case:

https://github.com/Alexander-Barth/NCDatasets.jl/blob/91b13156963af49829c384b0c1fea2b4b7f30a39/test/test_missing_value.jl#L30-L33

Note that we do not try to parse the strings. The string missing value is effectively ignored (for numeric NetCDF variables).

If you want to use the missing value you can use the new function cfvariable (for the upcoming NCDatasets 0.12.1):

nv = cfvariable(ds,"longitude", missing_value = -1.e+34);  # instead of nv = ds["longitude"];
lon = nv[:]

I can also read your file drifter_10050130.nc.

@gaelforget
Copy link
Contributor Author

Hi @Alexander-Barth

Thanks for the fix. I had sent PR #167 which kept the type conversion when missing value is not a text string and skipped it otherwise. Seemed useful to me but maybe you felt this was still dangerous?

Thanks for trying this out on Python's netCDF4 too. Good to know this is an issue also for them. Tagging @selipot who might know who to pass the information over to at NOAA ( the file is from https://www.aoml.noaa.gov/phod/gdp/hourly_data.php )

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Mar 15, 2022

Seemed useful to me but maybe you felt this was still dangerous?

Well, the PR solves your issue, but this related issue:

if the NetCDF variable is of type Int and the missing value is say 0.5 (which cannot be converted to a Int).

would still produce an error if the missing values would be converted.

@gaelforget
Copy link
Contributor Author

if the NetCDF variable is of type Int and the missing value is say 0.5 (which cannot be converted to a Int).
would still produce an error if the missing values would be converted.

I see. We could do a try and catch for the non-string case. I don't have a way to test the error you imagined but am happy to edit the PR to this end. Should I do that?

@selipot
Copy link

selipot commented Mar 15, 2022

@philippemiron may have some insight

@Alexander-Barth
Copy link
Member

Thanks for your help Gael, but your PR is no longer needed. For a numeric array, the missing value should also be numeric (not a string). NCDatasets can now read the file (but it doesn't attempt to parse the string attributes, which would not be possible anyway in this case).

@rabernat
Copy link

Has someone contacted the dataset owner and told them that they are serving a malformed dataset?

Speaking from the experience of an xarray dev, you can potentially go down many rabbit holes trying to deal with malformed datasets--special cases and workarounds in your code, etc. This may be worthwhile in some cases. But it's also important to push back on the data providers who serve these datasets.

By silently ignoring the missing_value attribute if its data type is wrong, you create the possibility of new bugs for downstream code that assumes that missing_value is being handled. Perhaps the best solution is to emit a warning.

@selipot
Copy link

selipot commented Mar 23, 2022

I am in touch with the GDP DAC to get this fixed. Bear with me folks!

Alexander-Barth added a commit that referenced this issue Mar 24, 2022
@Alexander-Barth
Copy link
Member

By silently ignoring the missing_value attribute if its data type is wrong, you create the possibility of new bugs for downstream code that assumes that missing_value is being handled. Perhaps the best solution is to emit a warning.

In fact, in the current version (v0.12.2) of NCDatasets, there is no code that actively ignores string missing_values for numeric data types. The comparison (1 == '1') is just always false. In Julia, Python and C++ (for characters) this is also false without error (there is only an error in C++ when comparing numbers and strings). So strictly speaking, the missing value is "handled" (😀) in the sense that every element is compared to the string missing value in the case, but the result it always false.

Luckily, missing_value is deprecated (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch02s05.html):

The missing_value attribute is considered deprecated by the NUG and we do not recommend its use. However for backwards compatibility with COARDS this standard continues to recognize the use of the missing_value attribute to indicate undefined or missing data.

The consistency of _FillValue data type is enforced by the library.

In any case, I agree that a warning would be nice. For the file above you will see a warning like this:

┌ Warning: variable 'longitude' has a numeric type but the corresponding missing value ("-1.e+34f") is 
a character or string. Comparing, e.g. an integer and a string (1 == "1") will always evaluate to false.
See the function NCDatasets.cfvariable how to manually override the missing_value attribute.
└ @ NCDatasets ~/.julia/dev/NCDatasets/src/cfvariable.jl:363

Thanks a lot to to @selipot for your reactivity !

@rabernat
Copy link

Thanks a lot for the clarification. Makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants