Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Usage of Enums instead of flag #238

Closed
bzah opened this issue May 9, 2023 · 15 comments
Closed

Usage of Enums instead of flag #238

bzah opened this issue May 9, 2023 · 15 comments
Labels
question Further information is requested or discussion invited

Comments

@bzah
Copy link

bzah commented May 9, 2023

I could not find in the current CF convention or in the discussions here if netCDF4's Enums are forbidden or not. I know that CF currently recommend to use flag_meanings|flag_values|flag_mask (see here or there), but there is a dedicated mean to define this kind of values in netCDF4 through Enum.

In short: I would like to know: is it CF valid to use netCDF4's Enum instead of a combination of flag_meanings and flag_values to describe an enumerated type ?

For reference:

Edit After some archeology (thanks Lars!), there are existing issues about this topic:

@bzah bzah added the question Further information is requested or discussion invited label May 9, 2023
@JonathanGregory
Copy link
Contributor

Thanks for the question, @bzah. I think the answer is that it's not CF-compliant to use netCDF4 Enums at the moment. CF has not yet adopted all features which were introduced in netCDF4. Of course it can be discussed whether Enums should be supported by CF. Probably it would be similar to flag_meanings and flag_values for the data model. I expect the main question to consider is whether the advantage of permitting Enum as an alternative to flag_meanings and flag_values justifies the cost to software and data-users. Principle 10 in sect 1.2 is about this kind of question:

Because all previous versions must generally continue to be supported in software for the sake of archived datasets, and in order to limit the complexity of the conventions, there is a strong preference against introducing any new capability to the conventions when there is already some method that can adequately serve the same purpose (even if a different method would arguably be better than the existing one).

Jonathan

@samain-eum
Copy link

I think the answer is that it's not CF-compliant to use netCDF4 Enums at the moment.

The question is a bit more subtle than that and I will provide context below. It is clear that currently enumerations are not part of the convention. So we cannot expect any reading tool relying on CF to do anything with the enumeration data type information (raw data decoding is handled by the base library though, so it just appears like a normal variable).

An enumerated variable may have additional flag_values and flag_meanings attributes, in which case it seems it would be CF compliant. It's an inelegant solution though as it would result in duplicating the information, with the risk of unconsistencies.

The use case we are having now at EUMETSAT, is that we have netCDF4 products using a lot of enumerations as this was deemed useful regardless of the CF-compliance aspect. Considering that changing formats at this stage would be massively inconvenient, we are trying to identify the best route to make them more CF-friendly in a pragmatic way. One way is to add the missing attributes (which won't break our existing interfaces). Another one is to adapt the tools we make available to our users (e.g. python readers) to be able to parse the enumeration datatype and create the same metadata as if the CF attributes were present. The latter case is more of a short term workaround though and the first one might be needed for compatibility with other tools.

@bzah
Copy link
Author

bzah commented Jun 2, 2023

Could this be a topic for the cf-workshop in October ?

@davidhassell
Copy link
Collaborator

I think this sounds like a good topic for the workshop.

@bzah
Copy link
Author

bzah commented Oct 11, 2023

We discussed this issue during the last CF workshop.
Tagging people from the workshop @MTG-Formats, @JonathanGregory, @davidhassell, @czender, @DocOtak
It seems we agreed that Enums can be added to the convention.
This would require to:

  • Review the whole convention and figure out what would be the impact of Enums for each chapter.
  • Add a section to describe Enums themselves.
  • Add a link to this section and explanations in flags chapter to describe how Enums can be an alternative to flag_meanings and flag_values.

Additionally we would like to see the support for Enums improved in various tools (The following is merely a wish list, not requirements):

  • In netCDF C, we would like to be able to read variables that are normally typed with an Enum without actually reading the Enums and instead reading the underlying dtype.
    The will be helpful for libraries that do not yet support Enums, so that they can read these variables.
  • In netCDF C, we would like to see a proper workaround for the issue of _FillValue with Enums (see Incorrect fill value being used for enums (was: ncdump failure on enum file: dumplib.c; line 952) Unidata/netcdf-c#982 (comment))
    At the moment, when creating a variable without setting the _FillValue, the variable may be filled with the default _FillValue of the underlying dtype (e.g. 255 for a ubyte).
    This is an issue for Enums because this default _FillValue may not be a valid Enum value and will therefore crash even a standard tool like ncdump.
  • Support the implementation or raise feature request for Enums in most commonly used tools.

Notes:

  • It would not be possible (or very difficult?) to localize Enums.
  • Setting manually a FillValue for Enums may be performance intensive.
  • Yet, we currently recommend to:
    • Have a value prepared for _FillValue in the Enum definition (like "Missing": 0)
    • Set the fill_value explicitly when creating the variable.

Please feel free to comment, I might have missed/forgotten something.

@JonathanGregory
Copy link
Contributor

Dear Abel @bzah

Thanks for the useful summary. My impression from the discussion is that Enums are mostly a help for data-writers, to ensure they write the correct data. You have adopted them for that reason, and because the possible values and their meanings, for each Enum type, are defined only once in the file, not for each data variable, like for flag_values and flag_meanings.

For data-readers, and software that supports reading the data, Enums are more work. To be honest, I am not convinced that there is a strong enough case for adding them to CF, given that we already have a satisfactory method for the same thing, although I agree that we would probably have used Enums instead of inventing flags, if they had been available at the time. Of course, I'm aware that you have a lot of data which uses Enums, and you want it to be CF-compliant. Also, people have often asked, on general grounds, for features that came in with NetCDF-4 to be allowed in CF.

Because of the extra complexity, I think this point from our discussion is very important:

In netCDF C, we would like to be able to read variables that are normally typed with an Enum without actually reading the Enums and instead reading the underlying dtype.

If this were allowed, existing software could read the Enum data variable without being aware of Enums. If the Enum data variable also had the flag_values and flag_meanings attributes, it could be interpreted by existing software. In this way, Enum in CF would be a device to make sure flag values are written correctly and consistently within a file, and we would not need a new convention for interpreting them.

I wonder what the prospects are for the netCDF lib being altered in this way.

Cheers

Jonathan

@ChrisBarker-NOAA
Copy link

because the possible values and their meanings, for each Enum type, are defined only once in the file, not for each data variable, like for flag_values and flag_meanings.

I think it this is a more compelling argument for inclusion than either the data checking or ease for writers. It would could be helpful for consumers of the data as well to have it defined in one place. (that is, of there are use-cases -- I've never seen the same flag values used in multiple variables, but that's sut my limited experience)

In netCDF C, we would like to be able to read variables that are normally typed with an Enum without actually reading the Enums and instead reading the underlying dtype.

I'm confused -- can't you do that now? if it's stored as an int with flag-value, it can be read and cast to an Enum, yes?

At least I'm pretty sure that's what we do in some of our code now.

@JonathanGregory
Copy link
Contributor

Dear Chris

In netCDF C, we would like to be able to read variables that are normally typed with an Enum without actually reading the Enums and instead reading the underlying dtype.

I'm confused -- can't you do that now? if it's stored as an int with flag-value, it can be read and cast to an Enum, yes?

At least I'm pretty sure that's what we do in some of our code now.

I have no experience of this myself, but our discussion group understood that the API would not let you obtain an Enum netCDF variable with the same function as you use for an integer netCDF variable, even though that is actually how it's stored on disk. That's my understanding of what was said, anyway. Abel @bzah or others more experienced can correct me!

Best wishes

Jonathan

@ChrisBarker-NOAA
Copy link

Ahh -- got it. I'm getting out of my area of expertise, but I think you can do essentially anything with C, but it would require custom code not built in to the netcdf C lib -- that makes sense. That could be a nice utility, yes.

@JonathanGregory
Copy link
Contributor

JonathanGregory commented Oct 31, 2023

Ideally, we were thinking, existing code would require no change to read Enums. If I understand correctly, that implies a change in the netCDF lib or API. If that were done, and if Enum variables also had flag_values and flag_meanings, it would mean that existing code would be able to read and interpret Enum variables without even being aware that they were Enums.

@czender
Copy link

czender commented Oct 31, 2023

If a client code that know nothing about the netCDF4 enum type enounters a file that stores an enum variable, then it will probably fail if it tries to read the enum. The reason is that the client would first need to use the API to figure out what the base type (integer, unsigned int, byte, etc) of the enum is so that the client can allocate enough memory before getting the variable. In language like C, all this needs to be done explicitly through the API. A smart interpreted language (e.g., Python) wrapper would do all this memory querying and allocation automatically but would probably still fail unless it had been specially coded to retrieve and treat the enum as the underlying base type rather than as a netcdf4 enum. My guess is that the functionality @JonathanGregory is wondering about would require new API functions e.g., nc_get_var_as_base_type_rather_than_user_defined_type(), and existing code would have to be recoded to always use those API functions rather than the default nc_get_var() functions.

@JonathanGregory
Copy link
Contributor

Dear Charlie @czender

Thanks for your explanation. That's helpful. Yes, I was hoping (in my comment here, and in the discussion) that no change would be required to existing code. I suppose that means changing the effect of the existing get functions so that an Enum is fetched as its storage type. Then a different API enquiry would be needed to discover if this basic type is being used as an Enum.

However, if only a minimal change to existing code is required, such as you suggest, that would not be too bad as a price for supporting Enums in a minimal way.

Best wishes

Jonathan

@czender
Copy link

czender commented Nov 1, 2023

@JonathanGregory After further thought, I need to partially walk back my opinion (above) from yesterday. If a client program is "cautious enough" to allocate enough memory to hold the base type values of a netCDF4 enum variable, then it might succeed without any code modifications. Yesterday I said the program must know the base type of the enum. However, if the client just blindly reads the enum the standard nc_get_var() API and provides enough memory then it might work assuming that the netCDF library does not do a safety check that matches the type of the on-disk variable against the type of the buffer that the variable will be read into (the API requires specifying the type of destination buffer and it might use this information for safety checks). Whether netCDF does such a check, I don't know. Someone would need to test this to be sure. The library could, under the hood, just prevent the API from reading enums directly into integer types. Of course the attempt will definitely be buggy and likely fail if the client does not provide enough memory to hold the base type of the enum, e.g., if the client expects unsigned int32 and the base type is actually unsigned int64. Anyway, the situation is not quite as pessimistic as I thought yesterday.

@PythonCHB
Copy link

Of course the attempt will definitely be buggy and likely fail if the client does not provide enough memory to hold the base type of the enum, e.g., if the client expects unsigned int32 and the base type is actually unsigned int64.

This strikes me as simply the nature of C :-)

However, it seems the question, to some extent, is to what degree the netcdf-C lib should support CF out of the box -- I have no idea what Unidata thinks of this. Is there a C CF lib out there?

Of course, Charlie has done a lot in NCO -- which is great!

@bzah
Copy link
Author

bzah commented Apr 16, 2024

I haven't been working on this topic as much as I should have, but from my understanding Enums do not yet meet the criteria to be added to CF.

For this, we would need to have:

  1. Good support from the netcdf reading libraries to read and write Enums.
  2. Have mechanisms in these library to seamlessly translate flag_values/flag_meanings into Enum and vice-versa.
  3. Have more users, and not only one data-writer, who which to use Enums.

[1] and have been met in the python xarray package, but this is merely one library and it required quite some work to get it merged. [2] was initially on the road plan for the same xarray PR, but we thought it would fit better on cf-xarray, so it's not yet implemented.
And, I suppose until [3] is reached, that is until more people are interested in using Enums, other libraries contributors will not spend the resources to implement such feature.

So, I suggest we close this issue for now, perhaps to revisit if Enums gain interest from other parties (or if EUMETSAT is willing to invest a lot in open-source contributions to make [1] and [2] possible @MTG-Formats)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested or discussion invited
Projects
None yet
Development

No branches or pull requests

7 participants