Support of groups in CF #144

Closed
erget opened this issue Aug 3, 2018 · 58 comments · Fixed by #145
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format

Comments

@erget
Member

erget commented Aug 3, 2018

As discussed in CF meeting on 20 June 2018 in Reading, UK, we'd like to add support for the use of groups in CF files. Charlie and I have drafted an appropriate pull request containing the suggestions we'd like to implement.

Basically, the idea is to allow elements in CF files to refer to others which are not in the same group. The proposal defines ways of locating variables in this situation and tries to capture current ways of doing so "in the wild".

I'll update this issue with further material on this (Pull Request reference, summary presentation discussed at the meeting last June, and the proposal as drafted up till that point) as soon as the PR has a URL.

We haven't updated the conformance rules and checker yet, as we want to agree on the content first.

@erget erget mentioned this issue Aug 3, 2018
2 tasks
@erget
Member Author

erget commented Aug 3, 2018

Further resources:

@JonathanGregory
Contributor

This issue is implemented in #145

@JonathanGregory
Contributor

Dear Daniel and Charlie

Thanks for this proposal. I think it's a sensible way to support groups, and I agree with it, except that I'm not convinced about the lateral search. You don't sound convinced about it either, in saying that it's allowed for backward-compatibility and may be deprecated in future. Why not deprecate it now (meaning that the CF-checker would give a warning)? Why allow it at all? Is there a more specific extra search rule you could provide instead of your general lateral algorithm to deal with the existing datasets you have in mind?

I think that at the meeting we talked about how you would flatten a file, if you didn't want groups. Could we include that aspect in the proposal? If it's reversible, that would be neat.

I feel that the text could be made simpler and easier to understand, and most of my suggestions below are along those lines. Although I think I get the spirit of your search rules, I have to say I don't properly follow them in detail. The text seems obscure to me - maybe you could say it more plainly, and diagrams would help, I expect.

  • In ch01 you propose a number of definitions. However, you don't propose to use these terms anywhere in the convention outside your new subsection of ch02. Hence I suggest you start your ch02 subsection with these definitions, instead of putting them in ch01, so they're in the place where the reader of ch02 will see them.

  • You haven't defined what you mean by ancestor group, sibling group, descendant group, identifier and element. Why not say "nearest" instead of "most proximal"? What does "nearest" mean in the description of "local apex group"?

  • For the first sentence of the new Groups subsection, I would suggest, "Groups provide a mechanism to structure data hierarchically within files." It doesn't seem especially "powerful" to me - it's just what you'd expect it to be! It's also not the only way, since hierarchy can equally well be done with directories.

  • Are "object" and "element" the same? If so, I'd use only one of these terms, and "element" seems better to me because it's not language-specific. Say what sort of thing can be an "element" - a dimension? a variable? of any role?

  • Say what an "identifier" is and in which contexts in CF it can occur. What is a "location"?

  • "Resolves to" is programmatic jargon - could you say this in more ordinary language?

  • Is the convention of paths to name groups and elements something that you are defining here as part of CF, rather than built into netCDF? I guess that it is, and if so, I would state explicitly that you are doing so, because it reads as though you assume the reader knows what is meant by a path to a group. As part of this, please explain how to interpret a path.

  • In a path, is . allowed? Is // allowed?

  • Does the restriction on out-of-group refs ("The dataset producer etc.") apply only to the case of absolute paths, or in all cases? If the latter, it should be stated before or after the three cases, I suppose.

  • In "Search by absolute path", what does "nearest dimension" mean?

  • In "Search by absolute path" you use the phrase "out-of-group", before its definition, which appears further down.

  • In "Search by proximity", say what "with no path" means. What's a "direct ancestor" - I mean, could an ancestor be "indirect"? I would combine the Note with the paragraph which follows, because in the Note it's not obvious what strategy you're talking about i.e. the "special case".

  • I think that you shouldn't list the attributes concerned, because this is redundant within the convention. This information is in Appendix A. If it's repeated here, it'll probably become inconsistent. If you want, you could give just a single example from each list e.g. title, units.

  • Is the last paragraph saying that groups can have attributes, like global attributes? If so, please say that - "being present within" sounds vague. Does a group attribute override a global attribute? Are global attributes assumed (unless overridden) to apply to all variables in all groups? I suggest that this part should come in the same place as the text about global attributes, before the text about variable attributes. I think that (what is currently) the last sentence isn't necessary - that's implied by the previous sentence, isn't it?

I hope that helps. Best wishes

Jonathan

@erget
Member Author

erget commented Aug 15, 2018

Dear Jonathan,

Thanks for reviewing this.

Inline responses

I'm responding to your comments inline and abridging as I determine relevant; if there's anything you're missing, let me know.

... I'm not convinced about the lateral search. You don't sound convinced about it either, in saying that it's allowed for backward-compatibility and may be deprecated in future. Why not deprecate it now (meaning that the CF-checker would give a warning)? Why allow it at all? Is there a more specific extra search rule you could provide instead of your general lateral algorithm to deal with the existing datasets you have in mind?

I have to agree with you on this, and this is in line with my understanding of the discussion at the meeting in Reading. @czender could you say a few words on this?

I think that at the meeting we talked about how you would flatten a file, if you didn't want groups. Could we include that aspect in the proposal?

@czender has some nice material on this in the original proposal which I omitted for brevity. Would you propose importing this section (possibly in edited form) into the pull request? I struggled with not adding too much text to the Conventions themselves, but we could include it as an appendix chapter. I do worry about overloading people new to the game with text.

The text seems obscure to me - maybe you could say it more plainly, and diagrams would help, I expect.

Diagrams are always good. In the original proposal we had:

Diagram of search algorithm

To my knowledge it would be the only diagram in the whole of the Conventions. As a new contributor I didn't want to snub current practices, but I see no problem with including it, if the community is happy with that.

In ch01 you propose a number of definitions. However, you don't propose to use these terms anywhere in the convention outside your new subsection of ch02. Hence I suggest you start your ch02 subsection with these definitions, instead of putting them in ch01, so they're in the place where the reader of ch02 will see them.

I don't really agree with this - having multiple glossaries scattered throughout the document only makes the terms easy to find if everybody is reading them in electronic form, and in that case we could just as well define terms next to their first use. So I prefer to keep all definitions together, as has been done for all the other definitions thus far.

You haven't defined what you mean by ancestor group, sibling group, descendant group, identifier and element. Why not say "nearest" instead of "most proximal"? What does "nearest" mean in the description of "local apex group"?

I think it would be a good idea to add these to the glossary. @czender, do you have an opinion on "nearest" vs. "most proximal"?

Are "object" and "element" the same? If so, I'd use only one of these terms, and "element" seems better to me because it's not language-specific. Say what sort of thing can be an "element" - a dimension? a variable? of any role?

@czender I see no difference here and propose adopting "element". Please correct me if I'm overlooking something.

Is the convention of paths to name groups and elements something that you are defining here as part of CF, rather than built into netCDF? I guess that it is, and if so, I would state explicitly that you are doing so, because it reads as though you assume the reader knows what is meant by a path to a group. As part of this, please explain how to interpret a path.

I understand what you're saying here, but since we say that paths are treated as UNIX paths, I think it would be redundant to describe how to follow a path according to the UNIX conventions. Do you think it would be valuable to link to the POSIX pathname resolution specification? The same comment applies to your following bullet point, which I have omitted.
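As a rough illustration of what "treated as UNIX paths" could mean in practice, the sketch below uses Python's standard `posixpath` module to turn a reference into an absolute location. The function name `resolve_reference` and the group names are hypothetical, and whether CF would accept `.` or `//` in a path is still an open question in this thread; the code only shows what POSIX-style normalization does with them. A bare name with no path is not covered here, since the proposal handles that case by search rather than by path resolution.

```python
import posixpath


def resolve_reference(referring_group: str, reference: str) -> str:
    """Return the absolute location a path-style reference points to.

    Hypothetical helper: referring_group is the group holding the
    variable that makes the reference, reference is the path it names.
    """
    # posixpath.join discards referring_group when reference is absolute.
    joined = posixpath.join(referring_group, reference)
    # normpath collapses ".", ".." and repeated slashes, Unix-style.
    return posixpath.normpath(joined)


# A relative path, an absolute path, and a path using "." and "//".
relative = resolve_reference("/g1/g2", "../lat")
absolute = resolve_reference("/g1/g2", "/g1/lat")
messy = resolve_reference("/g1/g2", "./sub//lat")
```

Under these assumptions the relative and absolute forms name the same variable, which is the behaviour a reader unfamiliar with Unix would need spelled out.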

Does the restriction on out-of-group refs ("The dataset producer etc.") apply only to the case of absolute paths, or in all cases? If the latter, it should be stated before or after the three cases, I suppose.

You're right, this should be noted at the beginning of the chapter, since it applies to all out of group references. For our reference I note that it's in ch02.adoc, line 190.

In "Search by proximity", say what "with no path" means. What's a "direct ancestor" - I mean, could an ancestor be "indirect"? I would combine the Note with the paragraph which follows, because in the Note it's not obvious what strategy you're talking about i.e. the "special case".

Agreed, I think we can describe this more clearly in the next update.

I think that you shouldn't list the attributes concerned, because this is redundant within the convention. This information is in Appendix A. If it's repeated here, it'll probably become inconsistent. If you want, you could give just a single example from each list e.g. title, units.

I agree with that too.

Is the last paragraph saying that groups can have attributes, like global attributes? If so, please say that - "being present within" sounds vague. Does a group attribute override a global attribute? Are global attributes assumed (unless overridden) to apply to all variables in all groups? I suggest that this part should come in the same place as the text about global attributes, before the text about variable attributes. I think that (what is currently) the last sentence isn't necessary - that's implied by the previous sentence, isn't it?

I agree that precedence should be clearly stated when group attributes conflict with global attributes.

Next steps

We'll update the PR to address the points as noted above and ping again when that's been done. In addition to the items noted explicitly, I propose placing the following new terms in the glossary:

  • Element (or object)
  • Identifier
  • Location
  • Resolves to - or change language
  • Nearest dimension
  • Ancestor, sibling, descendant group

@ethanrd
Member

ethanrd commented Aug 15, 2018

In response to Jonathan's question on paths (or full names)

Is the convention of paths to name groups and elements something that you are defining here as part of CF, rather than built into netCDF?

I just wanted to mention that full names are supported (to varying degrees) by the netCDF C and Java libraries and the ncdump and nccopy utilities. Documentation, as far as I've found, is a bit sparse and scattered.

In the C library, you can find a group given its full name, and with a group (ncid) you can get its full name. But that doesn't work for variables or dimensions. The Java library provides similar functionality for groups, variables, and dimensions. The ncdump and nccopy utilities support full paths in the -v vars option. The documentation also mentions relative paths, but it's really more partial path matching, e.g., g/v would match /g0/g/v, /g1/g/v, and /g/v.
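The partial path matching described above can be pictured with a small sketch. This is not the ncdump or nccopy implementation; `matches_partial_path` is a hypothetical function, and treating a leading slash as "match the full name exactly" is an assumption made for the illustration.

```python
def matches_partial_path(full_name: str, pattern: str) -> bool:
    """Return True if pattern matches the trailing components of full_name.

    Hypothetical illustration of partial path matching, compared
    component by component so that "g/v" does not match "/xg/v".
    """
    full_parts = [part for part in full_name.split("/") if part]
    pattern_parts = [part for part in pattern.split("/") if part]
    if pattern.startswith("/"):
        # Assumption: a leading slash means the whole name must match.
        return full_parts == pattern_parts
    if not pattern_parts:
        return False
    return full_parts[-len(pattern_parts):] == pattern_parts


names = ["/g0/g/v", "/g1/g/v", "/g/v", "/g0/h/v", "/xg/v"]
matched = [name for name in names if matches_partial_path(name, "g/v")]
```

With the names listed, the pattern g/v selects the first three entries and rejects the last two.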

@steingod

Given that, would that call for search by proximity as the only approach in section 5, Scope? Having all potential approaches complicates interpretation of the data and building services on top of them. Stating absolute or relative relations explicitly would be beneficial, although, being a C user, I see the point of all of them.

@JonathanGregory
Contributor

Dear Daniel

Thanks for your responses and comments. Answers to a few questions:

  • Yes, I think it would be fine to include Charlie's material on flattening and dismembering, except for the final paragraph about implementations, which isn't essential to the convention.

  • Including a proper diagram would be unprecedented, I agree, but I don't know a reason not to. Actually there are some tables which are like diagrams in chapter 9 - they used to be in colour, and were prettier like that.

  • If you're sure you want to put your definitions in ch01, then I'd suggest keeping them together all there, rather than interleaving them with other definitions, and give them a heading in ch01 indicating that they are relevant to groups. You could provide a reference to the later chapter on groups, and a backwards reference from there to the definitions. One reason for my original comment is that the considerable number of these group-related definitions, which are used only in one place in the document, seems unbalanced to me with respect to the existing definitions in ch01, which are used in multiple places.

  • Yes, I think it's OK to say that paths are like Unix paths. I must have missed that. Not every CF-netCDF user is a user of Unix, though, so it may still be useful to spell it out a bit.

  • It'll be interesting to hear Charlie's view on the lateral search.

I look forward to the next version. Thanks for your work.

Cheers

Jonathan

@czender
Contributor

czender commented Aug 21, 2018

All,
I missed the original posting of this 19 days ago and the PR itself (once it left my tree) 15 days ago because my GitHub tag (as opposed to my name) was not used in this issue/thread until last week, when I was on vacation. Long way of saying I'm still digesting what's been discussed and triaging other items and will try to respond here by next week. Thanks everyone for contributing, great to see us inching faster toward the finish line.
I think now that I've contributed to this thread, I've been automatically subscribed (as are all of you) and will be notified of updates. For the future, as CF moves to GitHub, I suggest that relevant people be listed in the "Assignees" list in the upper right of this page so they don't miss out on the discussion as it occurs in real time. Perhaps CF has established some other role for "Assignees", though.
Charlie

@JonathanGregory
Contributor

JonathanGregory commented Aug 22, 2018 via email

@cameronsmith1

It appears that these GitHub messages are going to the cf-convention mail list rather than the cf-metadata mail list, so perhaps @czender is not on the cf-convention mailing list?

@czender
Contributor

czender commented Aug 22, 2018

Yes, I only learned there was a cf-convention mail list (distinct from cf-metadata) yesterday :)

czender added a commit to czender/cf-conventions that referenced this issue Sep 7, 2018
Merge Daniel's Groups branch into my Groups branch on 20180907, preparatory to more work to submit and updated PR that address recent discussions at CF cf-convention#144
@czender
Contributor

czender commented Sep 22, 2018

FYI I'm finally making progress on this and expect to have a full response and updated PR next week. Thank you for your patience.

czender added a commit to czender/cf-conventions that referenced this issue Sep 26, 2018
@czender
Contributor

czender commented Sep 27, 2018

All, the changes described below were just submitted as a PR to Daniel's tree, which hopefully means they will modify #145 once accepted. I'm non-expert at Github, so perhaps that is inefficient. The original is viewable in my tree at https://github.com/czender/cf-conventions/tree/groups

I'm responding to your comments inline and abridging as I determine relevant; if there's anything you're missing, let me know.

... I'm not convinced about the lateral search. You don't sound convinced about it either, in saying that it's allowed for backward-compatibility and may be deprecated in future. Why not deprecate it now (meaning that the CF-checker would give a warning)? Why allow it at all? Is there a more specific extra search rule you could provide instead of your general lateral algorithm to deal with the existing datasets you have in mind?

I have to agree with you on this, and this is in line with my understanding of the discussion at the meeting in Reading. @czender could you say a few words on this?

I must use many words. I wish I had been at the Reading conference to say this in person:

I may be the only defender of the lateral search feature so it falls on me to make the case for it. The one word that best summarizes why I think lateral search would be good for CF is "User-base". Geoscience researchers use an enormous amount of satellite-retrieved data stored with groups by providers notably including NASA and ESA. These agencies use HDF5EOS or netCDF4 to store data from dozens of platforms/instruments (Aura, MLS, OMI, S5P, TES, etc) and quite often data fields are in sibling (not ancestor) groups to the coordinates. For example, S5P has geophysical variables in /BAND1_RADIANCE/STANDARD_MODE/OBSERVATIONS and their coordinates (latitude and longitude) in the sibling group /BAND1_RADIANCE/STANDARD_MODE/GEODATA. MLS on Aura has geophysical variables in /HDFEOS/SWATHS/O3NadirSwath/Data\ Fields and latitude and longitude in /HDFEOS/SWATHS/O3NadirSwath/Geolocation\ Fields. The list of other examples exceeds my energy to type it.

This state of affairs developed over many years, and indicates that dataset producers like having coordinates in sibling groups of data fields. I doubt that NASA/ESA/etc. will start putting coordinates in ancestor groups rather than sibling groups just because CF recommends it, and even if they do, there will remain a decades-high mountain of data with the current organization. If/once CF adopts this groups proposal, one of three things is likely to happen:

  1. If CF supports lateral search then these agencies can comply with CF and continue storing data with the same organization as before, and use lateral search to find out-of-group (OOG) coordinates, should they prefer lateral search over relative/absolute paths. Relative/absolute paths are, IMHO, fragile and susceptible to breakage in downstream processing so avoiding them is best for long-term dataset interoperability.
  2. If CF DOES NOT support lateral search then these agencies can comply with CF and continue storing data with the same organization as before only by using relative/absolute paths to refer to OOG coordinates. This is fragile and IMHO a mistaken approach for long-term dataset interoperability.
  3. NASA/ESA/etc will start to design future datasets so that OOG coordinates are not in sibling groups. Existing data will not be CF-compatible unless it already uses relative/absolute paths.

In my opinion it is best for users, data producers, and CF if some combination of (1) and (3) occurs, so that the sibling-oriented organization of existing datasets is "grandfathered in" to being CF-compliant by allowing lateral search, and producers start to deprecate the necessity of lateral search for future data products. Since many dataset producers have preferred lateral (rather than purely ancestral) dataset organization over the years and lateral search is the appropriate mechanism to resolve lateral associations, then CF should respect and support past dataset organization decisions and not impose relative/absolute paths as a requirement for earning CF-conformance for such datasets.
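To make the S5P example concrete, the sketch below checks that the GEODATA group is not among the ancestors of the OBSERVATIONS group, and shows what a relative-path reference (option 2 above) would have to look like. The helper `ancestor_groups` is hypothetical, and the group paths are the ones quoted in this comment; nothing here reads a real S5P file.

```python
import posixpath


def ancestor_groups(group: str) -> list:
    """Return the group itself followed by each ancestor up to the root."""
    chain = [group]
    while group != "/":
        group = posixpath.dirname(group)
        chain.append(group)
    return chain


data_group = "/BAND1_RADIANCE/STANDARD_MODE/OBSERVATIONS"
coord_group = "/BAND1_RADIANCE/STANDARD_MODE/GEODATA"

# A purely upward (ancestor) search from the data group never visits
# the sibling group that holds latitude and longitude.
found_in_ancestry = coord_group in ancestor_groups(data_group)

# Without lateral search, the reference must carry an explicit path.
relative_reference = posixpath.relpath(
    posixpath.join(coord_group, "latitude"), start=data_group
)
```

The relative reference produced this way depends on where both groups sit in the tree, which is the fragility under downstream restructuring that the comment argues against.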

You haven't defined what you mean by ancestor group, sibling group, descendant group, identifier and element. Why not say "nearest" instead of "most proximal"? What does "nearest" mean in the description of "local apex group"?

I think it would be a good idea to add these to the glossary. @czender, do you have an opinion on "nearest" vs. "most proximal"?

To me "nearest" and "most proximal" mean the same thing. "Nearest" seems more vernacular, perhaps too vernacular? Nevertheless, I changed to "nearest" in the current PR and added to the glossary.

Are "object" and "element" the same? If so, I'd use only one of these terms, and "element" seems better to me because it's not language-specific. Say what sort of thing can be an "element" - a dimension? a variable? of any role?

See immediately below.

@czender I see no difference here and propose adopting "element". Please correct me if I'm overlooking something.

I think there is no difference in the way we use "element" and "object" in the text for Groups proposal. However, "element" is already frequently used in CF to refer to elements of an array, definitely not the meaning that Groups intends to convey. In practice "objects" as used in the Groups proposal can only be variables (including coordinate variables, of course), and dimensions. In theory a group could itself be an "object", however the proposal does not at this point need to do that. Off-hand, I can't think of an instance where an out-of-group (OOG) attribute needs to be explicitly referred to. Attributes of OOG variables can be important, yet they are always referred to via the OOG variable, and there is never a direct reference to an OOG attribute. It's the variable's scope that matters, an attribute is always locally attached to a variable or group. Are there any counter-examples to this?

If not, I think it is clearer and more precise to eliminate the use of both "element" and "object" in the Groups proposal, and replace those words with what they actually stand for, i.e., "variable" and/or "dimension". I modified the PR to do that. Similarly, I eliminated "identifier". We could instead have defined "identifier" to mean "a variable or dimension". That would be possible. Nevertheless, I think it's clearer to just say "variable or dimension" everywhere in the text. A group could also be considered an "identifier", though this proposal does not need to do so.

Next steps
We'll update the PR to address the points as noted above and ping again when that's been done. In addition to the items noted explicitly, I propose placing the following new terms in the glossary:

Element (or object)
Identifier
Location
Resolves to
Nearest dimension
Ancestor, sibling, descendant group

The updated (by CZ) PR eliminates "element", "object", "identifier", and "Resolves to". The updated glossary now defines "Nearest dimension", "location", and Ancestor, sibling, and descendant groups.

@JonathanGregory
Contributor

Dear Charlie

If you're not an expert in GitHub, I'm an ignoramus, and I can't work out where to view the modifications you are proposing to the document. Perhaps you or someone else can lead me straight to the right place with a URL.

Your comments above are helpful. Thanks for making the changes. I can't remember the detail of the lateral search algorithm, which Daniel showed us at the meeting. It was hard to remember, I thought, suggesting it's not obvious. I appreciate your reasons for wanting to retain it, though. Is there a more general and easily memorable algorithm which would include the present lateral-search as a subset? For example, you could accept any variable of the right name, wherever it can be found in the entire file, if it's not found in the ancestry of the reference. That would imply that it should be an error to include more than one variable of that name in the file, and if there is more than one, there's no rule about which one to choose. The data-reader could implement any search algorithm they want to find an acceptable variable.
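A minimal sketch of the simpler rule suggested above, assuming a file is modelled as a dictionary from group path to the set of variable names in that group. The function name `find_by_name` and the tiny example trees are hypothetical, and raising an error on multiple candidates is one reading of "it should be an error to include more than one variable of that name".

```python
import posixpath


def find_by_name(groups: dict, start_group: str, name: str) -> str:
    """Look in the ancestry first, then accept a unique match anywhere."""
    group = start_group
    while True:
        if name in groups.get(group, set()):
            return posixpath.join(group, name)
        if group == "/":
            break
        group = posixpath.dirname(group)
    # Fallback: any variable of the right name, wherever it is in the file.
    candidates = sorted(
        posixpath.join(g, name) for g, variables in groups.items() if name in variables
    )
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        raise LookupError(f"{name!r} not found anywhere in the file")
    raise LookupError(f"{name!r} is ambiguous: {candidates}")


single = {"/": set(), "/DATA": {"o3"}, "/GEO": {"latitude"}}
merged = {
    "/": set(),
    "/in1": set(), "/in1/DATA": {"o3"}, "/in1/GEO": {"latitude"},
    "/in2": set(), "/in2/DATA": {"o3"}, "/in2/GEO": {"latitude"},
}
single_result = find_by_name(single, "/DATA", "latitude")
```

In the `single` tree the fallback finds the one latitude; in the `merged` tree the same call raises, because there is no rule for choosing between the two candidates.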

Best wishes

Jonathan

@czender
Contributor

czender commented Oct 5, 2018

Jonathan,

My CF-Groups "tree", i.e., my fork of CF with the Groups modifications in them, is viewable at https://github.com/czender/cf-conventions/tree/groups
The PR against Daniel's Groups tree, which will show the diffs to the current Groups proposal (the one discussed above) is at
erget#2
I think Daniel has been on vacation and has not merged them yet. If you want I can create a PR of my tree against the base CF master tree. That might be a simpler way to view the current status. Shall I?

@czender
Contributor

czender commented Oct 5, 2018

Jonathan,
Regarding your question about the lateral search algorithm, a succinct way of remembering it is "search upwards to the apex, from there search across going downwards". Your suggestion about using any variable with the right name if it can't be found works well for simple trees. I don't know of an existing counterexample off-hand. However, it's easy to imagine one. For example, it would perform poorly in any tree where there are multiple coordinate variables of the same name (e.g., latitude) all stored in sibling groups as is often the NASA/ESA style. Only the in-scope latitudes (i.e., those located beneath the apex group) will have the right latitude dimension. If the desired latitude were in a sibling group, and there were other coordinate systems in the file, then the search would often return an incorrect latitude variable.
Here is another way to visualize the capabilities that would be lost by omitting lateral search from CF: Consider an existing NASA MLS, ESA S5P, or other dataset that places coordinates in sibling groups, not ancestor groups. The dataset, in1.nc, would work well with your suggested simpler algorithm. However, combining that dataset with a second dataset in2.nc with ncecat in1.nc in2.nc out.nc would perform poorly with the simpler algorithm, though it would work fine with the lateral search algorithm.
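One possible reading of "search upwards to the apex, from there search across going downwards" is sketched below. It is an assumption, not the text of the PR: at each step upward, the subtree beneath the current group is searched breadth-first, so the first group under which the name is found acts as the local apex. The sorted order used for "across" is an arbitrary choice made here, and the function names and the merged tree (two inputs stored as top-level groups /in1 and /in2) are hypothetical.

```python
import posixpath
from collections import deque


def children(groups: dict, parent: str) -> list:
    """Return the direct child groups of parent, in sorted order."""
    return sorted(g for g in groups if g != "/" and posixpath.dirname(g) == parent)


def search_downwards(groups: dict, top: str, name: str):
    """Breadth-first search of top and every group beneath it."""
    queue = deque([top])
    while queue:
        group = queue.popleft()
        if name in groups[group]:
            return posixpath.join(group, name)
        queue.extend(children(groups, group))
    return None


def lateral_search(groups: dict, start_group: str, name: str):
    """Climb one group at a time, searching each subtree on the way up."""
    group = start_group
    while True:
        found = search_downwards(groups, group, name)
        if found is not None or group == "/":
            return found
        group = posixpath.dirname(group)


merged = {
    "/": set(),
    "/in1": set(), "/in1/DATA": {"o3"}, "/in1/GEO": {"latitude"},
    "/in2": set(), "/in2/DATA": {"o3"}, "/in2/GEO": {"latitude"},
}
from_in1 = lateral_search(merged, "/in1/DATA", "latitude")
from_in2 = lateral_search(merged, "/in2/DATA", "latitude")
```

Under this reading each data group pairs with the latitude from its own input file, because the search stops at /in1 or /in2 before it ever reaches the root.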

@dblodgett-usgs
Contributor

As an FYI, if you want to compare two branches or forks, you can view a pull request without actually creating it for others to see and comment on. In this case, to compare @czender's "groups" branch against the current cf-conventions master, go to "Pull Requests" in either repository, click "New Pull Request", select "compare across forks", and choose the fork and branch(es) you want to compare. You then get a comparison page with a URL like this:

https://github.com/cf-convention/cf-conventions/compare/master...czender:groups

If you scroll down you can switch to "Rich Diff" with the button that looks like a page in the upper left of a given asciidoc document.

As long as you don't click "Create Pull Request" -- you are able to just look over the diffs.

Carry on.

@erget
Member Author

erget commented Oct 8, 2018

Sorry about the long silence everyone, I've been traveling for the past several weeks and simply hadn't seen the notifications.

As Charlie noted, he's made several changes to address the points we've been discussing and I've incorporated them into the Pull Request. You can see the full set of changes here and the changes relative to the original PR here.

As @dblodgett-usgs notes the rich diff function is really helpful for getting an overview of the changes.

I have the impression that we're nearing convergence. Is this shared by the rest of the group?

I'd like to add my thoughts to what @czender has written concerning the lateral search: Although I do think that it is a bit confusing, especially for people who are used to object-oriented paradigms using inheritance and classes, there is a plethora of data in the wild which implements this and I think that it's good to describe how things should be found. I also agree that there are many data out there which reuse variable names when referencing coordinates, and the amount of such data is set to grow. It is likely that EUMETSAT is producing such data, as we will be disseminating products which contain observations from multiple instruments, all of which use different viewing geometries. Therefore they would have different lat and lon in the same file. Lateral search won't be needed in order to find the coordinates, but it's nonetheless a good example of multiple uses of the same names. So in my view the current PR, with the new edits from @czender, does a good job of finding the middle ground and ensuring readability across data from the past, present and the future.

@JonathanGregory
Contributor

Dear Daniel and Charlie

Thanks to you for the changes, and to Dave for reminding me where to find the "rich diff". I had forgotten which icon it is. Here are some comments on the latest version.

  • Thanks for the extra definitions in chapter 1. I feel that "element" should not be included here because it's used in other senses elsewhere (for instance, in "elements of an array" in the "CDL syntax" entry of sect 1.2). It would be fine to use it in this sense in the group discussion, if you define it specifically there with that meaning.

  • "Local apex group" as defined here refers to the common ancestor starting from two different places, but that's not quite the sense in which you use it in chapter 2, it seems. Where you discuss "lateral search", "local apex group" appears to mean the very top of the tree. You search upwards until you hit the top, and if you haven't found it, you start the lateral search, top-down.

  • "Nearest" is defined as being according to the rules for finding things (in chapter 2). But in those rules, you use "nearest" to say how to find things! This is circular. (A means "see B" and B means "see A".)

  • In chapter 2, I am puzzled by the last sentence of "search by absolute path", which says, "Thus a variable /g2/temperature with coordinates attribute containing /g1/lat is permissible so long as there is no other lat dimension defined 'in between' those locations, e.g., /lat." I thought absolute meant absolute, but you're making it conditional. That's not what a Unix absolute path means. Could you explain what you mean here, please? I don't understand the preceding sentence either.

  • Search by proximity depends on a definition of "nearest" (see comment above). I think by "nearest" you mean the one which is reached by the smallest number of upward steps. Because downward steps aren't allowed this isn't the usual meaning of "nearest", so it's worth being clear. Usually you would say that a daughter was a near relation of a mother, but here a daughter is entirely discounted.

  • Lateral search. There is no problem with having many variables in a file with the same name if they're in different groups. As Daniel says, by itself that doesn't require the lateral search algorithm. As currently stated, the lateral search algorithm doesn't look well-defined, because it says it searches "across" each level, but doesn't specify what collating order "across" means. No doubt that could be defined as part of the algorithm. I understand that you want to allow this algorithm because there are many existing datasets which depend on it, but (a) the existing datasets aren't ever going to be CF datasets, unless they're all rewritten to insert an appropriate Conventions attribute; (b) for new datasets, couldn't the data-writers use a better convention than they currently do? In many applications, CF compliance means doing things in a slightly different way to before. If they aren't willing or able to change to absolute/relative/nearest paths, maybe you could define lateral search as a subconvention which they could use but most CF data-writers would not?

  • I have a fairly large number of reservations about the proposed Appendix I. Partly this is because I don't understand some of it, partly because I feel that it goes beyond what CF normally does in being prescriptive. If you're keen to include this in the convention, I think that more discussions in detail will be needed.

I am keen on the approach in general and most of the details! I hope we will agree soon. Thanks for your efforts and patience.

Best wishes

Jonathan

@davidhassell davidhassell added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label Oct 10, 2018
@erget
Member Author

erget commented Oct 10, 2018

Hi Jonathan,

Thanks for this input - I share your sentiment and also hope that we can reach a positive conclusion soon. I'll respond to your points inline below. All of these changes are reflected in the updated PR.

Thanks for the extra definitions in chapter 1. I feel that "element" should not be included here because it's used in other senses elsewhere (for instance, in "elements of an array" in the "CDL syntax" entry of sect 1.2). It would be fine to use it in this sense in the group discussion, if you define it specifically there with that meaning.

I see what you mean about the use of "element" in other parts of the Convention, particularly in reference to individual items within an array. I've reviewed our proposal and don't see a reason to define "element" specifically, and thus I've removed the definition from the glossary.

"Local apex group" as defined here refers to the common ancestor starting from two different places, but that's not quite the sense in which you use it in chapter 2, it seems. Where you discuss "lateral search", "local apex group" appears to mean the very top of the tree. You search upwards until you hit the top, and if you haven't found it, you start the lateral search, top-down.

This is true. I've also modified the wording in that section of the proposal; you can find it in the updated version. @czender please correct this if I have misunderstood how the lateral search works.

"Nearest" is defined as being according to the rules for finding things (in chapter 2). But in those rules, you use "nearest" to say how to find things! This is circular. (A means "see B" and B means "see A".)

I hope to have resolved this by changing the "Search by proximity" text as follows:

"A variable or dimension specified with no path is the variable or dimension of the same name which can be found via a search of direct ancestor groups with the shortest number of intermediate groups.
For example, a coordinates attribute of lat refers to the lat variable (if any) in the present group.
If the variable or dimension is not in the referring group then it is termed an out-of-group reference.
Out-of-group references are resolved by searching all direct ancestors, starting from the direct ancestor and proceeding toward the root group, until the specified variable or dimension is found."

Do you find this clearer?
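
As an illustration only, the revised "Search by proximity" wording could be sketched as follows. The `Group` class and `find_by_proximity` function are hypothetical names invented for this sketch (they are not part of netCDF or any CF tool), and the group tree is modelled with plain Python objects rather than a real file.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Group:
    """Hypothetical stand-in for a netCDF group: a name, a parent, and variable names."""
    name: str
    parent: Optional["Group"] = None
    variables: set = field(default_factory=set)

    def path(self) -> str:
        """Return the absolute path of this group, e.g. /g1/g2 ("/" for the root)."""
        if self.parent is None:
            return "/"
        prefix = self.parent.path()
        return prefix + self.name if prefix == "/" else prefix + "/" + self.name


def find_by_proximity(start: Group, name: str) -> Optional[Group]:
    """Look in the referring group first, then in each direct ancestor up to the root.

    Child and sibling groups are never visited, so the first match is the one
    reached with the fewest upward steps.
    """
    group = start
    while group is not None:
        if name in group.variables:
            return group
        group = group.parent
    return None
```

For a tree in which both the root group and `/g1/g2` hold a `lat` variable, a reference to `lat` from `/g1/g2` resolves in `/g1/g2` itself, while the same reference from `/g1` is an out-of-group reference that resolves in the root group.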

In chapter 2, I am puzzled by the last sentence of "search by absolute path", which says, "Thus a variable /g2/temperature with coordinates attribute containing /g1/lat is permissible so long as there is no other lat dimension defined 'in between' those locations, e.g., /lat." I thought absolute meant absolute, but you're making it conditional. That's not what a Unix absolute path means. Could you explain what you mean here, please? I don't understand the preceding sentence either.

I see what you mean here; actually, I think this is just a byproduct of text being moved around a lot. I've changed those three sentences to read:

"When referencing out-of-group variables, it is the responsibility of the dataset producer to ensure that the dimension(s) utilized by the out-of-group variable are the same as those used by the referring variable."

The intent here is to make sure that you don't refer to e.g. an auxiliary coordinate variable which uses different dimensions than the referring variable. If a data producer were to do this, it would not be possible to match the coordinates to the variable calling them up; the original text was designed to prevent this. However, the two sentences you referred to were in the wrong spot, and in my opinion they're not needed, as the scoping rules clearly explain how that should work, so I've deleted them and moved the whole block into the "Scope" section.

Search by proximity depends on a definition of "nearest" (see comment above). I think by "nearest" you mean the one which is reached by the smallest number of upward steps. Because downward steps aren't allowed this isn't the usual meaning of "nearest", so it's worth being clear. Usually you would say that a daughter was a near relation of a mother, but here a daughter is entirely discounted.

Right. I think that the newest version is clearer and reflects this unambiguously.

Lateral search. There is no problem with having many variables in a file with the same name if they're in different groups. As Daniel says, by itself that doesn't require the lateral search algorithm. As currently stated, the lateral search algorithm doesn't look well-defined, because it says it searches "across" each level, but doesn't specify what collating order "across" means. No doubt that could be defined as part of the algorithm. I understand that you want to allow this algorithm because there are many existing datasets which depend on it, but (a) the existing datasets aren't ever going to be CF datasets, unless they're all rewritten to insert an appropriate Conventions attribute; (b) for new datasets, couldn't the data-writers use a better convention than they currently do? In many applications, CF compliance means doing things in a slightly different way to before. If they aren't willing or able to change to absolute/relative/nearest paths, maybe you could define lateral search as a subconvention which they could use but most CF data-writers would not?

@czender I'll pass this to you, as the King of Lateral Search :)

I have a fairly large number of reservations about the proposed Appendix I. Partly this is because I don't understand some of it, partly because I feel that it goes beyond what CF normally does in being prescriptive. If you're keen to include this in the convention, I think that more discussions in detail will be needed.

What would you suggest for the appendix? Correct me please if I'm wrong, but I don't think that the appendix is prescriptive - the title is "Recommended Practices and Annotated Examples for Using Groups" so I see no danger that people would see this as a prescription. At the last meeting in Reading my feeling was that the community felt that having examples as guidance for different data types would be helpful; we've included them in the appendix with the intent of providing such examples without being prescriptive. I'm happy to discuss this in greater detail, and I don't want to disregard your reservations, but I need more information about what they are in order to try to address them :)

Best regards,
Daniel

@JonathanGregory
Contributor

Dear Daniel

Thanks. This is much better. "Search by absolute path" should start with "A" not "An". In "Search by relative path" I would say "(i.e. containing a slash but not with a leading slash e.g. child/lat)".
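
A minimal sketch of how the three kinds of reference could be told apart from the string alone, following the wording above (leading slash, slash but no leading slash, no slash). The function name `classify_reference` and the returned labels are assumptions made for this illustration.

```python
def classify_reference(ref: str) -> str:
    """Classify a variable or dimension reference by the search rule it would trigger."""
    if ref.startswith("/"):
        return "absolute"    # e.g. /g1/lat
    if "/" in ref:
        return "relative"    # contains a slash but no leading slash, e.g. child/lat
    return "proximity"       # no path at all, e.g. lat
```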

I think "Search by proximity" can be stated more concisely and perhaps more clearly with less repetition, thus:

"A variable or dimension specified with no path (for example, lat) refers to the variable or dimension of that name, if there is one, in the referring group. If the variable or dimension is not in the referring group then it is termed an out-of-group reference. Out-of-group references are resolved by searching all direct ancestors, starting from the direct ancestor and proceeding toward the root group, until the specified variable or dimension is found.

"For coordinate variables that are not found in the referring group or its ancestors, a further strategy is provided, called lateral search. The lateral search proceeds downwards from the root group width-wise through each level of groups until the sought coordinate is found. This placement of coordinates and lateral search strategy to find them are discouraged. They are allowed mainly for backwards-compatibility with existing datasets, and may be deprecated in future versions of the standard."

I didn't understand the justification "Because coordinate variables must share dimensions with the variables that reference them", which I have omitted above. However, it made me realise that the lateral search algorithm may have problems with dimensions. It could find a coordinate variable with a dimension that has the right name e.g. lat(lat) but where lat isn't the same dimension as in the referring group (i.e. it has a different size) if it's been defined differently somewhere between the root group and the group which contains the coordinate variable. This is dangerous. At least it requires another rule to state that the dimensions must be of the same size as in the referring group. Is the lateral search algorithm intended to work for auxiliary and scalar coordinate variables too?

The section on "Application of attributes" would be simpler and more compatible with the conventions if it referred to Appendix A. Maybe I've said that before? It seems to me that global atts mentioned in Appendix A should be allowed as group atts, with some exceptions e.g. Conventions, maybe external_variables, but why would you make an exception of title for example? Perhaps the simplest thing would be to modify Appendix A to show which G (global) atts are not permitted as group atts, and say that other G atts are allowed as group atts. I feel that you don't need to list the variable atts in this section.

Sorry to be vague about Appendix I. Yes, I understand and appreciate the intention of it. Please could we talk about that once we've agreed about the text of the main conventions?

Best wishes and thanks

Jonathan

@czender
Contributor

czender commented Oct 12, 2018

Hi All,
Just a note to reassure everyone that I am receiving/following these recent comments and questions and will compose a response next week after I unpack. In the meantime, please be aware that the ancestor search algorithm and the lateral search algorithm, as envisaged by the proposers, are not quite what Jonathan seems to think they are, so we need to do a better job of explaining them. Both searches should search no higher than the local apex group where the dimensions underlying the referring variable (and thus its coordinates) are defined. This is in general (and in practice) often lower down than the root group. In netCDF speak, the dimensions underlying the variable and the coordinate must be not only the same name, but the same ID (dimension IDs are unique in netCDF). Otherwise chaos ensues. The variable and the coordinate must both be in the scope (i.e., child groups) of the group where the dimension(s) are defined. That "feature" is due to the way that netCDF implements the Common Data Model (CDM). Hence ancestor search goes no higher than the group where the dimension(s) are defined, same as lateral search. This is the local apex group.
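
The bounding described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Group` class and the functions `local_apex`, `ancestor_search` and `lateral_search` are hypothetical names, "same dimension ID" is modelled as "the nearest enclosing definition of the dimension name is the same group", and sibling groups are visited in insertion order because the thread does not fix a collating order.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Group:
    """Hypothetical stand-in for a netCDF group."""
    name: str
    parent: Optional["Group"] = None
    dimensions: set = field(default_factory=set)
    variables: set = field(default_factory=set)
    children: list = field(default_factory=list)

    def __post_init__(self):
        # Register with the parent so the tree can also be walked downwards.
        if self.parent is not None:
            self.parent.children.append(self)


def local_apex(start: Group, dim: str) -> Optional[Group]:
    """Nearest group, at or above `start`, in which dimension `dim` is defined."""
    group = start
    while group is not None:
        if dim in group.dimensions:
            return group
        group = group.parent
    return None


def ancestor_search(start: Group, name: str, dim: str) -> Optional[Group]:
    """Search upwards for variable `name`, going no higher than the local apex group."""
    apex = local_apex(start, dim)
    if apex is None:
        return None
    group = start
    while group is not None:
        if name in group.variables:
            return group
        if group is apex:
            break
        group = group.parent
    return None


def lateral_search(start: Group, name: str, dim: str) -> Optional[Group]:
    """Breadth-first search downwards from the local apex group.

    A candidate is accepted only if `dim` resolves to the same defining group
    for it as for the referring group (assumed model of "same dimension ID").
    """
    apex = local_apex(start, dim)
    if apex is None:
        return None
    queue = deque([apex])
    while queue:
        group = queue.popleft()
        if name in group.variables and local_apex(group, dim) is apex:
            return group
        queue.extend(group.children)  # assumed order: insertion order of siblings
    return None
```

In this sketch a coordinate held in a sibling group is found by `lateral_search` but not by `ancestor_search`, and groups outside the scope of the local apex group are never visited.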

@JonathanGregory
Contributor

Dear Charlie

You're right, I had not understood that from the current wording, but indeed, this is something that has to be addressed, which I hadn't thought of. Also, does this algorithm apply to multidimensional auxiliary coordinate variables? It's possible that their dimensions might be defined in different levels of group, which is trickier. The upward search can proceed only as far as the group in which the dimension (or any of the dimensions) of the coordinate (or auxiliary coordinate) variable is defined.

Best wishes

Jonathan

jdemaria added a commit to jdemaria/gdal that referenced this issue Dec 17, 2018
    - explore recursively all nested groups to create the subdatasets list
    - subdatasets in nested groups use the /group1/group2/.../groupn/var standard
      NetCDF-4 convention, excepting for variables in the root group which do not
      have a leading slash for backward compatibility with the NetCDF-3 driver
    - when accessing a subdataset using NETCDF:$file:$path, the leading slash is optional
    - global attributes of each nested group are also collected in the GDAL dataset
      metadata, using the same convention /group1/group2/.../groupn/NC_GLOBAL#attr_name,
      except for the root group, which does not have a leading slash for backward compatibility
    - when searching for a variable containing auxiliary information on the selected subdataset,
      like coordinate variables or grid_mapping, we now also search in parent groups (using NCDFResolveVar).
    - references to coordinate variables using the "coordinates" attribute now also support absolute paths;
      this allows, for example, specifying coordinate variables located outside the group of the selected variable
      or its parents. Relative paths could be implemented if needed.

WARNINGS/TODOS:
    - all this work must be consolidated with current CF discussion about
      adding groups support to the convention: cf-convention/cf-conventions#145
    - I think "lateral search" should be added to the driver for better
      support of existing NetCDF-4 files from NASA and ESA agencies, even
      if this will not be included in future CF convention: this point is currently
      discussed by the CF team here: cf-convention/cf-conventions#144 (comment)
    - for the moment the NetCDF vector support is disabled because I failed to
      merge it with the groups support for raster (I work on it but it's very complex)
    - at this time I did not check current GDAL autotests, and new ones must
      be added to validate groups support (covering all the changes)
    - maybe some config variable should be added to be able to disable
      the new groups support
    - driver documentation must be updated
@davidhassell
Contributor

The way I do it is to go to the pull request - #145 in this case - and then click on the "Files changed" tab. This then shows all the files changed as raw diffs, and you can choose to view the Rich diff, or view the whole file.

If there's a different/better way - I'd like to know, too!

@cameronsmith1

I usually do it the way David describes. However, if that doesn't work, you could download the files and diff them using your favorite tool.

@JonathanGregory
Contributor

Dear Daniel, Charlie and others

Regarding my novice question, the point (for me) to remember is that the pull request number can't be discovered automatically from an issue. Hence it's convenient for readers to have it restated in an issue whenever it's relevant, I suppose.

Thanks for your work on this, and for the clarifications and compromises. I am almost content with this proposal now! I realise I'm still contributing just now as a reviewer rather than a moderator, but don't be concerned. I can be objective and tell myself when the debate has been concluded satisfactorily. :-) Also let me emphasise that I think this proposal fits well into CF and will be a valuable addition.

  • The lateral search is well-defined now, and the problems I was concerned about are avoided by defining the local apex group as being where the dimension is found, and not allowing this algorithm for aux coord vars. A minor point: in the definition of local apex group you write, "in which a dimension underlying an out-of-group coordinate is defined", and I wonder, why "underlying"? It's not more special than "of" would mean, is it? I appreciate that you don't want to deprecate lateral search at present because of excluding some significant datasets. I also understand that it's regrettable to exclude aux coord vars. I think we could consider admitting them too provided a reliable algorithm is defined. I hope that with the datasets that use this method there is indeed a way to be sure you've got the right variable. We could return to that when this proposal is agreed.

  • I don't think we should allow Conventions at group level, even if it is overridden at the file level. It seems potentially confusing to me that a single file could simultaneously use more than one version of CF, and possibly it might be inconsistent. For example, what if an out-of-scope time coordinate variable used a calendar which is permitted in its version of CF but is not included in the earlier version of CF applicable to the referring variable? Maybe there are more likely examples than that - I haven't spent long thinking about it, but it concerns me!

  • I think that Appendix A should be modified to show the new status for title and history being allowed for groups. I suppose we haven't thought about this before. Letter G is currently used for "global", but you could amend that to "F" for file and use G for "group" instead.

I am sure we can adopt this proposal soon. As soon as we have, we should consider your appendix on best practice for groups as a separate addition, if you don't mind - only because otherwise I'm worried about delaying the main proposal.

Best wishes

Jonathan

@czender
Contributor

czender commented Apr 18, 2019

Thank you for refreshing this proposal with your new questions and comments, Jonathan. Regarding "underlying": I agree that it is synonymous with and could be replaced by "of" in this context. Regarding Conventions: I share your opinion that Conventions should only be allowed at the file level. Anything else would be considered non-compliant. I will consider modifying NCO to automatically strip (or rename, more likely) group-level Conventions attributes when aggregating files, just to make it simpler to produce CF-compliant files using ncecat --gag, and harder to produce non-compliant files.

@erget
Member Author

erget commented Apr 18, 2019

Hi @JonathanGregory and all, sorry for spamming with multiple force pushes but I was having trouble with table formatting and didn't have access to an offline renderer. I've implemented the comments as requested. Jonathan, do you still feel good about summarising the discussion or should we reach out to somebody else from the governance panel when the 3 weeks are up? I feel you're pretty objective but I don't want to put you in an uncomfortable spot ;)

@JonathanGregory
Contributor

Dear all

Thanks for the latest changes, Daniel. Personally I am happy with this now. I am willing to be moderator still as well! Speaking in that capacity, my summary of this discussion is that there are apparently no unanswered comments or questions. Does anyone have any further comment to make on the proposal as currently shown in #145? Note that this pull request does not contain the new appendix on best practice for groups, which we'll consider separately when this one is agreed. Daniel, is there a corresponding pull-request for the conformance document?

Best wishes

Jonathan

@erget
Member Author

erget commented May 2, 2019

Thanks @JonathanGregory !

Currently there is no corresponding pull request for the conformance document - I'm a bit swamped right now but I'll see if I can find a willing partner in crime so we can produce one ;)

Best regards,
Daniel

@czender
Contributor

czender commented May 6, 2019

@erget and @JonathanGregory ,

I drafted a test netCDF file:

cf_grp.cdl
cf_grp.nc

I tried to make it succinct, yet able to test numerous Groups features, including global/group attributes, absolute/relative/ancestor searches for out-of-group coordinates in an ancestor group, absolute/relative/ancestor searches for out-of-group coordinates in a sibling group, distinct coordinates with the same name yet different dimensions in the same file, incorporating ensemble realization in both the group name and as metadata. Is this too much? Too little? Feedback welcome. If it looks good I'll create a PR for it.

Charlie

@erget
Member Author

erget commented May 14, 2019

Hi @JonathanGregory - we've passed the three week boundary again. Should we freeze the proposal as it stands? What are the next steps? My thoughts are that now that the content of the change is agreed we might move on to "implementing" it as rules in the conformance document.

@JonathanGregory
Contributor

Dear Daniel
I had noted this as due for attention tomorrow! If no-one has any more comments today (14 May), then this change has been agreed and should be included in the next release of CF, provided that the conformance document changes have also been prepared by then.
Thanks for the test file, Charlie. That's not something we usually have, but it's a good idea, often stated to be desirable! For the conformance document, we need a list of checks that can be made, and whether the failure of each of them should produce a warning (because it breaks a recommendation) or an error (because it breaks a requirement).
Cheers
Jonathan

@czender
Contributor

czender commented May 22, 2019

@erget I suggest that this text be incorporated into a PR for the Conformance document. These points are relatively easy to test for conformance. I could not think of any other points that the Groups feature allows, forbids, or recommends that could be easily tested for. Do you think this is complete?

2.7 Groups:

Requirements:

Auxiliary coordinate variables must reside in the referring group or one of its direct ancestors.

The attributes Conventions and external_variables may be used in the root group only.

Per-variable attributes including units, _FillValue, scale_factor, add_offset, valid_min, valid_max, and valid_range must only be attached directly to variables, and not provided as group-level attributes.

Recommendations:

NUG-coordinate variables that are not in the referring group or one of its direct ancestors should be referenced by absolute or relative paths rather than by relying on the lateral search algorithm.
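
The two attribute requirements above lend themselves to a simple check. The following is a minimal sketch, not a CF-checker implementation: the function name `check_group_attributes` and the input layout (a mapping from group path to that group's attributes) are assumptions made for illustration, and applying the per-variable rule to the root group as well is my reading of the draft text rather than something the thread states.

```python
# Attributes the draft requirements restrict to the root group.
ROOT_ONLY = {"Conventions", "external_variables"}

# Per-variable attributes the draft requirements forbid at group level.
PER_VARIABLE = {
    "units", "_FillValue", "scale_factor", "add_offset",
    "valid_min", "valid_max", "valid_range",
}


def check_group_attributes(group_attrs):
    """Return a list of error messages for group-level attribute violations.

    `group_attrs` maps a group path ("/" for the root group) to a dict of that
    group's attributes. This layout is hypothetical, chosen for the sketch.
    """
    errors = []
    for path in sorted(group_attrs):
        for name in sorted(group_attrs[path]):
            if name in ROOT_ONLY and path != "/":
                errors.append(
                    f"{path}: attribute '{name}' may be used in the root group only"
                )
            if name in PER_VARIABLE:
                errors.append(
                    f"{path}: attribute '{name}' must be attached to a variable, not a group"
                )
    return errors
```

Each message corresponds to a requirement, so under the warning/error split described earlier in the thread these would be reported as errors; the recommendation about lateral search would instead produce a warning and is not covered by this sketch.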

@erget
Member Author

erget commented Jul 15, 2019

@czender @JonathanGregory forgot to update on this - I've proposed a corresponding update to the Conformance document in this PR. Next steps?

@JonathanGregory
Contributor

Daniel @erget, yes, I will look at the proposed update to the conformance document. I will do that next week. Ros has had a look at it and commented by email as well.

@erget
Member Author

erget commented Jul 18, 2019

@JonathanGregory thanks very much!

@JonathanGregory
Contributor

Dear Daniel et al.
I'm reopening this because the conclusion of the conformance document is an essential part of making the change to the conventions. Once we have them in the same repository, so they can be done with the same pull-request, that will seem more obvious! Ros, David and I discussed the conformance requirements this morning. Ros is going to post some comments on the draft requirements in cf-convention/Conformance#8, and a small number of suggested further changes to the conventions about groups, which came up while considering conformance checking, for clarity and consistency. (These aren't large changes of substance to the proposal, as you will see.)
Best wishes
Jonathan

@RosalynHatcher
Contributor

Dear All,
As Jonathan has said, as the result of looking at the conformance requirements for groups I have a couple of comments/clarifications. (Sorry if these points have already been discussed and I've missed it.)

  • With the introduction of variables/dimensions being referenced as paths this obviously impacts many areas of the convention, but does the groups proposal expect that these all may contain out-of-group references? E.g. cell boundaries, cell measures, etc

  • If paths are allowed in all areas of the convention it would be good to have a general statement that where "variable" or "dimension" is used this can either be a single variable name or a path to a variable. Perhaps we could also add path to the definitions in chapter 1? So that we don't have to define the term specifically in the conformance document.

  • Although the CF standard doesn't standardize group names, which I understand is due to the abundance of NASA datasets that wouldn't meet this, I wonder whether it might be good to recommend that group names follow the same naming conventions as variables, dimensions and attributes (ie. begin with a letter and be composed of letters, digits and underscores). This way we would be encouraging human readable names but still allowing them not to be if required by some organizations.
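
The naming recommendation in the last point (begin with a letter, then letters, digits and underscores) can be expressed as a regular expression. This is a small sketch; the names `NAME_PATTERN` and `follows_recommended_naming` are made up for illustration, and restricting "letters" to ASCII is an assumption.

```python
import re

# A letter followed by any number of letters, digits or underscores (ASCII assumed).
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def follows_recommended_naming(name: str) -> bool:
    """True if a group name follows the naming style recommended for variables."""
    return NAME_PATTERN.fullmatch(name) is not None
```

Because this would be a recommendation rather than a requirement, a checker would presumably report a failing name as a warning rather than an error.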

Best Regards,
Ros.

@ghost

ghost commented Jul 26, 2019

  • Although the CF standard doesn't standardize group names, which I understand is due to the abundance of NASA datasets that wouldn't meet this, I wonder whether it might be good to recommend that group names follow the same naming conventions as variables, dimensions and attributes (ie. begin with a letter and be composed of letters, digits and underscores). This way we would be encouraging human readable names but still allowing them not to be if required by some organizations.

Yes. The NASA Dataset Interoperability Recommendations for Earth Science: Part 2 recommends exactly that.

@czender
Contributor

czender commented Jul 26, 2019

With the introduction of variables/dimensions being referenced as paths this obviously impacts many areas of the convention, but does the groups proposal expect that these all may contain out-of-group references? E.g. cell boundaries, cell measures, etc

Yes. Cell boundaries, cell measures, ancillary_variables, climatology, formula_terms, etc. may reference out-of-group (OOG) variables with the same restrictions that apply to coordinates (which are the focus of the OOG portion of the Groups documentation). Since OOG coordinates are allowed, we must allow OOG bounds variables, for example, and the reasoning for the whole menagerie like ancillary_variables, climatology, formula_terms, etc. follows suit.
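
To make this concrete, a reader of a file would first have to pull the referenced names out of each attribute value before applying the search rules. A rough sketch, assuming the usual CF value syntax (blank-separated names, with `key: variable` pairs for `cell_measures` and `formula_terms`); the function name `referenced_variables` is hypothetical.

```python
# Attributes whose values are "key: variable" pairs rather than a plain list of names.
KEYED_ATTRIBUTES = {"cell_measures", "formula_terms"}


def referenced_variables(attr_name: str, value: str) -> list:
    """Return the variable references (plain names or paths) held in an attribute value."""
    tokens = value.split()
    if attr_name in KEYED_ATTRIBUTES:
        # Drop the "area:" / "sigma:" style keys and keep only the variable references.
        return [token for token in tokens if not token.endswith(":")]
    return tokens
```

Each returned reference may then be a plain name or a path, and would be resolved under the same restrictions as an entry in `coordinates`.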

@czender
Contributor

czender commented Jul 27, 2019

If paths are allowed in all areas of the convention it would be good to have a general statement that where "variable" or "dimension" is used this can either be a single variable name or a path to a variable. Perhaps we could also add path to the definitions in chapter 1? So that we don't have to define the term specifically in the conformance document.

Agreed.

@dblodgett-usgs
Contributor

@erget -- can we identify a moderator for this? I haven't been following this closely enough to play that role, unfortunately. Maybe @JonathanGregory?

@erget
Member Author

erget commented Aug 23, 2019

Jonathan moderated (very helpfully and patiently!). I'm closing this as the PR has been merged.
