There are too many mandatory fields #26

Open
geoffmatt opened this issue Aug 19, 2020 · 15 comments

Comments

@geoffmatt

This is a general problem with implementing this format.
There are a very large number of mandatory fields, the majority of which are not actually required for all applications of acoustic data.
Examples are platform latitude, longitude and heading in the beam group table. These would typically be useful parameters, but it is an assumption that they are mandatory. Many acoustic data sets are recorded (for better or worse) without accompanying GPS. By making these fields mandatory we are forcing people to write nonsense values into them. This is especially problematic for fields such as longitude and latitude because the nonsense value is likely to be 0.0, which is itself a valid longitude or latitude. It therefore becomes impossible for the reader to determine whether these values should be used, which introduces bugs.

@cyrilleponcelet
Contributor

Hi, for the example above people should use NaN (Not a Number) for float or double to indicate invalid values.
NaN will be propagated by any computation that uses these values, giving NaN as the result.
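
As a minimal sketch of the behaviour described here, assuming Python and its standard `math` module (neither is named in the convention), a quiet NaN passes through ordinary floating-point arithmetic without raising, whereas a 0.0 placeholder yields a plausible-looking result. The helper `midpoint` is purely illustrative, and other languages or integer conversions may behave differently.

```python
import math

def midpoint(a, b):
    """Average two coordinates; hypothetical helper for illustration."""
    return (a + b) / 2.0

missing = math.nan  # placeholder for "no position recorded"
result = midpoint(missing, 10.0)

# Arithmetic on a quiet NaN yields NaN rather than raising an exception.
is_nan = math.isnan(result)

# NaN never compares equal to anything, including itself,
# so a reader must test with isnan() rather than ==.
nan_equals_itself = (result == result)

# A 0.0 placeholder silently produces a value that looks valid.
zero_placeholder_result = midpoint(0.0, 10.0)
```

The point of the comparison is that NaN stays detectable after the calculation, while a 0.0 placeholder does not.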

@geoffmatt
Author

Hi, thanks for the response.
NaN isn't automatically propagated by any computation. In fact most programming languages would have to specifically check for an NaN value before performing any calculations, otherwise an exception would occur. We would have to check every value being read in from this file format for NaN in order to be safe, and make a decision on what to do with the value should it be NaN, which is very expensive from a maintenance perspective.
More importantly though, there is a difference between a mandatory value which is invalid, and a value which is not mandatory.
In this case we are unable to distinguish between the lat/long values being in error, i.e. invalid, and the lat/long values being omitted because the system does not record them.
It is simply incorrect to state that these values are mandatory, when it is clear that they are not necessary for the data within the file to be useful. If they were truly mandatory then the file would be useless without them, but that is not the case, hence they are not mandatory.
The format as it stands is forcing people to write NaN values when they are not required, which is both deceptive, and inefficient.

@cyrilleponcelet
Contributor

Hi,
By NaN we should probably be more specific and talk about quiet NaN.

Do you have a concrete example where lat/lon data would be missing? Data are supposed to be acquired by some physical device, and thus can be georeferenced, even if the sounder is in a fixed position. This allows us to develop algorithms that work for every file and geolocate the data in space (such as echo integration, display of data in 2D/3D GIS, ...).
The idea behind having lat/lon as mandatory fields is also to reduce the number of cases (see #27) and to have a common set of fields that are expected in every file.
Of course this is open to discussion: what is the use case where we cannot define any physical location in the world?

@geoffmatt
Author

The key point is "data can be georeferenced", but it doesn't have to be. The echosounder itself does not require georeferencing to function.
It looks like you are trying to enforce a working method with the file format. I.e. high quality data should be georeferenced, therefore this information should be mandatory. This is similar to saying high quality acoustic data should be calibrated and therefore calibration information should also be mandatory.
Both these points are correct, ideally data should be calibrated and geo-referenced. However this is trying to force a very narrow view of how acoustic data should be collected and used. Sure someone on a boat, doing a biomass integration will be ok with this. But what about someone testing an echosounder in a tank? Why would they need GPS data?

The important point is that this is trying to force the user's work paradigm. This is unlikely to work, because there is no "correct paradigm" for working with acoustics. It is too varied a field for there to be a single method which can be applied to all cases.

@gavinmacaulay
Collaborator

Version 1 of the convention has the geographic position as MA (mandatory if available/applicable), for this reason - not all uses of sonars (or echosounders) need, or will have, position. And in some cases the position is unavailable - e.g., equipment failures, or the Simrad WBAT (which can't currently receive a GPS feed) - in these cases it can be added afterwards, but this shouldn't prevent data direct from the instrument from meeting the convention.

The convention hasn't really resolved whether to prefer variables containing a 'no data' value or whether the variable should be absent. Currently, both are used in different places. Comments on this?

@geoffmatt
Author

I'd prefer the data to be absent if it's not applicable because it allows us to differentiate the cases where:
a) The data isn't present in the recording device
b) The data should be present but for some reason there's an error in the recorded value

It's useful for the software to be able to post warning messages in the case of b) to tell the user that something is wrong. Whereas in case a) there is no error so we don't want to warn.
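
A reader along these lines might look like the following minimal sketch. It assumes Python, uses a plain `dict` as a stand-in for a netCDF4 group so that it runs without third-party libraries, and the names `read_optional`, `tank_test`, `survey` and the spelling `platform_latitude` are hypothetical.

```python
import math
import warnings

def read_optional(group, name):
    """Return the variable's values, or None when it is absent.

    `group` is a plain dict standing in for a netCDF4 group; this is an
    assumption made so the sketch runs without third-party libraries.
    """
    if name not in group:
        # Case a): the device never recorded this variable. Not an error.
        return None
    values = group[name]
    bad = sum(1 for v in values if math.isnan(v))
    if bad:
        # Case b): the variable is present but some values are invalid.
        warnings.warn(f"{name}: {bad} of {len(values)} values are NaN")
    return values

# Hypothetical files: a tank test without GPS, and a survey with a dropout.
tank_test = {}
survey = {"platform_latitude": [60.1, math.nan, 60.3]}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    absent = read_optional(tank_test, "platform_latitude")  # no warning
    present = read_optional(survey, "platform_latitude")    # one warning
```

With the variable absent the reader stays silent; with it present but partly NaN, exactly one warning is raised.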

@gavinmacaulay
Collaborator

Further to the above, Furuno have raised some queries regarding mandatory variables that lead to these questions:

  1. Should beamwidth_transmit_{major|minor} have an obligation of mandatory (M) or mandatory if applicable/available (MA)? Currently they are MA. They ask because the mandatory variable attribute 'substitute_value_used' could be useful for those specific variables (but is currently only applied to M obligations). Options are:
     1. Allow 'substitute_value_used' on MA obligations
     2. Change the obligation from MA to M
  2. platform_heading(ping_time) in the beam subgroup has obligation M but heading(time) in the platform group has obligation MA (@cyrilleponcelet). Options are:
     1. If this was as intended, the convention should explain this apparent inconsistency
     2. Make them consistent
  3. We need to reach a decision on what to do for variables with mandatory obligation, but which are not available to the sonar (e.g., no/faulty GPS, no/faulty MRU), and to document that in the convention. So far the options are:
     1. MA so that they are not present if not available
     2. M and allow NaN if they are not available
     3. MA and allow NaN (not sure what this would achieve, but it makes up the complete set...)

Comments on the above welcome...

@cyrilleponcelet
Contributor

Hi @gavinmacaulay

1: It seems to me that substitute_value_used is already used for both MA and M variables, for example receive_duration_effective.

2: I think heading in both attitude_sub_group and position_sub_group should be set as Mandatory. As far as I know every sensor provides these data.

3: I'm in favor of M allowing NaN. My feeling is that having variables defined whenever possible will make the format more consistent and simpler than having to code a set of tests or check ancillary variables that explain why something is declared or not. This makes sharing methods or tools easier and helps the use of the format.
This does not prevent us from adding comments (for example in /Provenance) explaining how the data were acquired and why all georeferencing values are NaN in some cases, for example a missing GPS.

@akiraokunishi
Contributor

Bonjour to all: @gavinmacaulay, @cyrilleponcelet, @geoffmatt,

I would like to make comments on the items 2 and 3, based on our common understandings of the “M” and “MA” obligations.

The convention, since Version 1.0 (CRR No. 341), states that:
(S1) “Some variables and attributes in SONAR-netCDF4 are mandatory; these form the minimal set of data required to quantitatively use backscattering amplitude data.”
(S2) “If a variable is mandatory, it must be present and must contain data.”
(S3) “MA: mandatory if applicable or available.”

My understandings are:
(C1) S1 is not consistent with “heading is mandatory”, because backscattering amplitude data can be quantitatively used without heading values.
(C2) S2 is not consistent with the option “M and allow NaN”, because if NaN was allowed, the condition “it must contain data” would lose its meaning.
(C3) S3 implies that the option “M and allow NaN” can be replaced by “MA” without any inconsistency.

For further discussions,
(Q1) If C1 to C3 include any mistakes, please kindly point them out.
(Q2) If S1 to S3 need any changes, please share the intentions and the alternative statements.

Any other comments would be appreciated too.

@gavinmacaulay
Collaborator

My opinions:

  1. We allow substitute_value_used for MA variables, where wanted.
  2. Heading is necessary to quantitatively use some types of sonar data (e.g., omnisonar target tracking, multibeam bathymetry), but is not necessary for other types (e.g., echosounder biomass surveys). Quite a lot of quantitative echosounder data doesn't have heading included (e.g., stationary systems, operations from small boats without a heading sensor or MRU, etc). To me, this suggests that heading is more appropriate as MA because it can't be M if a commonly-used instrument configuration doesn't include - and doesn't use - a heading sensor.
  3. (C2) above is correct - version 1 of the convention required actual data in M variables, although when I wrote that I was thinking that occasional NaNs in an M variable were ok - it was an M variable that was all NaNs that I wanted to prevent (and that MA was the appropriate obligation for those variables).
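
One possible reading of point 3, sketched in Python under stated assumptions: an M variable may contain occasional NaNs but must not be absent or entirely NaN, while an MA variable may be absent. The function name `check_obligation` and the returned messages are hypothetical, and an all-NaN MA variable is treated as acceptable only because the discussion has not settled that case.

```python
import math

def check_obligation(values, obligation):
    """Classify a variable under the reading sketched above.

    values: list of floats, or None if the variable is absent.
    obligation: "M" or "MA".
    Returns "ok" or a short description of the problem.
    """
    if values is None:
        # Absence is acceptable only for MA variables.
        return "ok" if obligation == "MA" else "missing mandatory variable"
    if obligation == "M" and values and all(math.isnan(v) for v in values):
        # Occasional NaNs are tolerated; an entirely NaN M variable is not.
        return "mandatory variable is entirely NaN"
    return "ok"

occasional = check_obligation([1.0, math.nan, 3.0], "M")
all_nan = check_obligation([math.nan, math.nan], "M")
absent_ma = check_obligation(None, "MA")
absent_m = check_obligation(None, "M")
```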

I will think some more about the M/MA and NaN options and combinations - there are benefits and disadvantages to both of the main options...

@geoffmatt
Author

I am in agreement with Okunishi-san (@akiraokunishi) on this one.

  1. Allowing substitute_value for MA variables would be an improvement. It would be more consistent in the sense that if an MA variable is applicable to the data, then it becomes mandatory by definition. In that case it should follow the same rules as M variables, which would allow a substitute value.
  2. I sort of agree with @gavinmacaulay here. To be honest I don't think Heading or GPS data should even be MA, it should be Optional. To my knowledge no echosounder actually generates this information, instead they are reliant on a separate sensor (typically a vessel sensor) to provide it. This means that pretty much any echosounder can be used without such a sensor, and therefore without heading or GPS data. By making it M or MA we are suggesting that doing so is an error, which I have not seen a justification for.
  3. I think there is a flaw in the definition of MA. Previously this was defined as "Mandatory if Applicable" which makes sense to me. If the data comes from a Simrad system there may be variables which become Mandatory, which would not be if the data was coming from a Furuno system. However the idea of "Mandatory if Applicable or Available" doesn't make sense to me. Either the variable is mandatory or it is not. If the file would be considered OK without the variable being present, then that variable is not Mandatory. In this case I believe the variable should be marked as Optional.

@akiraokunishi
Contributor

Gavin-san (@gavinmacaulay), Geoff-san (@geoffmatt),
Thank you for sharing well-considered opinions.

(C4) “Mandatory” obligation

Your opinions led me to look at a phrase in the existing convention (in (S1) above):
“... the minimal set of data required to quantitatively use backscattering amplitude data.”

It seems that adding the statement below, after this sentence, would clearly and uniquely define the “minimal set”, and hence “mandatory”, in this convention for sonar data:

(Sa) “Specifically, this is the set of data required to evaluate target strengths and volume backscatter strengths using proper types of conversion equations.”

The existing statement below also should be maintained.
(S4) “The set of mandatory variables and attributes has been chosen so that sonar systems can directly generate SONAR-netCDF4-conforming files without needing survey, experiment, or cruise-specific data.”


(C5) “Mandatory if applicable” obligation

Geoff-san gave a specific example for “mandatory if applicable”.
It seems that adding a similar explanation would be better for readers/users of the convention, e.g.:

(Sb) “Variables that are required for only particular (i.e., not all) types of the equations have been given ‘mandatory if applicable’ obligations”.


(C6) “Mandatory if available” obligation

I think I understand the flaw that Geoff-san points out.
That is, if a variable is mandatory, sonars must write its values in netCDF4 files. So, the data must be available, and then “if available” has no meaning.

However, I also understand that “mandatory” in this convention simply means “must be recorded” for sonars.
So, “mandatory if available” can make sense to me as “must be recorded if available”.


For further discussions,
(Q3) If (Sa) in (C4) causes any inconsistency or difficulty, please kindly let me know.

Any other comments would be appreciated too.

@emiliom

emiliom commented May 20, 2023

I've read through this issue to figure out how to interpret the convention regarding whether MA variables for which no data is available must be present but filled with NaN, or can be omitted. The discussion here has been very helpful. As far as I can see, though, there's no definitive position in the convention or here. As @gavinmacaulay stated earlier in the issue:

The convention hasn't really resolved whether to prefer variables containing a 'no data' value or whether the variable should be absent. Currently, both are used in different places. Comments on this?

The statement in the convention documents (both vers. 1 & 2), "Any non-mandatory variables can be absent from a SONAR-netCDF4 file", has its own ambiguity: Does this mean that MA variables can be absent, or is it referring only to R and O variables?

I see that there are programmatic pros and cons for either approach. On the cons, an extreme example would be backscatter_i. If there's no imaginary component, creating this variable as defined in the convention and populating it with all NaN is inefficient.

Further discussions about whether some variables should be redesignated from Mandatory (M) to MA also seem unresolved. @akiraokunishi's most recent comments were a helpful refinement.

At a minimum, I think there's tacit agreement that omitting MA variables falls within current practice. Is that a reasonable conclusion?

Thank you.

@cyrilleponcelet
Contributor

Hi @emiliom
One remark on data size: with compression enabled on variables, if one of them is filled with a single repeated value its size should become close to zero. The exception is vlen types like backscatter_i, where I'm not sure whether compression can be applied to an array of empty vlen.
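
The size remark can be illustrated with a rough sketch, assuming deflate-style compression of the kind commonly enabled in netCDF-4 files and using Python's standard `zlib` as a stand-in for the library's own filter. The array length and variable names are arbitrary, and the vlen case is not covered.

```python
import random
import struct
import zlib

n = 100_000  # arbitrary number of samples for illustration

# A variable filled entirely with NaN: one 8-byte pattern repeated n times.
all_nan = struct.pack(f"<{n}d", *([float("nan")] * n))

# A variable of varied values, for comparison.
rng = random.Random(0)
varied = struct.pack(f"<{n}d", *(rng.random() for _ in range(n)))

# Compressed sizes in bytes.
nan_size = len(zlib.compress(all_nan))
varied_size = len(zlib.compress(varied))
```

In this sketch the all-NaN buffer compresses to a small fraction of its raw size, while the varied buffer does not; actual figures in a real file will depend on chunking and filter settings.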

I agree with you that there is a tacit agreement here that MA variables can be omitted if they have no meaning.
To be more specific, I do not expect a backscatter_i variable to be declared if I have no complex records or type 5 equation. I think a file is better off declaring only the variables that are meaningful for the kind of data recorded than having variables filled with NaN values when they have no meaning in that particular dataset.

@emiliom

emiliom commented Jun 6, 2023

Thanks so much for your input, @cyrilleponcelet. That's very helpful.
