Support for labeled missingness #880

khusmann · 2024-02-18T00:22:05Z

Presently, missing values (whether specified at a schema-level or field-level) are specified using their physical values.

It is not uncommon for these physical missing values to represent logical reasons for why a given observation might be missing (e.g. "Participant skipped item", "Declined", "Not applicable"). Sometimes these reasons can be inferred from the physical values ("SKIPPED", "DECLINED", "N/A"). Other times these reasons are encoded: (-99, -98, -97).

As extensively discussed here, many software packages (Stata, SPSS, MPlus, etc.) use the concept of "value labels" to map encoded categorical values to logical categorical levels as well as physical missing values to logical missing reasons. The recently proposed categorical field type (#875) would give us the former capability: mappings from physical categorical encodings to logical categorical levels. Here, I want to propose a similar pattern to give us the latter ability as well: mappings from physical missingValues to logical missing reasons.

Below is the proposed type signature (for use in both schema-level and field-level missingValues definitions). Note that in accordance with the V2 spec change rules, it would not interfere with the existing way of specifying missingValues (i.e. an array of strings without labels is still valid):

{
  missingValues: ({ value: string, label?: string } | string)[]
}

Alternatively in RFC language: missingValues MUST be an array where each element is a JSON object or a string. If the array element is an object, it MUST have a value field of type string and optionally a label field, also of type string.

With this definition, the above example would translate to a missingValues spec like this:

{
  "schema": {
    "fields": [...],
    "missingValues": [
      { "value": "-99", "label": "Participant skipped item" },
      { "value": "-98", "label": "Declined" },
      { "value": "-97", "label": "Not applicable" }
    ]
  }
}

This format is also easily extendable with reason-level metadata via user defined fields:

{
  "schema": {
    "fields": [...],
    "missingValues": [
      { "value": "-99", "label": "Participant skipped item" },
      { "value": "-98", "label": "Declined" },
      { "value": "-97", "label": "Not applicable" },
      { "value": "-96", "label": "Special incident 1", "description": "The survey software crashed in our final week of collection causing significant data loss" }
    ]
  }
}

And as mentioned above, unlabelled missingness would still be valid as an array of strings:

{
  "schema": {
    "fields": [...],
    "missingValues": ["-99", "-98", "-97"]
  }
}

Although other software like Pandas / SQL / Polars / R, etc. do not natively support the concept of labeled missingness, or even types of missingness (all missingness values are collapsed into a single Null / NA / None type), it is still possible to emulate the behavior of a "tagged missing" type by loading values and missing reasons into separate columns, as I describe here (for an R implementation). So even when missing reasons are not natively supported by an implementation's environment, implementations still have the potential to make use of these missing labels, if they desire.

Tagging @peterdesmet and @pschumm for their input…

The text was updated successfully, but these errors were encountered:

pschumm · 2024-02-18T11:48:25Z

This would, together with #875, provide full support for value labels in Stata, SAS and SPSS, which would be nice. And as @khusmann points out, this additional information could be used productively by other software too if implementors wish to do so. And while it does add extra complexity, that complexity is not required (I presume it will simply be ignored by folks working in areas that don't make use of value labels) and if we accept #875, will already be present in the standard (this would just extend that same syntax option to the missingValues property).

Resolves frictionlessdata/datapackage#880. When paired with the `categorical` type, this gives us full support for the value labels found in many statistical software packages. I have included it here rather than a separate PR because these issues are intertwined and there's a synergy in addressing them simultaneously. That said, if you would rather see this in a PR, let me know and I can revert.

roll · 2024-06-05T09:18:17Z

DONE in #68

peterdesmet mentioned this issue Feb 19, 2024

Promote the Enum Labels and Ordering pattern to the Table Schema spec? #875

Closed

roll added this to the v2-final milestone Mar 21, 2024

roll added the Table Schema label Mar 21, 2024

roll assigned khusmann Apr 3, 2024

roll mentioned this issue Apr 3, 2024

Add a categorical field type frictionlessdata/datapackage-v2-draft#48

Closed

roll added the proposal label Apr 3, 2024

roll closed this as completed Jun 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for labeled missingness #880

Support for labeled missingness #880

khusmann commented Feb 18, 2024

pschumm commented Feb 18, 2024

roll commented Jun 5, 2024

Support for labeled missingness #880

Support for labeled missingness #880

Comments

khusmann commented Feb 18, 2024

pschumm commented Feb 18, 2024

roll commented Jun 5, 2024