This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Add a categorical field type [field property version] #68

Merged
merged 9 commits into from
Jun 5, 2024

Conversation

khusmann
Contributor

Here's the latest categorical alternative approach that simply extends the existing string and integer types rather than attempting to be a top-level field type.

More info / rationale here in this thread from the previous attempt PR: frictionlessdata/datapackage#48 (comment)

@pschumm @ezwelty @djvanderlaan

@djvanderlaan

Tagging Albert-Jan @fomcl

content/docs/specifications/table-schema.md (outdated)

`string` and `integer` field types `MAY` include a `categories` property to indicate that the field contains categorical data, and the field `MAY` be loaded as a categorical data type if supported by the implementation. The `categories` property `MUST` be an array of values or an array of objects that define the levels of the categorical.

When the `categories` property is an array of values, the values `MUST` be unique and `MUST` match logical values of the field. For example:

> MUST match logical values of the field

This sounds like categories cannot contain a value that is not present in the data, but I believe we intend the reverse: the field cannot contain a value that is not in categories. It also seems that the unique constraint should apply whether categories is an array of values or an array of objects.

Contributor Author


Good points! Just made some clarifications in the latest commits. Let me know if it looks good or if you have other rephrasings I should try!

@ezwelty

ezwelty commented May 30, 2024

@khusmann I'm thumbs up on this approach since it solves the typing confusion, but I feel the description is a tad wordy and technical-sounding: it mixes "level", "value", and "categories" for the same thing, and does not state until much later that categories restricts the valid values of a field. I also don't see why the labels should be required to be unique. R allows this ("Duplicated values in ‘labels’ can be used to map different values of ‘x’ to the same factor level."). It is also odd to suggest (I suspect in error) that the label property MUST be human-readable (labels can be gobbledygook if that's what people want). Here is my attempt at an edit (I presume you are able to edit/view the raw markdown?):

string and integer field types MAY include a categories property to restrict the field to a finite set of possible values (similar to an enum constraint) and indicate that the field MAY be loaded as a categorical data type if supported by the implementation. The categories property MUST be either (a) an array of unique values or (b) an array of objects, each with a unique value property. The logical representation of data in the field MUST exactly match one of the values in categories.

Suppose we have a field fruit with possible values "apple", "orange", or "banana". The field definition would look like this if categories is (a) an array of values:

{
  "name": "fruit",
  "type": "string",
  "categories": ["apple", "orange", "banana"]
}
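
A minimal sketch of how an implementation might check the rule above, that every logical value in the field matches one of the values in categories. The helper names `category_values` and `invalid_values` are hypothetical and not part of any Frictionless library; missing values are not handled here.

```python
def category_values(field):
    """Return the allowed values for form (a), a list of values, or form (b), a list of objects."""
    values = []
    for item in field.get("categories", []):
        values.append(item["value"] if isinstance(item, dict) else item)
    if len(set(values)) != len(values):
        raise ValueError("categories values must be unique")
    return values


def invalid_values(field, data):
    """Return the data values that are not listed in categories."""
    allowed = set(category_values(field))
    return [v for v in data if v not in allowed]


fruit = {"name": "fruit", "type": "string", "categories": ["apple", "orange", "banana"]}
print(invalid_values(fruit, ["apple", "kiwi", "banana"]))  # ['kiwi']
```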

If categories is (b) an array of objects, each object MAY also have a label property, which when present, MUST be a string. In our example, this allows us to store our fruit with values 0, 1, and 2 in an integer field and label them as "apple", "orange", and "banana":

{
  "name": "fruit",
  "type": "integer",
  "categories": [
    { "value": 0, "label": "apple" },
    { "value": 1, "label": "orange" },
    { "value": 2, "label": "banana" }
  ]
}
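
As an illustration of form (b), the sketch below builds a value-to-label lookup from the field definition. The function name `label_map` is hypothetical, and falling back to the value's text when label is absent is an assumption of this sketch rather than something the proposed wording requires.

```python
def label_map(field):
    """Map each category value to its label."""
    mapping = {}
    for item in field["categories"]:
        value = item["value"]
        if value in mapping:
            raise ValueError(f"duplicate category value: {value!r}")
        # Assumption: an object without a label is displayed using its value.
        label = item.get("label", str(value))
        if not isinstance(label, str):
            raise TypeError("label must be a string")
        mapping[value] = label
    return mapping


fruit_codes = {
    "name": "fruit",
    "type": "integer",
    "categories": [
        {"value": 0, "label": "apple"},
        {"value": 1, "label": "orange"},
        {"value": 2, "label": "banana"},
    ],
}
labels = label_map(fruit_codes)
print([labels[v] for v in [2, 0, 1]])  # ['banana', 'apple', 'orange']
```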

When the categories property is defined, it MAY be accompanied by a categoriesOrdered property in the field definition. When present, the categoriesOrdered property MUST be boolean. When categoriesOrdered is true, implementations SHOULD regard the order of appearance of the values in the categories property as their natural order. For example:

{
  "name": "agreementLevel",
  "type": "integer",
  "categories": [
    { "value": 1, "label": "Strongly Disagree" },
    { "value": 2 },
    { "value": 3 },
    { "value": 4 },
    { "value": 5, "label": "Strongly Agree" }
  ],
  "categoriesOrdered": true
}

When the property categoriesOrdered is false or not present, implementations SHOULD assume that the categories do not have a natural order.
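
To make the "order of appearance" rule concrete, here is a small sketch that sorts data by position in categories rather than alphabetically. The `size` field and the function name `sort_by_category_order` are made up for illustration; values missing from categories would raise a `KeyError` in this sketch.

```python
def sort_by_category_order(field, data):
    """Sort data by the order in which values appear in categories."""
    if field.get("categoriesOrdered") is not True:
        raise ValueError("field is not declared as ordered")
    order = [c["value"] if isinstance(c, dict) else c for c in field["categories"]]
    rank = {value: position for position, value in enumerate(order)}
    return sorted(data, key=rank.__getitem__)


size = {
    "name": "size",
    "type": "string",
    "categories": ["small", "medium", "large"],
    "categoriesOrdered": True,
}
print(sort_by_category_order(size, ["large", "small", "medium", "small"]))
# ['small', 'small', 'medium', 'large']
```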

An enum constraint MAY be added to a field with a categories property, but if so, the enum values MUST be a subset of the values in categories.
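
A sketch of the subset check described above, assuming the enum constraint sits under the field's `constraints` object as elsewhere in Table Schema. The function name `enum_outside_categories` is hypothetical.

```python
def enum_outside_categories(field):
    """Return enum values that are not among the values in categories."""
    categories = [c["value"] if isinstance(c, dict) else c for c in field.get("categories", [])]
    enum = field.get("constraints", {}).get("enum", [])
    return [v for v in enum if v not in categories]


restricted_fruit = {
    "name": "fruit",
    "type": "string",
    "categories": ["apple", "orange", "banana"],
    "constraints": {"enum": ["apple", "kiwi"]},
}
print(enum_outside_categories(restricted_fruit))  # ['kiwi']
```

An empty result means the enum is a valid subset of categories.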

@khusmann
Contributor Author

Awesome @ezwelty, thanks for these edits! Just merged them in. It definitely reads a lot smoother now.

> I also don't see why the labels should be required to be unique. R allows this ("Duplicated values in ‘labels’ can be used to map different values of ‘x’ to the same factor level.").

This was discussed in an earlier thread -- we decided against allowing this because collapsing categories should be considered a separate operation: frictionlessdata/datapackage#875 (comment)

@pschumm

pschumm commented Jun 2, 2024

> When the property categoriesOrdered is false or not present, implementations SHOULD assume that the categories do not have a natural order.

I think this latest round of edits is excellent, though I have a concern with the sentence above; specifically, when categoriesOrdered is not present, I believe that no assumptions should be made. For example, this property might be excluded because the data producer may not be familiar with the analytic concept of an ordinal variable. Alternatively, there can occasionally be legitimate ambiguity about whether a variable is ordered or not, and the data producer may have chosen to represent this by leaving this property off (i.e., leaving it up to the data consumer to decide this). Finally, the property may have simply been excluded in error. Thus, I would prefer that we say the following:

When the property categoriesOrdered is false, implementations SHOULD assume that the categories do not have a natural order; when the property is not present, no assumption about the ordered nature of the values SHOULD be made.

@khusmann
Contributor Author

khusmann commented Jun 3, 2024

Thanks @pschumm! Just merged your edit.

> Alternatively, there can occasionally be legitimate ambiguity about whether a variable is ordered or not

This is the most convincing argument to me. This effectively means we have 3 valid types of ordering for categoricals: unordered, ordered, and unknown. Then, when an implementation needs to convert it to an unordered or ordered type for analysis, summary, display, etc. it could warn ("No ordering specified for categorical, assuming unordered") or prompt the user to choose how it should be handled for that action.

We'll also encounter unknown ordering when importing categoricals from a source that doesn't support ordering (e.g. a DuckDB / Parquet enum) -- it's good to have a representation for that case, instead of making assumptions about it.
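
One way the three ordering states might be represented is sketched below. The `Ordering` enum and both function names are hypothetical; falling back to unordered with a warning is only one of the options mentioned above (prompting the user is the other).

```python
import warnings
from enum import Enum


class Ordering(Enum):
    ORDERED = "ordered"
    UNORDERED = "unordered"
    UNKNOWN = "unknown"


def ordering_of(field):
    """Read categoriesOrdered without assuming anything when it is absent."""
    flag = field.get("categoriesOrdered")
    if flag is None:
        return Ordering.UNKNOWN
    return Ordering.ORDERED if flag else Ordering.UNORDERED


def resolve_ordered(field):
    """Return a bool for consumers that need one, warning when the ordering is unknown."""
    ordering = ordering_of(field)
    if ordering is Ordering.UNKNOWN:
        warnings.warn("No ordering specified for categorical, assuming unordered")
        return False
    return ordering is Ordering.ORDERED
```

Keeping `UNKNOWN` distinct until a consumer actually needs a boolean preserves what the data producer did or did not declare.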

@peterdesmet
Member

Nice work @khusmann (and co-authors)! Reads well and leaves room to implement where useful, with a reasonable fallback to just regular string/integer (with enum).

@roll
Member

roll commented Jun 5, 2024

@khusmann
Absolutely tremendous work leading this through all these iterations 👏 And big thanks to all the contributors 🎉

@roll
Member

roll commented Jun 5, 2024

ACCEPTED by WG (6/9)

@roll merged commit e05eb9a into frictionlessdata:main on Jun 5, 2024
1 check passed
@peterdesmet
Member

@roll, while this is now expressed as documentation, I assume changes need to be made to the profiles as well? https://github.com/frictionlessdata/datapackage/tree/main/profiles/source

@roll
Member

roll commented Jun 5, 2024

Sorry, I missed it. I'll add it now.

@peterdesmet
Member

Great! As a new PR I assume?

6 participants