-
Notifications
You must be signed in to change notification settings - Fork 5
Add a categorical
field type [field property version]
#68
Conversation
Tagging Albert-Jan @fomcl |
|
||
`string` and `integer` field types `MAY` include a `categories` property to indicate that the field contains categorical data, and the field `MAY` be loaded as a categorical data type if supported by the implementation. The `categories` property `MUST` be an array of values or an array of objects that define the levels of the categorical. | ||
|
||
When the `categories` property is an array of values, the values `MUST` be unique and `MUST` match logical values of the field. For example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MUST
match logical values of the field
This sounds like categories
cannot contain a value that is not present in the data, but I believe we intend the reverse: the field cannot contain a value that is not in categories
. It also seems that the unique constraint should apply whether an array or array of objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points! Just made some clarifications in the latest commits. Let me know if it looks good or if you have other rephrasings I should try!
@khusmann I'm thumbs up on this approach since it solves the typing confusion, but I feel the description is a tad wordy and technical sounding, mixing "level" and "value" and "categories" for the same thing and not stating until much later that
Suppose we have a field {
"name": "fruit",
"type": "string",
"categories": ["apple", "orange", "banana"]
} If {
"name": "fruit",
"type": "integer",
"categories": [
{ "value": 0, "label": "apple" },
{ "value": 1, "label": "orange" },
{ "value": 2, "label": "banana" }
]
} When the {
"name": "agreementLevel",
"type": "integer",
"categories": [
{ "value": 1, "label": "Strongly Disagree" },
{ "value": 2 },
{ "value": 3 },
{ "value": 4 },
{ "value": 5, "label": "Strongly Agree" }
],
"categoriesOrdered": true
} When the property An |
Awesome @ezwelty, thanks for these edits! Just merged them in. It definitely reads a lot smoother now.
This was discussed in an earlier thread -- we decided against allowing this because collapsing categories should be considered a separate operation: frictionlessdata/datapackage#875 (comment) |
I think this latest round of edits is excellent, though I have a concern with the sentence above; specifically, when When the property |
Thanks @pschumm! Just merged your edit.
This is the most convincing argument to me. This effectively means we have 3 valid types of ordering for categoricals: unordered, ordered, and unknown. Then, when an implementation needs to convert it to an unordered or ordered type for analysis, summary, display, etc. it could warn ("No ordering specified for categorical, assuming unordered") or prompt the user to choose how it should be handled for that action. We'll also encounter unknown ordering when importing categoricals from a source that doesn't support ordering (e.g. a DuckDB / Parquet enum) -- it's good to have a representation for that case, instead of making assumptions about it. |
Nice work @khusmann (and co-authors)! Reads well and leaves room to implement where useful, with a reasonable fallback to just regular string/integer (with enum). |
@khusmann |
ACCEPTED by WG (6/9) |
@roll, while this is now expressed as documentation, I assume changes need to be made to the profiles as well? https://github.com/frictionlessdata/datapackage/tree/main/profiles/source |
Sorry missed it. I'll add it now |
Great! As a new PR I assume? |
Here's the latest categorical alternative approach that simply extends the existing
string
andinteger
types rather than attempting to be a top-level field type.More info / rationale here in this thread from the previous attempt PR: frictionlessdata/datapackage#48 (comment)
@pschumm @ezwelty @djvanderlaan