This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Add field.categories/categoriesOrdered for string/number (#68)
* add categories / categoriesOrdered field properties

* add modifications to missingValues

* minor formatting edits

* fix typo in missingValues section

* clarify that defined categorical levels do not have to be present in the data

* clarify uniqueness of values / labels

* reword with @ezwelty's edits

* when categoriesOrdered is not present, do not assume nature of order

---------

Co-authored-by: roll <[email protected]>
khusmann and roll authored Jun 5, 2024
1 parent 7fd028b commit e05eb9a
Showing 1 changed file with 63 additions and 3 deletions.
66 changes: 63 additions & 3 deletions content/docs/specifications/table-schema.md
@@ -127,11 +127,20 @@ A Table Schema descriptor `MAY` contain a property `fieldsMatch` that `MUST` be

Many datasets arrive with missing data values, either because a value was not collected or because it never existed. Missing values may be indicated simply by the value being empty; in other cases a special value may have been used, e.g. `-`, `NaN`, `0`, `-9999`, etc.

`missingValues` dictates which string values `MUST` be treated as `null` values. This conversion to `null` is done before any other attempted type-specific string conversion. The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to null will be done, on any value.
`missingValues` dictates which values `SHOULD` be treated as missing values. Depending on implementation support for representing missing values, implementations `MAY` offer different ways of handling missingness when loading a field, including but not limited to: converting all missing values to `null`, loading missing values inline with a field's logical values, or loading the missing values for a field in a separate, additional column.

`missingValues` `MUST` be an `array` where each entry is a `string`.
`missingValues` `MUST` be an `array` where each entry is a unique `string`, or an `array` where each entry is an `object`.

**Why strings**: `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing value which are not of their type, for example a `number` field to have missing values indicated by `-`.
If an `array` of `object`s is provided, each object `MUST` have a `value` property and `MAY` have a `label` property; each `value`, and each `label` when present, `MUST` be unique within the array. The `value` property `MUST` be a `string` that represents the missing value. The optional `label` property `MUST` be a `string` that provides a human-readable label for the missing value. For example:

```json
"missingValues": [
{ "value": "", "label": "OMITTED" },
{ "value": "-99", "label": "REFUSED" }
]
```

**Why strings**: `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing values which are not of their type, for example a `number` field to have missing values indicated by `-`.

Examples:

@@ -141,6 +150,8 @@ Examples:
"missingValues": ["NaN", "-"]
```

When implementations choose to convert missing values to null, this conversion to `null` `MUST` be done before any other attempted type-specific string conversion. The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to null will be done, on any value.
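
The null-conversion path described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `normalize_missing_values` and `cast_number` are hypothetical and do not come from any existing library, and the sketch covers only the "convert to `null`" strategy, not the other handling options an implementation `MAY` offer.

```python
from typing import Optional, Union

# Either form allowed by the text: plain strings, or objects with "value"/"label".
MissingSpec = list[Union[str, dict]]


def normalize_missing_values(spec: Optional[MissingSpec] = None) -> dict[str, Optional[str]]:
    """Map each missing-value string to its optional label (hypothetical helper)."""
    if spec is None:
        spec = [""]  # default value given in the specification text
    result: dict[str, Optional[str]] = {}
    for entry in spec:
        if isinstance(entry, str):
            value, label = entry, None
        else:
            value, label = entry["value"], entry.get("label")
        if value in result:
            raise ValueError(f"duplicate missing value: {value!r}")
        result[value] = label
    return result


def cast_number(raw: str, missing: dict[str, Optional[str]]) -> Optional[float]:
    """Cast a raw cell to a number, checking for missingness first."""
    # The comparison is made on the raw string, before any type-specific casting.
    if raw in missing:
        return None
    return float(raw)
```

With `["NaN", "-"]` as the declared missing values, `cast_number("-", ...)` yields `None` instead of failing the numeric cast, which is the reason the entries are strings rather than values of the field's type.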

#### `primaryKey`

A primary key is a field or set of fields that uniquely identifies each row in the table. Per SQL standards, the fields cannot be `null`, so their use in the primary key is equivalent to adding `required: true` to their [`constraints`](#constraints).
@@ -355,6 +366,55 @@ An example value for the field

See [Field Constraints](#field-constraints)

#### `categories` / `categoriesOrdered`

`string` and `integer` field types `MAY` include a `categories` property to restrict the field to a finite set of possible values (similar to an [`enum`](#enum) constraint) and indicate that the field `MAY` be loaded as a categorical data type if supported by the implementation. The `categories` property `MUST` be either (a) an array of unique values or (b) an array of objects, each with a unique `value` property. The logical representation of data in the field `MUST` exactly match one of the values in `categories`.

Suppose we have a field `fruit` with possible values `"apple"`, `"orange"`, or `"banana"`. The field definition would look like this if `categories` is (a) an array of values:

```json
{
"name": "fruit",
"type": "string",
"categories": ["apple", "orange", "banana"]
}
```

If `categories` is (b) an array of objects, each object `MAY` also have a `label` property, which, when present, `MUST` be a `string`. Labels `MUST` be unique within a `categories` definition. In our example, this allows us to store our fruit with values `0`, `1`, and `2` in an `integer` field and label them as `"apple"`, `"orange"`, and `"banana"`:

```json
{
"name": "fruit",
"type": "integer",
"categories": [
{ "value": 0, "label": "apple" },
{ "value": 1, "label": "orange" },
{ "value": 2, "label": "banana" }
]
}
```
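
As a rough sketch of how a reader might handle both forms of `categories`, the Python below normalizes either form into a single value-to-label mapping and enforces the uniqueness rules stated above. The function names `normalize_categories` and `label_for` are hypothetical, and falling back to the stringified value when no label is given is an assumption of this sketch, not something the text requires.

```python
from typing import Any, Optional


def normalize_categories(categories: list) -> dict[Any, Optional[str]]:
    """Return an insertion-ordered mapping of category value -> label (or None)."""
    mapping: dict[Any, Optional[str]] = {}
    seen_labels: set[str] = set()
    for entry in categories:
        if isinstance(entry, dict):
            value, label = entry["value"], entry.get("label")
        else:
            value, label = entry, None
        if value in mapping:
            raise ValueError(f"duplicate category value: {value!r}")
        if label is not None:
            if label in seen_labels:
                raise ValueError(f"duplicate category label: {label!r}")
            seen_labels.add(label)
        mapping[value] = label
    return mapping


def label_for(value: Any, mapping: dict[Any, Optional[str]]) -> str:
    """Look up the display label for a logical value in the field."""
    if value not in mapping:
        raise ValueError(f"{value!r} is not one of the declared categories")
    label = mapping[value]
    # Assumption of this sketch: an unlabeled value is displayed as itself.
    return label if label is not None else str(value)
```

Because Python dictionaries preserve insertion order, the mapping also keeps the order in which the categories were declared.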

When the `categories` property is defined, it `MAY` be accompanied by a `categoriesOrdered` property in the field definition. When present, the `categoriesOrdered` property `MUST` be `boolean`. When `categoriesOrdered` is `true`, implementations `SHOULD` regard the order of appearance of the values in the `categories` property as their natural order. For example:

```json
{
"name": "agreementLevel",
"type": "integer",
"categories": [
{ "value": 1, "label": "Strongly Disagree" },
{ "value": 2 },
{ "value": 3 },
{ "value": 4 },
{ "value": 5, "label": "Strongly Agree" }
],
"categoriesOrdered": true
}
```

When the property `categoriesOrdered` is `false`, implementations `SHOULD` assume that the categories do not have a natural order; when the property is not present, no assumption about the ordered nature of the values `SHOULD` be made.
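
The three cases (`true`, `false`, and absent) can be kept distinct in code. The following Python is a minimal sketch; `ordering_mode` and `category_rank` are hypothetical names, and the string constants are only one possible way to represent the three states.

```python
from typing import Any


def ordering_mode(field: dict) -> str:
    """Classify a field as 'ordered', 'unordered', or 'unspecified'."""
    if "categoriesOrdered" not in field:
        # Property absent: make no assumption about ordering.
        return "unspecified"
    flag = field["categoriesOrdered"]
    if not isinstance(flag, bool):
        raise TypeError("categoriesOrdered must be a boolean")
    return "ordered" if flag else "unordered"


def category_rank(field: dict, value: Any) -> int:
    """Position of a value in the declared order; only valid for ordered fields."""
    if ordering_mode(field) != "ordered":
        raise ValueError("field does not declare an ordering")
    values = [c["value"] if isinstance(c, dict) else c for c in field["categories"]]
    return values.index(value)
```

For the `agreementLevel` field above, `category_rank(field, 2) < category_rank(field, 4)` holds because the ranks follow the order of appearance in `categories`, not the numeric values themselves.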

An `enum` constraint `MAY` be added to a field with a `categories` property, but if so, the `enum` values `MUST` be a subset of the values in `categories`.
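
A validator might check the subset rule along these lines. This is a hedged sketch in Python: `check_enum_subset` is a hypothetical helper, and it assumes the `enum` values live under the field's `constraints` object.

```python
def check_enum_subset(field: dict) -> None:
    """Raise ValueError if any enum value is not a declared category value."""
    categories = field.get("categories", [])
    allowed = [c["value"] if isinstance(c, dict) else c for c in categories]
    enum = field.get("constraints", {}).get("enum", [])
    extra = [v for v in enum if v not in allowed]
    if extra:
        raise ValueError(f"enum values not in categories: {extra!r}")
```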

#### `missingValues`

A list of missing values for this field as per [Missing Values](#missingvalues) definition. If this property is defined, it takes precedence over the schema-level property and completely replaces it for the field without combining the values.
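
The "replace, do not combine" rule can be illustrated with a short Python sketch. The function name `effective_missing_values` is hypothetical; the fallback to `[""]` reflects the default stated in the schema-level section.

```python
def effective_missing_values(schema: dict, field: dict) -> list:
    """Resolve which missingValues list applies to a given field."""
    if "missingValues" in field:
        # Field-level definition fully replaces the schema-level one,
        # even when it is an empty list.
        return field["missingValues"]
    return schema.get("missingValues", [""])
```

Note that a field declaring `"missingValues": []` opts out of all missing-value handling for that field, regardless of what the schema declares.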
