Add categorical/factors dtype #261

cigrainger · 2022-06-22T21:26:35Z

Pandas: https://pandas.pydata.org/docs/user_guide/categorical.html
R: https://r4ds.had.co.nz/factors.html
Polars: https://docs.rs/polars/0.21.1/polars/docs/performance/index.html#categorical-type

cigrainger · 2022-06-22T23:02:03Z

@jsonbecker Do you have thoughts on ergonomics/ease of use with factors? What can we do to make them easy to work with and intuitive beyond copying R?

jsonbecker · 2022-06-24T13:32:26Z

I have controversial opinions about factors. In my mind, there are two reasons to use factors:

1. Data validation
2. Other functions/packages further down the chain can be smart about discrete values with orders, reducing code I have to write later.

I think reason (1) is dumb. Assertive programming of various types is better for explicit validation. On (2), I'm not sure how much the broader Nx ecosystem is going to be "smart" about these things.
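To make the point about (1) concrete, here is a minimal sketch of explicit validation without any factor type. It is plain Python rather than Explorer code, and `assert_allowed` and `ALLOWED_SIZES` are hypothetical names invented for illustration.

```python
# Hypothetical set of permitted values; not part of any library.
ALLOWED_SIZES = {"small", "medium", "large"}


def assert_allowed(values, allowed):
    """Return `values` unchanged, or raise if any value is outside `allowed`."""
    unexpected = sorted(set(values) - set(allowed))
    if unexpected:
        raise ValueError(f"unexpected values: {unexpected}")
    return values


# Validation happens at an explicit, visible point in the pipeline.
sizes = assert_allowed(["small", "large", "small"], ALLOWED_SIZES)
```

The check is an ordinary function call, so it fails loudly where it is written rather than as a side effect of a dtype.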

I find factors in R fantastic once they exist correctly because much of the ecosystem can then interpret them well. I find the ergonomics of working directly with them to be abysmal.

So how can working with factors be less abysmal? I'd say a few things help.

1. Factors have to be explicitly cast as such: nothing should attempt to coerce a factor at any time, and it's better to error when supplied a non-factor data type if a factor is expected.
2. Factors should have an easy to work with interface to the allowable values and their order. This should be easily accessed and updated without having to rewrite the data. It's really just a map of sequence and allowable values.
3. Unordered factors are pointless. Semantically they exist, but I see no reason to distinguish them at the data type level. Some operations will be order aware, some will not. If the factor is unordered, being aware of an explicit order existing won't have any effect since it's effectively random. It's sort of like setting a seed. Who cares that it's consistently in an order that doesn't matter? You can always randomize the order again yourself easily if that interface is simple to work with.
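The three points above can be sketched roughly as follows. This is an illustration in plain Python, not Explorer's or Polars' actual implementation; the `Factor` class, its fields, and `require_factor` are all hypothetical names. It assumes the order is stored as a separate label-to-rank map, so changing the order never touches the per-row codes.

```python
from dataclasses import dataclass


@dataclass
class Factor:
    """Hypothetical factor: fixed integer codes plus a separate order map."""
    codes: list   # one integer per row; reorder() never rewrites these
    labels: list  # labels[code] gives the label for a code
    rank: dict    # label -> position in the current order

    @classmethod
    def cast(cls, values, levels):
        """Explicit cast; raises instead of silently coercing unknown values."""
        labels = list(levels)
        code_of = {label: code for code, label in enumerate(labels)}
        if len(code_of) != len(labels):
            raise ValueError("levels must be unique")
        unknown = sorted({v for v in values if v not in code_of})
        if unknown:
            raise ValueError(f"values not in levels: {unknown}")
        rank = {label: position for position, label in enumerate(labels)}
        return cls([code_of[v] for v in values], labels, rank)

    def values(self):
        """Decode the stored codes back into labels."""
        return [self.labels[code] for code in self.codes]

    def reorder(self, levels):
        """Replace the order map only; the stored codes stay as they are."""
        if sorted(levels) != sorted(self.labels):
            raise ValueError("new order must contain exactly the same levels")
        self.rank = {label: position for position, label in enumerate(levels)}

    def sorted_values(self):
        """An order-aware operation: sort labels by their current rank."""
        return sorted(self.values(), key=self.rank.__getitem__)


def require_factor(column):
    """Error on non-factor input rather than coercing it."""
    if not isinstance(column, Factor):
        raise TypeError(f"expected Factor, got {type(column).__name__}")
    return column


priority = Factor.cast(["low", "high", "mid", "low"], ["low", "mid", "high"])
ascending = priority.sorted_values()
priority.reorder(["high", "mid", "low"])
descending = priority.sorted_values()
```

Every factor here carries an order, so there is no separate unordered type; operations that don't care about order can simply ignore `rank`.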

josevalim · 2023-01-11T10:38:39Z

Categorical types are in, mostly for integration with Nx. We can revisit this issue with more integrated features later on if desired.

cigrainger modified the milestones: v0.2, v0.3 Jun 22, 2022

cigrainger self-assigned this Jun 22, 2022

cigrainger added the kind:feature (New feature or request) and note:discussion (Further information is requested) labels Jun 22, 2022

josevalim removed this from the v0.3 milestone Sep 9, 2022

philss mentioned this issue Jan 8, 2023

Add basic support for categorical dtype #464

Merged

josevalim closed this as completed Jan 11, 2023
