Add categorical/factors dtype #261

Closed
cigrainger opened this issue Jun 22, 2022 · 3 comments
Assignees
Labels
kind:feature New feature or request note:discussion Further information is requested

Comments

@cigrainger
Member Author

@jsonbecker Do you have thoughts on ergonomics/ease of use with factors? What can we do to make them easy to work with and intuitive beyond copying R?

@cigrainger modified the milestones: v0.2, v0.3 on Jun 22, 2022
@cigrainger self-assigned this on Jun 22, 2022
@cigrainger added the kind:feature (New feature or request) and note:discussion (Further information is requested) labels on Jun 22, 2022
@jsonbecker

I have controversial opinions about factors. In my mind, there are two reasons to use factors:

  1. Data validation
  2. Other functions/packages further down the chain can be smart about ordered discrete values, reducing the code I have to write later.

I think reason (1) is dumb. Assertive programming of various kinds is better for explicit validation. On (2), I'm not sure how much the broader Nx ecosystem is going to be "smart" about these things.
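As a rough illustration of the "assertive programming" alternative, here is a minimal Python sketch of an explicit check that runs at the point where the constraint matters. The helper name `assert_allowed` is hypothetical and is not part of Explorer, Nx, or any library mentioned in this thread.

```python
def assert_allowed(values, allowed):
    """Hypothetical helper: fail loudly if any value is outside the allowed set."""
    allowed = set(allowed)
    # Collect every offending value so the error message is actionable.
    bad = [v for v in values if v not in allowed]
    if bad:
        raise ValueError(f"unexpected values: {bad}")
    # Return the input unchanged so the check can sit inline in a pipeline.
    return values


grades = assert_allowed(["a", "b", "a"], allowed={"a", "b", "c"})
```

The validation lives in the calling code rather than in the column's data type, which is the trade-off being argued for here.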

I find factors in R fantastic once they exist correctly because much of the ecosystem can then interpret them well. I find the ergonomics of working directly with them to be abysmal.

So how can working with factors be less abysmal? I'd say a few things help.

  1. Factors have to be explicitly cast as such. Nothing should attempt to coerce data to a factor at any time, and it's better to error when supplied a non-factor data type if a factor is expected.
  2. Factors should have an easy-to-work-with interface to the allowable values and their order. This should be easily accessed and updated without having to rewrite the data; it's really just a map of sequence and allowable values.
  3. Unordered factors are pointless. Semantically they exist, but I see no reason to distinguish them at the data type level. Some operations will be order-aware, some will not. If the factor is unordered, being aware that an explicit order exists won't have any effect, since the order is effectively random. It's sort of like setting a seed. Who cares that it's consistently in an order that doesn't matter? You can always randomize the order again yourself easily if that interface is simple to work with.
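To make the three points concrete, here is a minimal Python sketch of one possible shape for such a type. Everything in it (the `Factor` class, `cast`, `reorder`, `levels`, `sorted_values`) is a hypothetical name chosen for illustration, not Explorer's or R's API. It assumes the data is stored as integer codes, with the labels and their order kept in a small side table, so that changing the order never rewrites the data.

```python
from dataclasses import dataclass


@dataclass
class Factor:
    """Hypothetical factor: integer codes plus a small level table."""

    codes: list[int]   # one code per row; level edits never rewrite this
    labels: list[str]  # labels[code] is the value that code stands for
    order: list[int]   # codes listed from lowest to highest rank

    @classmethod
    def cast(cls, values, levels):
        # Point 1: explicit cast only; unknown values raise, nothing is coerced.
        index = {label: code for code, label in enumerate(levels)}
        unknown = sorted(set(values) - set(index), key=repr)
        if unknown:
            raise ValueError(f"values not in levels: {unknown}")
        codes = [index[v] for v in values]
        return cls(codes, list(levels), list(range(len(levels))))

    def levels(self):
        # Point 2: the allowable values, in their current order.
        return [self.labels[code] for code in self.order]

    def reorder(self, levels):
        # Point 2: updating the order touches only the level table.
        if sorted(levels) != sorted(self.labels):
            raise ValueError("reorder needs exactly the existing levels")
        self.order = [self.labels.index(label) for label in levels]

    def sorted_values(self):
        # An order-aware operation; order-unaware ones can ignore `order`.
        rank = {code: pos for pos, code in enumerate(self.order)}
        return [self.labels[c] for c in sorted(self.codes, key=rank.get)]


sizes = Factor.cast(["m", "s", "l", "m"], levels=["s", "m", "l"])
sizes.reorder(["l", "m", "s"])
```

In this sketch there is no separate unordered variant (point 3): every factor carries an order, and operations that do not care about it simply never read `order`.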

@josevalim removed this from the v0.3 milestone on Sep 9, 2022
@josevalim
Member

Categorical types are in, mostly for integration with Nx. We can revisit this issue with more integrated features later on if desired.
