Handling categorical values from Arrow #110

Open
bkamins opened this issue Jan 30, 2019 · 2 comments

Comments

@bkamins
Member

bkamins commented Jan 30, 2019

Given how Arrow treats nominal variables, it might be cleaner to read them in as a PooledArray rather than a CategoricalArray: they are essentially a PooledArray already, and we have recently been considering adding more support for this type in DataFrames.jl.

CC @nalimilan

@nalimilan
Member

Good question. Looking at the docs, it seems that levels in what Arrow calls a "dictionary encoded" column can appear in an arbitrary order, which we could consider as significant or not. The answer to that question should determine whether to return a CategoricalArray (order is meaningful) or a PooledArray (order is an implementation detail).
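To make the distinction concrete, here is a minimal toy model of a dictionary-encoded column in plain Python. It is only a sketch: `encode` and `decode` are hypothetical helpers written for this illustration, not part of Arrow, Feather, or any Julia package. It shows that two dictionaries holding the same levels in a different order decode to the same column, while anything that relies on the integer codes (such as sorting) gives different results.

```python
# Toy model of a dictionary-encoded column: a list of unique levels (the
# "dictionary") plus one integer code per row. This is NOT the Arrow API;
# encode/decode are hypothetical helpers for illustration only.

def encode(values, dictionary):
    """Return the integer codes of `values` with respect to `dictionary`."""
    position = {level: i for i, level in enumerate(dictionary)}
    return [position[v] for v in values]

def decode(codes, dictionary):
    """Recover the original values from the codes and the dictionary."""
    return [dictionary[c] for c in codes]

values = ["low", "high", "medium", "low"]

# The same levels stored in two different orders.
dict_a = ["low", "medium", "high"]
dict_b = ["high", "low", "medium"]

codes_a = encode(values, dict_a)
codes_b = encode(values, dict_b)

# Both encodings decode to the same column, so the dictionary order is
# invisible if it is treated as an implementation detail.
same_values = decode(codes_a, dict_a) == decode(codes_b, dict_b) == values

# Sorting by code, however, depends on the dictionary order, which is what
# matters if the order of levels is treated as meaningful.
sorted_a = decode(sorted(codes_a), dict_a)
sorted_b = decode(sorted(codes_b), dict_b)
```

If the dictionary order is treated as significant, `sorted_a` and `sorted_b` represent two different orderings of the same data; if it is an implementation detail, only the decoded values matter and the two encodings are interchangeable.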

I guess a good way to assess this is to check whether saving a factor from R and loading it again preserves the custom order of levels. I think this also applies to Pandas.

@bkamins
Member Author

bkamins commented Jan 30, 2019

You can check in Julia that saving a CategoricalArray using Feather.jl and loading it back retains all levels (even those not present in the vector; it is enough that they are present in levels) but does not keep their order.
