Given the way Arrow treats nominal variables, it might be cleaner to read them in as a PooledArray rather than a CategoricalArray: they are essentially a PooledArray, and we have recently been considering adding more support for this type in DataFrames.jl. CC @nalimilan
Good question. Looking at the docs, it seems that levels in what Arrow calls a "dictionary encoded" column can appear in an arbitrary order, which we could consider as significant or not. The answer to that question should determine whether to return a CategoricalArray (order is meaningful) or a PooledArray (order is an implementation detail).
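To make the distinction concrete, here is a minimal sketch of how the two types treat level order. It assumes the CategoricalArrays.jl and PooledArrays.jl packages are installed; the sample values are made up for illustration, and the `pool` field of a PooledArray is inspected only to show that its order is an internal detail.

```julia
using CategoricalArrays, PooledArrays

# Hypothetical sample data, for illustration only
v = ["low", "high", "medium", "low"]

# CategoricalArray: the order of levels is part of the data's meaning
c = categorical(v)
levels!(c, ["low", "medium", "high"])  # set a custom level order
ordered!(c, true)                      # allow comparisons based on that order
is_lower = c[1] < c[2]                 # "low" < "high" under the custom order

# PooledArray: the pool is only a compression detail; values behave like plain strings
p = PooledArray(v)
pool_values = p.pool                   # unique values stored in the pool
same_values = p == v                   # compares element by element with the original
```

With a CategoricalArray, reordering the levels changes the result of comparisons and sorting. With a PooledArray, nothing observable depends on the pool order, which is why a format that does not guarantee that order maps more naturally onto it.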
I guess a good way to assess this is to see whether saving a factor from R and loading it again preserves the custom order of levels. I think this also applies to Pandas.
You can check in Julia that saving a CategoricalArray using Feather.jl and loading it back retains all levels (even those that do not appear in the vector; it is enough that they are present in levels), but does not keep their order.