This repository has been archived by the owner on May 4, 2019. It is now read-only.

Nullable compatibility and coordination #93

Open
tshort opened this issue Dec 11, 2015 · 5 comments

Comments

@tshort

tshort commented Dec 11, 2015

I've drafted a package for pooled elements at the following link. The main purpose of this package is to speed up grouping and joining in DataFrames. If this is used in DataFrames, it will also reduce the use of PooledDataArrays in DataFrames.

https://github.com/tshort/PooledElements.jl

Pooled elements and arrays use an integer or integer array to reference a pool of values. This is similar to categorical data. In PooledElements.jl, I've used an integer reference of zero as a null value. Like NullableArrays, each element is type stable. I've tried to replicate the API from NullableArrays and Base.
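The idea can be sketched roughly as follows. This is a minimal Python illustration, not the Julia code in PooledElements.jl: the class name `PooledArray` and its methods are hypothetical, and the per-array pool is an assumption made to keep the example short. It shows integer references into a pool of unique values, with a reference of zero standing for null, and why grouping only needs to compare integers.

```python
class PooledArray:
    """Hypothetical sketch: each value is stored once in a pool and
    every element is an integer reference into that pool."""

    def __init__(self, values):
        self.pool = []    # unique values; reference k maps to pool[k - 1]
        self.index = {}   # value -> reference, for fast lookup while building
        self.refs = []    # one integer per element; 0 means null
        for value in values:
            self.refs.append(self._ref_for(value))

    def _ref_for(self, value):
        # None plays the role of a null value and always maps to reference 0.
        if value is None:
            return 0
        ref = self.index.get(value)
        if ref is None:
            self.pool.append(value)
            ref = len(self.pool)  # references start at 1, leaving 0 for null
            self.index[value] = ref
        return ref

    def isnull(self, i):
        return self.refs[i] == 0

    def __getitem__(self, i):
        ref = self.refs[i]
        return None if ref == 0 else self.pool[ref - 1]

    def group_indices(self):
        """Group element positions by reference, comparing integers only."""
        groups = {}
        for position, ref in enumerate(self.refs):
            groups.setdefault(ref, []).append(position)
        return groups


pa = PooledArray(["a", "b", None, "a"])
print(pa.refs)             # integer references, 0 for the null element
print(pa.group_indices())  # positions grouped without touching the strings
```

Because equal values share a reference, grouping and joining can work on the integer array and consult the pool only when the actual values are needed.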

I'm starting this issue to make sure we coordinate. Some areas of coordination include:

  • Conversions -- I haven't written any conversions to/from NullableArrays.
  • API -- So far, I haven't run into any issues trying to follow the API from NullableArrays. I did monkey-patch the anynull method here. The way it was written, it wouldn't work with AbstractArrays filled with PooledElements.

I'm not sure exactly how to do it, but it might be good to have a trait that indicates whether an AbstractArray supports Nulls. Then, it might be easier to support operations on Nullables in arrays for multiple array types.
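One possible shape for such a trait, sketched in Python with single dispatch standing in for Julia's method dispatch. Everything here is hypothetical: `supports_nulls`, `null_mask`, `NullableList`, and `PooledRefs` are illustrative names, not part of NullableArrays or PooledElements.jl. The point is that a generic `anynull` can be written once and work for any container type that opts in.

```python
from functools import singledispatch


class NullableList(list):
    """Hypothetical container in which None marks a null element."""


class PooledRefs:
    """Hypothetical container in which an integer reference of 0 marks a null."""

    def __init__(self, refs):
        self.refs = list(refs)


@singledispatch
def supports_nulls(container):
    # Default trait value: a container type does not support nulls
    # unless it explicitly opts in below.
    return False


@supports_nulls.register(NullableList)
@supports_nulls.register(PooledRefs)
def _(container):
    return True


@singledispatch
def null_mask(container):
    raise TypeError(f"{type(container).__name__} does not define null_mask")


@null_mask.register(NullableList)
def _(container):
    return [item is None for item in container]


@null_mask.register(PooledRefs)
def _(container):
    return [ref == 0 for ref in container.refs]


def anynull(container):
    """Generic check written once against the trait, not a concrete type."""
    if not supports_nulls(container):
        return False
    return any(null_mask(container))


print(anynull(NullableList([1, None, 3])))  # container opts in, has a null
print(anynull(PooledRefs([1, 2, 1])))       # container opts in, no nulls
print(anynull([1, 2, 3]))                   # plain list never opted in
```

With this layout, a new array type only has to declare the trait and provide `null_mask`; generic operations such as `anynull` do not need to be patched per type.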

@davidagold
Contributor

This sounds great. I'll keep an eye on this issue, and I'll take a look at the anynull method to see if there's any good reason not to incorporate your modification.

@quinnj
Member

quinnj commented Aug 25, 2016

Hey @tshort, have you seen https://github.com/nalimilan/CategoricalArrays.jl? I'm wondering how PooledElements.jl compares in approach, since it sounds like @nalimilan has taken a very similar one.

@tshort
Author

tshort commented Aug 30, 2016

@quinnj, I have seen @nalimilan's package. We both used code from @johnmyleswhite as a starting point. My package is on hold until the dust settles with the integration of NullableArrays into DataFrames. I'm hopeful that CategoricalArrays will meet my needs, and I won't need PooledElements.

One area of difference is that my PooledString type is an AbstractString. With that, it fits in better with standard string usage.

@nalimilan
Member

AFAIK the major differences between our packages are (@tshort, correct me if I'm wrong):

  • whether null values are supported by default (PE.jl), or by a special array type (CA.jl)
  • whether the pool is global (PE.jl) or array-specific (CA.jl) by default
  • whether only strings are supported (PE.jl), or any type (CA.jl): the former allows declaring the element type as <: AbstractString, though maybe we could do that in the latter case too
  • whether levels can be ordered (CA.jl) or not (PE.jl); CA.jl also offers the feature that levels can be reordered without changing the underlying integer codes at all, which can be useful for performance at the expense of some additional complexity (two indexes)
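To make the last point concrete, here is a rough Python sketch of reordering levels through a second index while the stored integer codes stay untouched. It illustrates the idea only and is not how CategoricalArrays.jl is implemented; `CategoricalSketch` and all of its members are hypothetical names.

```python
class CategoricalSketch:
    """Hypothetical sketch with two indexes: code -> level, and code -> rank."""

    def __init__(self, values):
        self.level_of_code = []  # index 1: code -> level, fixed at creation
        self.codes = []          # one integer code per element
        code_of = {}
        for value in values:
            if value not in code_of:
                code_of[value] = len(self.level_of_code)
                self.level_of_code.append(value)
            self.codes.append(code_of[value])
        # index 2: code -> rank in the current ordering of levels
        self.rank_of_code = list(range(len(self.level_of_code)))

    def levels(self):
        """Levels in their current order."""
        by_rank = sorted(range(len(self.level_of_code)),
                         key=lambda code: self.rank_of_code[code])
        return [self.level_of_code[code] for code in by_rank]

    def set_levels_order(self, new_order):
        """Reorder levels by rewriting only the small rank index."""
        same_levels = (len(new_order) == len(self.level_of_code)
                       and set(new_order) == set(self.level_of_code))
        if not same_levels:
            raise ValueError("new_order must be a permutation of the levels")
        rank = {level: position for position, level in enumerate(new_order)}
        self.rank_of_code = [rank[level] for level in self.level_of_code]

    def less(self, i, j):
        """Compare two elements by the current level order."""
        return self.rank_of_code[self.codes[i]] < self.rank_of_code[self.codes[j]]


ca = CategoricalSketch(["low", "high", "medium", "low"])
codes_before = list(ca.codes)
ca.set_levels_order(["low", "medium", "high"])
print(ca.codes == codes_before)  # the per-element codes were not rewritten
print(ca.levels())
```

Reordering costs work proportional to the number of levels rather than the number of elements, at the price of an extra lookup on every ordered comparison.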

@tshort
Author

tshort commented Aug 31, 2016

That's pretty accurate, @nalimilan. PE.jl does support pooling items other than strings, but that part isn't well tested or fleshed out.
