Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optimized method for collect(::SkipMissing{<: CatArrOrSub}} #334

Merged
merged 2 commits into from
Mar 31, 2021

Conversation

nalimilan
Copy link
Member

The fallback method is currently extremely slow as it checks that the destination pool is a superset of the source pool for each entry. This optimization would be faster anyway even if we fixed that.

Fixes JuliaData/DataFrames.jl#2693.

@nalimilan nalimilan requested a review from bkamins March 31, 2021 07:32
The fallback method is currently extremely slow as it checks that the destination
pool is a superset of the source pool for each entry. This optimization
would be faster anyway even if we fixed that.
src/array.jl Outdated
# Defined for performance
function collect(x::Base.SkipMissing{<: CatArrOrSub{T}}) where {T}
r = refs(x.x)
return CategoricalVector{T}(r[r .> 0], copy(pool(x.x)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is incorrect for 0-dimensional arrays or views:

julia> y = categorical(fill("a"))
0-dimensional CategoricalArrays.CategoricalArray{String,0,UInt32}:
"a"

julia> y[y.refs .> 0]
ERROR: ArgumentError: invalid index: true of type Bool

you have to special case this scenario. Other than that it looks good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I think filter should work in all cases, right?

I also realized I forgot to call nonmissingtype.

@nalimilan
Copy link
Member Author

Printing tests need to be updated to pass on Julia 1.6.

@nalimilan nalimilan merged commit 18671b1 into master Mar 31, 2021
@nalimilan nalimilan deleted the nl/skipmissing branch March 31, 2021 11:23
nalimilan added a commit that referenced this pull request Mar 31, 2021
The fallback method is currently extremely slow as it checks that the destination
pool is a superset of the source pool for each entry. This optimization
would be faster anyway even if we fixed that.
@nalimilan
Copy link
Member Author

Backported to 0.9.4: JuliaRegistries/General#33246

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

describe(...,:eltype) takes forever to complete on categorical columns
2 participants