Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

describe(...,:eltype) takes forever to complete on categorical columns #2693

Closed
pgagarinov opened this issue Mar 30, 2021 · 20 comments · Fixed by #2694 or JuliaData/CategoricalArrays.jl#334
Milestone

Comments

@pgagarinov
Copy link

pgagarinov commented Mar 30, 2021

describe takes forever to complete when run on categorical columns of Union{Missing, String} type

Let us create a dataframe of random repeated strings.

using DataFrames, Random
rand_addresses = [randstring(20) for k in 1:6500];
data_mat=(rand([rand_addresses;missing],500000,30));
df = DataFrame(data_mat);
@time describe(df,:eltype);
`>  0.274541 seconds (813 allocations: 150.049 MiB)`

@time collect(skipmissing(df.x1));
> 0.008496 seconds (20 allocations: 5.001 MiB)

now let us make all columns categorical

categorical!(df);

and now this takes forever!:

@time collect(skipmissing(df.x1));

and as well as this:

@time describe(df,:eltype);

Julia 1.6
DataFrames v0.22.6

@pdeffebach
Copy link
Contributor

Can you test whether collect(skipmissing(x)) has different timing for categorical vs. non-categorical columns?

@pgagarinov
Copy link
Author

@pdeffebach @time collect(skipmissing(df.x1)); takes forever if x1 is a categorical column in the example above. For plain columns there is no such problem. But why does @describe have to call collect when I just request the element type? Isn't the element type known in advance?

@bkamins
Copy link
Member

bkamins commented Mar 30, 2021

I think we should skip collect altogether.

@pdeffebach
Copy link
Contributor

I think we need collect because a lot of the functions used allocate anyway, like median, we also calculate last, which needs a Vector. Plus we allow people to use their own functions, and want things to "just work".

We only do this is there are missing values, which is the case in your example.

@bkamins
Copy link
Member

bkamins commented Mar 30, 2021

Currently the docstring says something else:

For columns allowing for missing values, the vector is wrapped in a call to skipmissing: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.

Also median will collect whatever you pass to it, but at least we do it in a lazy way so it is not done when not needed.

@pdeffebach
Copy link
Contributor

interesting! I can explore this in a PR.

@pdeffebach
Copy link
Contributor

Regardless, this is still likely a CategoricalArrays bug, right? collect(skipmissing(x)) should not be exceptionally slow.

@bkamins
Copy link
Member

bkamins commented Mar 30, 2021

@pdeffebach - right, it has been fixed AFAICT

@pgagarinov

Please update CategoricalArrays.jl and DataFrames.jl, currently you will see for your data:

julia> @time describe(df, :eltype);
  0.162089 seconds (813 allocations: 150.049 MiB, 4.00% gc time)

julia> @time describe(df, :eltype);
  0.136962 seconds (813 allocations: 150.049 MiB, 3.13% gc time)

julia> @time collect(skipmissing(df.x1));
  0.021541 seconds (630 allocations: 5.044 MiB, 76.99% compilation time)

julia> @time collect(skipmissing(df.x1));
  0.005248 seconds (20 allocations: 5.001 MiB)

@bkamins
Copy link
Member

bkamins commented Mar 30, 2021

This is without the PR I am preparing. We have made a lot of such improvements with @nalimilan when working on faster joins as it turned out that it were the "pooled" vectors that were most offending performance-wise.

@bkamins bkamins added this to the 1.0 milestone Mar 30, 2021
@pgagarinov
Copy link
Author

pgagarinov commented Mar 30, 2021

@bkamins
I'm using the latest versions of both packages, see below.
image

I cannot reproduce the speedy results you provided (for some strange reason).

The slowdowns are only for relatively long strings (as per my example in description), it is not like all kinds of data types cause this slowdown.

@bkamins
Copy link
Member

bkamins commented Mar 30, 2021

this is strange. What timings do you get?

I thought you are on the older version of DataFrames.jl as the following:

df = DataFrame(data_mat);

and

categorical!(df);

are deprecated. I would recommend you to run your code with deprecation warnings turned on as we are going to release DataFrames.jl 1.0 which will error on these lines.

@pgagarinov
Copy link
Author

  1. Here are my timings. The line in red never finishes

image

  1. Thanks, will now run with deprecated warnings switched on.

  2. I've triple-checked my results on 3 different linux machines - same problem everywhere.

@nalimilan
Copy link
Member

I also see this (after replacing deprecated calls with the new syntax). That's because collect ends up calling push! repeatedly, and it needs to check for each value whether the source pool is a subset of the destination pool. This is a known issue and the only solution is to have a global table indicating whether pools are subsets/supersets/equal (or maybe more simply in this case, storing the hash of the pool and comparing it). A workaround would be to add a special method for collect(::SkipMissing{<:CategoricalArray})).

@bkamins
Copy link
Member

bkamins commented Mar 30, 2021

It is strange I did not see this - maybe I did some mistake.

collect(::SkipMissing{<:CategoricalArray}))

I would add it.

(I thought the issue was fixed 😄)

@bkamins
Copy link
Member

bkamins commented Mar 30, 2021

@pgagarinov - also please note that CategoricalVector was not designed for a use case you describe. It is optimized for cases when the number of groups is small. Use PooledVector if you have very many (but still duplicated) unique values.

@pgagarinov
Copy link
Author

pgagarinov commented Mar 31, 2021

@pgagarinov - also please note that CategoricalVector was not designed for a use case you describe. It is optimized for cases when the number of groups is small. Use PooledVector if you have very many (but still duplicated) unique values.

Unfortunately, I don't have a choice - I want columns with repeated values one hot encoded for the gradient boosting model in MLJ.jl and in order to do that those columns need to be categorical, otherwise they are not treated as MultiClass{n} by MLJScientificTypes and won't be one-hot encoded automatically:

image

categorical!(df); are deprecated.

You may have your reasons but I can say that deprecating categorical! makes things less user-friendly as the recommended alternative is quite long and more difficult to remember.

This is without the PR I am preparing. We have made a lot of such improvements with @nalimilan when working on faster joins >as it turned out that it were the "pooled" vectors that were most offending performance-wise.

Does this mean we can expect better results for joins in Julia in this benchmark: https://h2oai.github.io/db-benchmark/ ?

@bkamins
Copy link
Member

bkamins commented Mar 31, 2021

deprecating categorical! makes things less user-friendly as the recommended alternative is quite long and more difficult to remember.

We would prefer not to deprecate it, but Julia currently does not support conditional dependencies. The deprecation is verbose as it is fully general. A short version of the deprecation would be:

transform!(df, cols .=> categorical, renamecols=false)

where cols is the list of columns you want to change to catregorical.

Does this mean we can expect better results for joins in Julia in this benchmark

Yes

@nalimilan
Copy link
Member

Actually there's an even shorter replacement for categorical! if you want to replace all columns and not just those holding strings: mapcols!(categorical, df).

@bkamins
Copy link
Member

bkamins commented Mar 31, 2021

yes, but I assume that typically one wants to transform only some columns I think. As an Idea we might add cols argument to mapcols and mapcols! similar to what we do in other functions (like disallowmissing etc.)

@nalimilan
Copy link
Member

CategoricalArrays 0.9.4 should fix this: JuliaRegistries/General#33246.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants