describe(...,:eltype) takes forever to complete on categorical columns #2693

pgagarinov · 2021-03-30T16:29:19Z

describe takes forever to complete when run on categorical columns of type Union{Missing, String}.

Let us create a dataframe of random repeated strings.

using DataFrames, Random
rand_addresses = [randstring(20) for k in 1:6500];
data_mat=(rand([rand_addresses;missing],500000,30));
df = DataFrame(data_mat);
@time describe(df,:eltype);
> 0.274541 seconds (813 allocations: 150.049 MiB)

@time collect(skipmissing(df.x1));
> 0.008496 seconds (20 allocations: 5.001 MiB)

Now let us make all columns categorical:

categorical!(df);

Now this takes forever:

@time collect(skipmissing(df.x1));

as does this:

@time describe(df,:eltype);

Julia 1.6
DataFrames v0.22.6


pdeffebach · 2021-03-30T18:27:31Z

Can you test whether collect(skipmissing(x)) has different timing for categorical vs. non-categorical columns?

pgagarinov · 2021-03-30T18:42:30Z

@pdeffebach @time collect(skipmissing(df.x1)); takes forever if x1 is a categorical column in the example above. For plain columns there is no such problem. But why does describe have to call collect when I only request the element type? Isn't the element type known in advance?

bkamins · 2021-03-30T18:50:10Z

I think we should skip collect altogether.

pdeffebach · 2021-03-30T18:54:58Z

I think we need collect because a lot of the functions used, like median, allocate anyway. We also calculate last, which needs a Vector. Plus we allow people to use their own functions, and want things to "just work".

We only do this if there are missing values, which is the case in your example.

bkamins · 2021-03-30T18:58:51Z

Currently the docstring says something else:

> For columns allowing for missing values, the vector is wrapped in a call to skipmissing: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.

Also median will collect whatever you pass to it, but at least we do it in a lazy way so it is not done when not needed.
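To illustrate the point about laziness, here is a minimal sketch using only Base Julia. The vector x is a made-up stand-in for a column that allows missing values. It shows that the element type is available without touching the data, and that only an explicit collect materializes a new vector.

```julia
# Hypothetical column allowing missing values (illustration only).
x = [1.5, missing, 2.5, missing]

sm = skipmissing(x)                 # lazy wrapper; nothing is copied here

col_type  = eltype(x)               # Union{Missing, Float64}
lazy_type = eltype(sm)              # element type with Missing stripped
same_type = nonmissingtype(col_type)  # same answer from the column type alone

total = sum(sm)                     # reductions iterate the wrapper lazily
materialized = collect(sm)          # only this step allocates a new Vector
```

Functions that accept any iterator can be passed sm directly; collect is only needed by functions that require an actual Vector.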

pdeffebach · 2021-03-30T19:02:22Z

interesting! I can explore this in a PR.

pdeffebach · 2021-03-30T19:02:48Z

Regardless, this is still likely a CategoricalArrays bug, right? collect(skipmissing(x)) should not be exceptionally slow.

bkamins · 2021-03-30T19:03:56Z

@pdeffebach - right, it has been fixed AFAICT

@pgagarinov

Please update CategoricalArrays.jl and DataFrames.jl, currently you will see for your data:

julia> @time describe(df, :eltype);
  0.162089 seconds (813 allocations: 150.049 MiB, 4.00% gc time)

julia> @time describe(df, :eltype);
  0.136962 seconds (813 allocations: 150.049 MiB, 3.13% gc time)

julia> @time collect(skipmissing(df.x1));
  0.021541 seconds (630 allocations: 5.044 MiB, 76.99% compilation time)

julia> @time collect(skipmissing(df.x1));
  0.005248 seconds (20 allocations: 5.001 MiB)

bkamins · 2021-03-30T19:04:53Z

This is without the PR I am preparing. We have made a lot of such improvements with @nalimilan when working on faster joins, as it turned out that the "pooled" vectors were the worst offenders performance-wise.

pgagarinov · 2021-03-30T19:13:53Z

@bkamins
I'm using the latest versions of both packages, see below.

I cannot reproduce the speedy results you provided (for some strange reason).

The slowdowns are only for relatively long strings (as per my example in description), it is not like all kinds of data types cause this slowdown.

bkamins · 2021-03-30T19:20:15Z

this is strange. What timings do you get?

I thought you were on an older version of DataFrames.jl, as the following:

df = DataFrame(data_mat);

and

categorical!(df);

are deprecated. I recommend running your code with deprecation warnings turned on, as we are going to release DataFrames.jl 1.0, which will error on these lines.

pgagarinov · 2021-03-30T19:32:40Z

Here are my timings. The line in red never finishes

Thanks, I will now run with deprecation warnings switched on.
I've triple-checked my results on 3 different linux machines - same problem everywhere.

nalimilan · 2021-03-30T20:15:44Z

I also see this (after replacing deprecated calls with the new syntax). That's because collect ends up calling push! repeatedly, and push! needs to check for each value whether the source pool is a subset of the destination pool. This is a known issue and the only solution is to have a global table indicating whether pools are subsets/supersets/equal (or maybe more simply in this case, storing the hash of the pool and comparing it). A workaround would be to add a special method for collect(::SkipMissing{<:CategoricalArray}).
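Until such a method exists, a user-level way to sidestep the per-element push! is to drop missing entries with a boolean mask instead of collect. The sketch below is not the method proposed above. It assumes that indexing a CategoricalArray with a mask selects the stored integer codes and copies the pool once, rather than comparing pools for every value. The data are made up for illustration.

```julia
using CategoricalArrays

# Hypothetical small categorical vector with missing entries.
x = categorical(["a", missing, "b", "a"])

# Boolean mask of the non-missing positions.
mask = .!ismissing.(x)

# Mask indexing returns a new categorical vector in one step
# (assumption: no element-by-element push! is involved).
kept = x[mask]
```

The result still has an element type that allows missing, even though no missing values remain.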

bkamins · 2021-03-30T20:20:13Z

It is strange that I did not see this; maybe I made a mistake.

> collect(::SkipMissing{<:CategoricalArray})

I would add it.

(I thought the issue was fixed 😄)

bkamins · 2021-03-30T22:15:07Z

@pgagarinov - also please note that CategoricalVector was not designed for the use case you describe. It is optimized for cases when the number of groups is small. Use PooledVector if you have very many (but still duplicated) unique values.

pgagarinov · 2021-03-31T07:52:18Z

> @pgagarinov - also please note that CategoricalVector was not designed for the use case you describe. It is optimized for cases when the number of groups is small. Use PooledVector if you have very many (but still duplicated) unique values.

Unfortunately, I don't have a choice - I want columns with repeated values one hot encoded for the gradient boosting model in MLJ.jl and in order to do that those columns need to be categorical, otherwise they are not treated as MultiClass{n} by MLJScientificTypes and won't be one-hot encoded automatically:

> categorical!(df); are deprecated.

You may have your reasons but I can say that deprecating categorical! makes things less user-friendly as the recommended alternative is quite long and more difficult to remember.

> This is without the PR I am preparing. We have made a lot of such improvements with @nalimilan when working on faster joins, as it turned out that the "pooled" vectors were the worst offenders performance-wise.

Does this mean we can expect better results for joins in Julia in this benchmark: https://h2oai.github.io/db-benchmark/ ?

bkamins · 2021-03-31T08:12:36Z

> deprecating categorical! makes things less user-friendly as the recommended alternative is quite long and more difficult to remember.

We would prefer not to deprecate it, but Julia currently does not support conditional dependencies. The deprecation is verbose as it is fully general. A short version of the deprecation would be:

transform!(df, cols .=> categorical, renamecols=false)

where cols is the list of columns you want to change to categorical.

> Does this mean we can expect better results for joins in Julia in this benchmark

Yes

nalimilan · 2021-03-31T09:25:50Z

Actually there's an even shorter replacement for categorical! if you want to replace all columns and not just those holding strings: mapcols!(categorical, df).
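For comparison, here is a small sketch of the two replacements mentioned in this thread, run on a made-up table (the column names city, code, and n are hypothetical). It assumes DataFrames.jl and CategoricalArrays.jl are both installed and loaded.

```julia
using DataFrames, CategoricalArrays

# Hypothetical table: two string columns and one integer column.
df = DataFrame(city = ["Oslo", "Rome", "Oslo"],
               code = ["a", "b", "a"],
               n    = [1, 2, 3])

# Convert only the selected columns; renamecols=false keeps the names.
cols = [:city, :code]
transform!(df, cols .=> categorical, renamecols=false)

# Convert every column of a second table, including the integer one.
df_all = DataFrame(city = ["Oslo", "Rome", "Oslo"], n = [1, 2, 3])
mapcols!(categorical, df_all)
```

The first form leaves n as a plain integer vector, while the second makes it categorical as well, which may or may not be what a downstream model expects.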

bkamins · 2021-03-31T10:06:07Z

Yes, but I assume that typically one wants to transform only some columns. As an idea, we might add a cols argument to mapcols and mapcols!, similar to what we do in other functions (like disallowmissing etc.).

nalimilan · 2021-03-31T11:47:31Z

CategoricalArrays 0.9.4 should fix this: JuliaRegistries/General#33246.

bkamins mentioned this issue Mar 30, 2021

do not use collect in describe #2694

Merged

bkamins added this to the 1.0 milestone Mar 30, 2021

bkamins added the performance label Mar 30, 2021

nalimilan mentioned this issue Mar 31, 2021

Add optimized method for collect(::SkipMissing{<: CatArrOrSub}} JuliaData/CategoricalArrays.jl#334

Merged

nalimilan closed this as completed in JuliaData/CategoricalArrays.jl#334 Mar 31, 2021
