describe(...,:eltype) takes forever to complete on categorical columns #2693
Comments
Can you test whether
@pdeffebach
I think we should skip
I think we need collect because many of the functions used allocate anyway. We only do this if there are missing values, which is the case in your example.
Currently the docstring says something else:
Also
Interesting! I can explore this in a PR.
Regardless, this is still likely a CategoricalArrays bug, right?
@pdeffebach - right, it has been fixed AFAICT. Please update CategoricalArrays.jl and DataFrames.jl; currently you will see for your data:
This is without the PR I am preparing. We made a lot of such improvements with @nalimilan when working on faster joins, as it turned out that the "pooled" vectors were the worst offenders performance-wise.
@bkamins I cannot reproduce the speedy results you provided (for some strange reason). The slowdowns occur only for relatively long strings (as in the example in my description); not all data types cause this slowdown.
This is strange. What timings do you get? I thought you were on an older version of DataFrames.jl, as the following:
and
are deprecated. I would recommend running your code with deprecation warnings turned on, as we are going to release DataFrames.jl 1.0, which will error on these lines.
I also see this (after replacing deprecated calls with the new syntax). That's because
It is strange that I did not see this - maybe I made a mistake.
I would add it. (I thought the issue was fixed 😄)
@pgagarinov - also please note that
Unfortunately, I don't have a choice: I want columns with repeated values to be one-hot encoded for the gradient boosting model in MLJ.jl. To do that, those columns need to be categorical; otherwise they are not treated as Multiclass{n} by MLJScientificTypes and won't be one-hot encoded automatically:
You may have your reasons but I can say that deprecating
Does this mean we can expect better results for joins in Julia in this benchmark: https://h2oai.github.io/db-benchmark/ ?
We would prefer not to deprecate it, but Julia currently does not support conditional dependencies. The deprecation is verbose as it is fully general. A short version of the deprecation would be:
where
Yes
Actually there's an even shorter replacement for
Yes, but I assume that typically one wants to transform only some columns. As an idea, we might add
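The snippets referred to in the comments above were lost from this copy of the thread. As a minimal sketch only, assuming DataFrames.jl 0.22 or later with CategoricalArrays.jl, the deprecated categorical!(df) call can be replaced along these lines. The example data and column names are hypothetical, and whether these match the replacements originally posted is an assumption:

```julia
using DataFrames, CategoricalArrays

# Hypothetical example data; x1 allows missing values, as in the report.
df = DataFrame(x1 = ["a", "b", "a", missing], x2 = ["u", "v", "v", "u"])

# Convert only selected columns, keeping their names unchanged.
transform!(df, :x1 => categorical, renamecols = false)

# Convert every column in place (x1 is simply converted again).
mapcols!(categorical, df)
```

The transform! form is the more general one, since the column selector can be narrowed to just the columns that should become categorical, which is the case raised in the last comment.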
CategoricalArrays 0.9.4 should fix this: JuliaRegistries/General#33246.
describe takes forever to complete when run on categorical columns of Union{Missing, String} type.
Let us create a dataframe of random repeated strings.
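The code that built the dataframe is missing from this copy of the report. The following is a sketch of what such a setup might look like, not the original code: the pool size, string length, row count, and the second column are all hypothetical, chosen only to give long, repeated strings in columns of Union{Missing, String} type.

```julia
using DataFrames, Random

Random.seed!(1)

# Hypothetical sizes: 1,000 distinct random strings of 50 characters each,
# sampled with repetition into 100,000 rows.
pool = [randstring(50) for _ in 1:1_000]
n = 100_000
df = DataFrame(x1 = rand(pool, n), x2 = rand(pool, n))

# Widen the element type to Union{Missing, String} and add one missing value,
# so that the skipmissing code path is actually exercised.
allowmissing!(df)
df.x1[1] = missing
```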
@time collect(skipmissing(df.x1));
> 0.008496 seconds (20 allocations: 5.001 MiB)
Now let us make all columns categorical:
categorical!(df);
And now this takes forever:
@time collect(skipmissing(df.x1));
as does this:
@time describe(df,:eltype);
Julia 1.6
DataFrames v0.22.6