Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

do not use collect in describe #2694

Merged
merged 4 commits into from
Apr 11, 2021
Merged

do not use collect in describe #2694

merged 4 commits into from
Apr 11, 2021

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Mar 30, 2021

Fixes #2693

@pdeffebach - some tests (correctness and performance) would be welcome if you can spare some time.

In my opinion this change should be OK, as rather the called functions should add support for skipmissing value passed - and I would rather fix this limitation if possible.

@bkamins bkamins added this to the 1.0 milestone Mar 30, 2021
@nalimilan
Copy link
Member

Have you benchmarked this? I would expect collecting to be faster at least when we compute the median, since it needs to call collect anyway. Other operations can benefit from not having to skip missing values.

@bkamins
Copy link
Member Author

bkamins commented Mar 30, 2021

median will call collect anyway (it does it always). Here is a benchmark on the example from the #2693:

this PR:

julia> @time describe(df);
  1.757966 seconds (2.24 k allocations: 150.133 MiB, 0.75% gc time)

julia> @time describe(df);
  1.745534 seconds (2.24 k allocations: 150.133 MiB, 0.36% gc time)

0.22.6 release:

julia> @time describe(df);
  1.715486 seconds (2.27 k allocations: 264.559 MiB, 0.33% gc time)

julia> @time describe(df);
  1.700200 seconds (2.27 k allocations: 264.559 MiB, 0.28% gc time)

So we have:

  1. much less allocations
  2. a bit worse performance (as probably some functions are faster when called on vector rather than SkipMissings)

@nalimilan
Copy link
Member

median will call collect anyway (it does it always).

Yeah actually we could call median! if we do that after calling all other functions. That would avoid a copy.

@bkamins
Copy link
Member Author

bkamins commented Mar 30, 2021

if we decided to keep collect. In this PR I have removed collect.

@pdeffebach
Copy link
Contributor

I see that there is this warning in describe:

If custom functions are provided, they are called repeatedly with the
vector corresponding to each column as the only argument. For columns
allowing for missing values, the vector is wrapped in a call to
skipmissing: custom functions must therefore support such objects (and
not only vectors), and cannot access missing values.

To be honest I'm not sure people will read this or follow it. Given there isn't a huge performance improvement, I would err on the side of safety and just use collect and get rid of that note.

@bkamins bkamins modified the milestones: 1.0, 1.x Mar 30, 2021
@bkamins
Copy link
Member Author

bkamins commented Mar 30, 2021

OK - @nalimilan - if you agree to add custom collect for skipmissing in CategoricalArrays.jl I would just update the docstring in this PR and retain collect.

@bkamins
Copy link
Member Author

bkamins commented Mar 30, 2021

I have reverted the commit and just updated the docstring.

@nalimilan
Copy link
Member

Sorry, actually I wonder whether the best solution would be to call collect only internally as an optimization when we compute the median (via median!), but always pass the SkipMissing iterators to custom functions. That way we would still be able to avoid making copies at all when we don't compute the median or quantiles. Otherwise we would be locked into a suboptimal implementation in terms of performance. What do you think?

Here's a simple benchmark:

julia> df = DataFrame(rand([1:10; missing], 500000, 30), :auto);

# With collect
julia> @btime describe(df, :mean, :min, :max, :nmissing, :eltype);
  210.184 ms (896 allocations: 150.05 MiB)

# Without collect
julia> @btime describe(df, :mean, :min, :max, :nmissing, :eltype);
  76.155 ms (356 allocations: 30.94 KiB)

@bkamins
Copy link
Member Author

bkamins commented Mar 31, 2021

Let us do the following, we do collect if we:

  1. calculate median or quantiles
  2. user passes a custom function (which, as @pdeffebach might not support SkipMissings)

(i.e. when we use only internal functions that do not need to collect we would not collect)

I will do an update to the PR.

@nalimilan
Copy link
Member

Though that would mean that users cannot benefit from maximum performance if they pass custom functions. Do we really expect that calling collect manually would be annoying for users? They should be used to handling skipmissing as it's basically needed everywhere.

@bkamins
Copy link
Member Author

bkamins commented Mar 31, 2021

The tension is between:

  1. calling collect once and then all functions can use collected value which is faster and collects only once
  2. not calling collect which means that functions that do not need collect can allocate less memory

Let us wait for @pdeffebach to comment what he things as an end user 😄.

@bkamins
Copy link
Member Author

bkamins commented Apr 9, 2021

@pdeffebach - could you comment what you feel is preferable here given the discussion we had above? Thank you!

@pdeffebach
Copy link
Contributor

Oh, sorry I missed this!

I'm fine with not using collect, I guess. I would hate for very large datasets to be super laggy at describe.

I do think that the issue with custom functions is important. People should be able to do other things than just with skipmissing. Here's what I propose

  1. Don't use collect in general. Remove median from the default output and only collect unless median or :q25 etc. are called.
  2. When people pass custom functions, pass the whole vector. The user can deal with missings on their own.

@pdeffebach
Copy link
Contributor

Sorry for bikeshedding with a new proposal. I guess, I don't have a super strong opinion on this. I think reporting the median is a good idea, but understand the appeal of not allocating.

As for the custom functions, I'm fine with requiring that they work with a SkipMissing. You are right, Milan, that they should know how to work with this.

@bkamins
Copy link
Member Author

bkamins commented Apr 10, 2021

OK - so I have just removed collect (as median will collect it automatically anyway), reverted the docstring (as now it matches what we do) and despecialized some methods to avoid inference time hit of the change we do.

@bkamins bkamins modified the milestones: 1.x, 1.0 Apr 10, 2021
@bkamins bkamins merged commit 936d115 into main Apr 11, 2021
@bkamins bkamins deleted the bk/improve_describe branch April 11, 2021 15:25
@bkamins
Copy link
Member Author

bkamins commented Apr 11, 2021

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

describe(...,:eltype) takes forever to complete on categorical columns
3 participants