do not use collect in describe #2694

bkamins · 2021-03-30T19:05:57Z

@pdeffebach - some tests (correctness and performance) would be welcome if you can spare some time.

In my opinion this change should be OK: it is rather the called functions that should add support for being passed a skipmissing value, and I would rather fix this limitation there if possible.

nalimilan · 2021-03-30T19:12:17Z

Have you benchmarked this? I would expect collecting to be faster at least when we compute the median, since it needs to call collect anyway. Other operations can benefit from not having to skip missing values.

bkamins · 2021-03-30T19:15:37Z

median will call collect anyway (it always does). Here is a benchmark on the example from #2693:

this PR:

julia> @time describe(df);
  1.757966 seconds (2.24 k allocations: 150.133 MiB, 0.75% gc time)

julia> @time describe(df);
  1.745534 seconds (2.24 k allocations: 150.133 MiB, 0.36% gc time)

0.22.6 release:

julia> @time describe(df);
  1.715486 seconds (2.27 k allocations: 264.559 MiB, 0.33% gc time)

julia> @time describe(df);
  1.700200 seconds (2.27 k allocations: 264.559 MiB, 0.28% gc time)

So we have:

much less memory allocated (150 MiB vs. 264 MiB, with a similar number of allocations)
slightly worse run time (probably because some functions are faster when called on a vector than on a SkipMissing)

nalimilan · 2021-03-30T20:06:45Z

> median will call collect anyway (it does it always).

Yeah actually we could call median! if we do that after calling all other functions. That would avoid a copy.
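
The ordering idea in this comment can be sketched as follows. This is only an illustration using Base and the Statistics standard library, not the code in describe; the variable names are made up for the example.

```julia
using Statistics

x = [3, 1, missing, 2, 5]

# One copy: materialize the non-missing values once.
v = collect(skipmissing(x))

# Run the order-independent statistics first, while `v` is still untouched.
lo, hi = extrema(v)
avg = mean(v)

# `median!` is allowed to reorder `v` in place, so it goes last.
# No second copy is needed for it.
med = median!(v)
```

By contrast, `median(skipmissing(x))` has to build its own temporary vector, which is the extra copy the comment refers to.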

bkamins · 2021-03-30T20:11:50Z

That would apply only if we decided to keep collect. In this PR I have removed collect.

pdeffebach · 2021-03-30T20:41:33Z

I see that there is this warning in describe:

If custom functions are provided, they are called repeatedly with the
vector corresponding to each column as the only argument. For columns
allowing for missing values, the vector is wrapped in a call to
skipmissing: custom functions must therefore support such objects (and
not only vectors), and cannot access missing values.

To be honest I'm not sure people will read this or follow it. Given there isn't a huge performance improvement, I would err on the side of safety and just use collect and get rid of that note.
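
A small sketch of the concern, assuming two hypothetical custom functions (neither is part of DataFrames.jl). One is written against plain iteration and accepts a SkipMissing; the other relies on `length`, which SkipMissing does not provide.

```julia
x = [4, missing, 2, 9]
sm = skipmissing(x)

# Hypothetical custom function that only iterates: works on vectors and SkipMissing.
iter_mean(itr) = sum(itr) / count(_ -> true, itr)

# Hypothetical custom function that assumes a vector: it needs `length`.
naive_mean(v) = sum(v) / length(v)

a = iter_mean(sm)              # works on the lazy wrapper
b = naive_mean(collect(sm))    # works, at the cost of a copy

# Calling naive_mean(sm) directly throws, because SkipMissing has no `length`.
threw = try
    naive_mean(sm)
    false
catch
    true
end
```

A user who writes something like `naive_mean` without reading the note gets an error rather than a wrong answer, which is the usability cost being weighed here.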

bkamins · 2021-03-30T21:59:38Z

OK - @nalimilan - if you agree to add custom collect for skipmissing in CategoricalArrays.jl I would just update the docstring in this PR and retain collect.

bkamins · 2021-03-30T22:14:04Z

I have reverted the commit and just updated the docstring.

nalimilan · 2021-03-31T07:19:24Z

Sorry, actually I wonder whether the best solution would be to call collect only internally as an optimization when we compute the median (via median!), but always pass the SkipMissing iterators to custom functions. That way we would still be able to avoid making copies at all when we don't compute the median or quantiles. Otherwise we would be locked into a suboptimal implementation in terms of performance. What do you think?

Here's a simple benchmark:

julia> df = DataFrame(rand([1:10; missing], 500000, 30), :auto);

# With collect
julia> @btime describe(df, :mean, :min, :max, :nmissing, :eltype);
  210.184 ms (896 allocations: 150.05 MiB)

# Without collect
julia> @btime describe(df, :mean, :min, :max, :nmissing, :eltype);
  76.155 ms (356 allocations: 30.94 KiB)
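
The source of the difference can be reproduced on a single column without DataFrames. This is a rough sketch using only Base; `sum_skip` and `sum_collect` are illustrative names, and `@allocated` is used instead of BenchmarkTools, so the numbers will not match the ones above.

```julia
x = rand([1:10; missing], 500_000)

# Reduce directly over the lazy wrapper.
sum_skip(v) = sum(skipmissing(v))

# Materialize the non-missing values first, then reduce.
sum_collect(v) = sum(collect(skipmissing(v)))

# Warm up so that compilation is not counted in the measurements.
sum_skip(x)
sum_collect(x)

bytes_skip = @allocated sum_skip(x)
bytes_collect = @allocated sum_collect(x)

println("skipmissing only: ", bytes_skip, " bytes")
println("with collect:     ", bytes_collect, " bytes")
```

Both calls return the same sum; only the collecting version pays for a temporary vector of the non-missing values.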

bkamins · 2021-03-31T08:14:48Z

Let us do the following. We collect if:

we calculate the median or quantiles
the user passes a custom function (which, as @pdeffebach noted, might not support SkipMissing)

(i.e. when we use only internal functions that do not need a collected vector, we would not collect)

I will do an update to the PR.
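
The rule proposed in this comment could be expressed roughly like this. It is a hypothetical sketch: `ORDER_STATS`, `is_custom` and `needs_collect` are invented names, not DataFrames.jl internals, and it assumes custom statistics are given as `function => name` pairs.

```julia
# Hypothetical: built-in statistics that need a materialized vector.
const ORDER_STATS = (:median, :q25, :q75)

# Assumption: a custom statistic is passed as a `function => name` pair.
is_custom(stat) = stat isa Pair

# Collect only if some requested statistic needs it.
needs_collect(stats) = any(s -> is_custom(s) || s in ORDER_STATS, stats)

needs_collect([:mean, :min, :max, :nmissing, :eltype])  # false
needs_collect([:mean, :median])                          # true
needs_collect([:mean, sum => :total])                    # true
```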

nalimilan · 2021-03-31T09:24:37Z

Though that would mean that users cannot benefit from maximum performance if they pass custom functions. Do we really expect that calling collect manually would be annoying for users? They should be used to handling skipmissing as it's basically needed everywhere.

bkamins · 2021-03-31T10:03:05Z

The tension is between:

calling collect once, after which all functions can use the collected vector, which is faster for functions that need one
not calling collect, which means that functions that do not need a collected vector allocate less memory

Let us wait for @pdeffebach to comment on what he thinks as an end user 😄.

bkamins · 2021-04-09T21:48:26Z

@pdeffebach - could you comment on what you feel is preferable here given the discussion we had above? Thank you!

pdeffebach · 2021-04-09T22:22:46Z

Oh, sorry I missed this!

I'm fine with not using collect, I guess. I would hate for very large datasets to be super laggy at describe.

I do think that the issue with custom functions is important. People should be able to do things other than just work with skipmissing. Here's what I propose:

Don't use collect in general. Remove median from the default output and collect only when median or :q25 etc. are requested.
When people pass custom functions, pass the whole vector. The user can deal with missings on their own.

pdeffebach · 2021-04-10T19:22:53Z

Sorry for bikeshedding with a new proposal. I guess, I don't have a super strong opinion on this. I think reporting the median is a good idea, but understand the appeal of not allocating.

As for the custom functions, I'm fine with requiring that they work with a SkipMissing. You are right, Milan, that they should know how to work with this.

bkamins · 2021-04-10T23:10:21Z

OK - so I have just removed collect (as median will collect automatically anyway), reverted the docstring (as it now matches what we do), and despecialized some methods to avoid the inference-time hit from this change.
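
For readers unfamiliar with despecialization, here is a minimal sketch of the general technique using Base's `@nospecialize`. The helper `apply_stats` is hypothetical and is not the method changed in this PR; it only shows the pattern of asking the compiler not to generate a separate specialization for every column type.

```julia
# Hypothetical helper: apply each statistic function to one column.
# `@nospecialize` hints that the compiler should not specialize this method
# on the concrete type of `col` (Vector, SkipMissing over various eltypes, ...).
function apply_stats(@nospecialize(col), stats::Vector{Any})
    return Any[f(col) for f in stats]
end

res = apply_stats(skipmissing([1, missing, 3]), Any[sum, maximum])
```

The trade-off is the usual one: less compilation work per column type in exchange for dynamic dispatch inside the helper, which is cheap relative to the statistics themselves.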

bkamins · 2021-04-11T15:25:25Z

Thank you!

do not use collect in describe

7240db1

bkamins added the performance label Mar 30, 2021

bkamins added this to the 1.0 milestone Mar 30, 2021

bkamins modified the milestones: 1.0, 1.x Mar 30, 2021

change only the docs

d04ff55

pdeffebach approved these changes Mar 30, 2021

bkamins added 2 commits April 11, 2021 01:07

do not collect

bf741b3

revert docstring change

db6cc45

bkamins modified the milestones: 1.x, 1.0 Apr 10, 2021

nalimilan approved these changes Apr 11, 2021

bkamins merged commit 936d115 into main Apr 11, 2021

bkamins deleted the bk/improve_describe branch April 11, 2021 15:25

bkamins mentioned this pull request Apr 21, 2021

Improve the performance of describe() in the case of missing values. #2731

Open
