do not use collect in describe #2694
Have you benchmarked this? I would expect collecting to be faster at least when we compute the median, since it needs to call |
this PR:
0.22.6 release:
So we have:
|
Yeah actually we could call |
if we decided to keep |
I see that there is this warning in
To be honest I'm not sure people will read this or follow it. Given there isn't a huge performance improvement, I would err on the side of safety and just use |
OK - @nalimilan - if you agree to add custom |
I have reverted the commit and just updated the docstring. |
Sorry, actually I wonder whether the best solution would be to call
Here's a simple benchmark:
julia> df = DataFrame(rand([1:10; missing], 500000, 30), :auto);
# With collect
julia> @btime describe(df, :mean, :min, :max, :nmissing, :eltype);
210.184 ms (896 allocations: 150.05 MiB)
# Without collect
julia> @btime describe(df, :mean, :min, :max, :nmissing, :eltype);
76.155 ms (356 allocations: 30.94 KiB) |
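The difference measured above can be illustrated on a single column using only Base and the Statistics standard library. This is a minimal sketch, not the code from this PR; the variable names `col`, `lazy` and `eager` are hypothetical and chosen only for illustration.

```julia
using Statistics

# A small column with missing values, similar in spirit to the benchmark data.
col = [1, 2, missing, 4, missing, 6]

# Lazy: skipmissing wraps the column in an iterator; no copy of the data is made.
lazy = skipmissing(col)

# Eager: collect materializes the non-missing values into a new Vector.
eager = collect(lazy)

# Reductions such as mean and extrema accept either form and agree on the result.
println(mean(lazy) == mean(eager))
println(extrema(lazy) == extrema(eager))

# median also accepts an iterator, but it has to copy the values
# internally in order to sort them, so the allocation is not avoided there.
println(median(lazy) == median(eager))
```

For statistics that are plain reductions the lazy form avoids the extra vector per column, which is consistent with the lower allocation count in the second timing.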
Let us do the following, we do
(i.e. when we use only internal functions that do not need to collect we would not collect) I will do an update to the PR. |
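One possible reading of this proposal is sketched below: collect only when a requested statistic needs a materialized vector. The constant `NEEDS_COLLECT` and the helper `nonmissing_values` are hypothetical names introduced for this sketch and are not part of DataFrames.jl.

```julia
# Hypothetical set of statistics assumed to need a sortable, materialized vector.
const NEEDS_COLLECT = (:median, :q25, :q75)

# Hypothetical helper: return the lazy iterator when every requested
# statistic is a plain reduction, and a collected Vector otherwise.
function nonmissing_values(col::AbstractVector, stats)
    itr = skipmissing(col)
    return any(in(NEEDS_COLLECT), stats) ? collect(itr) : itr
end

col = [3, missing, 1, 2]

a = nonmissing_values(col, (:mean, :min, :max))  # stays lazy
b = nonmissing_values(col, (:mean, :median))     # gets collected

println(a isa Base.SkipMissing)
println(b isa Vector{Int})
```

The cost of this design is that the helper returns different types depending on the requested statistics, so any function applied afterwards has to accept both an iterator and a vector.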
Though that would mean that users cannot benefit from maximum performance if they pass custom functions. Do we really expect that calling |
The tension is between:
Let us wait for @pdeffebach to comment what he thinks as an end user 😄. |
@pdeffebach - could you comment what you feel is preferable here given the discussion we had above? Thank you! |
Oh, sorry I missed this! I'm fine with not using collect, I guess. I would hate for very large datasets to be super laggy at I do think that the issue with custom functions is important. People should be able to do other things than just with
|
Sorry for bikeshedding with a new proposal. I guess, I don't have a super strong opinion on this. I think reporting the median is a good idea, but understand the appeal of not allocating. As for the custom functions, I'm fine with requiring that they work with a |
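To make the requirement on custom functions concrete, here is a minimal sketch of a user-defined statistic that only iterates over its argument and therefore works both on a lazy skipmissing iterator and on a collected vector. The function name `spread` is hypothetical and used only for illustration.

```julia
# Hypothetical custom statistic: distance between the largest and smallest value.
# It relies only on iteration (via extrema), not on indexing or sorting,
# so it accepts any iterable, including the result of skipmissing.
function spread(itr)
    lo, hi = extrema(itr)
    return hi - lo
end

col = [5, missing, 2, 9]

println(spread(skipmissing(col)))           # lazy input
println(spread(collect(skipmissing(col))))  # collected input
```

A function that instead indexes into or sorts its argument in place would need a real vector, which is the case the discussion above is concerned with.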
OK - so I have just removed |
Thank you! |
Fixes #2693
@pdeffebach - some tests (correctness and performance) would be welcome if you can spare some time.
In my opinion this change should be OK, as rather the called functions should add support for a skipmissing value being passed, and I would rather fix this limitation if possible.