Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix float grouping #2791

Merged
merged 15 commits into from
Jun 28, 2021
Merged

Fix float grouping #2791

merged 15 commits into from
Jun 28, 2021

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jun 18, 2021

Fixes #2790

@bkamins bkamins added the bug label Jun 18, 2021
@bkamins bkamins added this to the patch milestone Jun 18, 2021
@bkamins bkamins requested a review from nalimilan June 18, 2021 12:05
@nalimilan
Copy link
Member

Woops. Though maybe it would be worth checking whether the data contains -0.0, and use the fast path if not? Is there also a problem with NaN and Inf?

@bkamins
Copy link
Member Author

bkamins commented Jun 18, 2021

Is there also a problem with NaN and Inf?

No, it is OK, because:

julia> isinteger(Inf)
false

julia> isinteger(NaN)
false

We could add a check against -0.0. However, I feel that doing groupby floats is rare (and in general discouraged) therefore I thought we can skip the fast path in this case.

If you would prefer to add special handling of -0.0 instead please let me know and I can add this instead of what I propose now.

@nalimilan
Copy link
Member

As you prefer, but I tend to think that it's good to support the fast path in as many cases as possible unless that's hard to do. For example, R uses floats by default when reading CSV files, so people could end up with float columns with only integers instead of Integer columns when transferring from or loading Arrow files written in R. That may even end up in a blog post comparing DataFrames.jl's performance against R. It's hard to anticipate what weird things people might do...

@bkamins
Copy link
Member Author

bkamins commented Jun 19, 2021

As you prefer, but I tend to think that it's good to support the fast path in as many cases as possible unless that's hard to do.

I have implemented it and added tests showing the consequences. Let us decide which approach is better.

src/groupeddataframe/utils.jl Outdated Show resolved Hide resolved
src/groupeddataframe/utils.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Jun 24, 2021

I will finalize this PR with the current design (treat float as integer if possible). After thinking I think it is less breaking, as we just have to add only the following condition to the current rules:

-0.0 is considered not to be an integer

(so doing it this way is less breaking)

@bkamins
Copy link
Member Author

bkamins commented Jun 26, 2021

@nalimilan - I have made the requested cleanups. This should be good to be merged. Thank you!

src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
src/groupeddataframe/utils.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
@bkamins bkamins merged commit f0b5a57 into main Jun 28, 2021
@bkamins bkamins deleted the fix_float_grouping branch June 28, 2021 11:43
@bkamins
Copy link
Member Author

bkamins commented Jun 28, 2021

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

inconsistency of groupby() for -0.0
3 participants