Use refpool optimized method for integer grouping #2610

Merged (7 commits) on Jan 31, 2021

nalimilan
Member

@nalimilan nalimilan commented Jan 23, 2021

main:

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(x=rand(1:10, 100_000), y=rand(10:20, 100_000));

julia> @btime groupby(df, :x);
  730.118 μs (31 allocations: 2.53 MiB)

julia> @btime groupby(df, [:x, :y]);
  1.091 ms (38 allocations: 2.53 MiB)

julia> df = DataFrame(x=rand(1:1000, 100_000), y=rand(1000:2000, 100_000));

julia> @btime groupby(df, :x);
  1.009 ms (31 allocations: 2.53 MiB)

julia> @btime groupby(df, [:x, :y]);
  2.690 ms (38 allocations: 2.53 MiB)

PR (updated to latest commits):

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(x=rand(1:10, 100_000), y=rand(10:20, 100_000));

julia> @btime groupby(df, :x);
  186.219 μs (58 allocations: 784.46 KiB)

julia> @btime groupby(df, [:x, :y]);
  296.452 μs (85 allocations: 785.84 KiB)

julia> df = DataFrame(x=rand(1:1000, 100_000), y=rand(1000:2000, 100_000));

julia> @btime groupby(df, :x);
  187.802 μs (58 allocations: 785.43 KiB)

# Too many combinations, falling back to generic method
julia> @btime groupby(df, [:x, :y]);
  2.982 ms (98 allocations: 2.53 MiB)

Comment on lines 121 to 124
minval, maxval = extrema(x)
# Threshold chosen with the same rationale as the row_group_slots refpool method:
# refpool approach is faster but we should not allocate too much memory either
if maxval - minval + 1 <= 2 * length(x)
Member Author

It seems that even above this threshold the refpool approach is faster. But I'm not sure how much memory we're willing to allocate for speed.
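
To make the trade-off concrete, here is a minimal sketch of offset-based grouping for integers. The function `offset_groups` is a hypothetical illustration, not the code in this PR: it only shows why the lookup table grows with `maxval - minval + 1` rather than with the number of rows, which is what the threshold guards against. It ignores the overflow issues discussed later in the review and assumes a non-empty input.

```julia
# Hypothetical helper, not the DataFrames.jl implementation: assigns a group
# index to each element of an integer vector using an offset instead of hashing.
function offset_groups(x::AbstractVector{<:Integer})
    minval, maxval = extrema(x)           # errors on an empty vector
    nslots = maxval - minval + 1          # one slot per value in the range
    # Same spirit as the threshold above: give up when the range is much
    # wider than the data, since the lookup table would waste memory.
    nslots <= 2 * length(x) || return nothing
    seen = falses(nslots)                 # which slots are actually used
    for v in x
        seen[v - minval + 1] = true
    end
    # Compress the used slots to consecutive group numbers 1, 2, ...
    slot_to_group = cumsum(seen)
    return [slot_to_group[v - minval + 1] for v in x]
end

dense = offset_groups([10, 12, 10, 15])   # range 10:15 needs 6 slots, 6 <= 8
sparse = offset_groups([1, 1_000_000])    # range too wide: returns nothing
```

With a higher threshold the second call would also take the table-based path, at the cost of a million-slot table for two rows.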

@nalimilan
Member Author

Last commit will fail until JuliaData/Missings.jl#126.

@bkamins
Member

bkamins commented Jan 24, 2021

Looks very nice. We only need to make sure we correctly disable this path in corner cases (like a BigInt vector that has very large values but a small range), as otherwise we will have overflow.

Comment on lines 106 to 107
@assert max < typemax(Int) - 1
@assert typemin(Int) <= widen(max) - widen(min) + 1 < typemax(Int)
Member

Small suggestions:

  1. check that min is less than or equal to max
  2. then there is no need to check typemin(Int)
  3. also, +2 is needed as we potentially have Missing to handle
Suggested change
- @assert max < typemax(Int) - 1
- @assert typemin(Int) <= widen(max) - widen(min) + 1 < typemax(Int)
+ @assert min <= max < typemax(Int) - 1
+ @assert widen(max) - widen(min) + 2 < typemax(Int)
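
A small sketch of why `widen` matters in this check: plain `Int` subtraction wraps silently, while the widened computation gives the exact range size. The helper `range_fits` is a hypothetical name used only for illustration, and it covers just the range-size part of the suggested assertions.

```julia
lo, hi = typemin(Int), typemax(Int)

# Plain Int arithmetic wraps around: hi - lo is -1, so the "size" comes out as 0.
wrapped = hi - lo + 1

# widen promotes to the next larger integer type (Int128 for Int64, Int64 for
# Int32), so the subtraction is exact. The +2 leaves one extra slot for missing.
exact = widen(hi) - widen(lo) + 2

# Hypothetical helper: true when the number of slots (plus one for missing)
# fits in an Int.
function range_fits(minval::Int, maxval::Int)
    minval <= maxval || return false
    return widen(maxval) - widen(minval) + 2 < typemax(Int)
end

small_ok = range_fits(1, 10)
full_ok = range_fits(lo, hi)
```

Checking `min <= max` first guarantees the difference is non-negative, which is why the lower bound against `typemin(Int)` becomes unnecessary.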

Member Author

I had to refactor this part but I've tried to reuse your suggestion to simplify checks. Let me know what you think.

# We also have to avoid overflow
if typemin(Int) <= maxval + 1 < typemax(Int) &&
typemin(Int) <= minval <= typemax(Int) &&
widen(maxval) - widen(minval) + 1 <= 2 * length(x) < typemax(Int)
Member

As above, you have to use +2 to leave room for storing missing.

Suggested change
- widen(maxval) - widen(minval) + 1 <= 2 * length(x) < typemax(Int)
+ widen(maxval) - widen(minval) + 2 <= 2 * length(x) < typemax(Int)

Member Author

Indeed. I've changed the code to store this in an ngroups variable to make things clearer.

@bkamins
Member

bkamins commented Jan 26, 2021

I would also add a test that uses the slow path on a 32-bit machine and the fast path on a 64-bit machine (as Int is defined differently on each), and another test that uses the slow path on a 64-bit machine because the values are too large.

@nalimilan
Member Author

I've added checks and tests for overflows. I've also bumped into an already existing bug with skipmissing=true when there are only missing values so I've added more tests for that.

@bkamins
Member

bkamins commented Jan 26, 2021

Have you seen the comments I left this morning?

@nalimilan
Member Author

> I would also add a test that uses the slow path on a 32-bit machine and the fast path on a 64-bit machine (as Int is defined differently on each), and another test that uses the slow path on a 64-bit machine because the values are too large.

I've added tests that the fast path is used just below typemax(Int), and correctness tests just above it.

if eltype(x.x) >: Missing && v === missing
return x.replacement
else
return Int(v - x.offset)
Member

Are we sure v - x.offset will not overflow here?

ngroups + 1 <= 2 * length(x) <= typemax(Int)
T = eltype(x) >: Missing ? Union{Int, Missing} : Int
refpool′ = IntegerRefpool{T}(Int(ngroups))
refarray′ = IntegerRefarray(x, Int(minval) - 1, Int(ngroups) + 1)
Member

OK - I guess this line answers my question above. If we initialize offset to what you propose then we are safe. Maybe just add a comment above explaining what offset means for the IntegerRefarray type?

Member Author

I've added a comment.
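
For readers following the thread, here is a minimal sketch of how an offset-based reference array could work. `OffsetRefs` is a hypothetical stand-in for the PR's `IntegerRefarray`, reconstructed only from the snippets quoted above (offset set to `minval - 1`, missing mapped to `ngroups + 1`); the real type may differ in its fields and type parameters.

```julia
# Hypothetical lazy wrapper: exposes group references without copying the data.
struct OffsetRefs{T<:AbstractVector} <: AbstractVector{Int}
    x::T
    offset::Int        # subtracted from each value; minval - 1 maps minval to 1
    replacement::Int   # reference returned for missing
end

Base.size(r::OffsetRefs) = size(r.x)
Base.IndexStyle(::Type{<:OffsetRefs}) = IndexLinear()

function Base.getindex(r::OffsetRefs, i::Int)
    v = r.x[i]
    # Values lie in minval:maxval, so v - offset lies in 1:ngroups.
    return v === missing ? r.replacement : Int(v - r.offset)
end

x = [5, missing, 7, 5]
minval, maxval = extrema(skipmissing(x))
ngroups = maxval - minval + 1
refs = OffsetRefs(x, minval - 1, ngroups + 1)
materialized = collect(refs)
```

Because the caller only builds the wrapper after checking that `minval` is strictly greater than `typemin(Int)`, computing `minval - 1` cannot wrap, and every reference stays within `1:ngroups + 1`.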

# (note that it would be possible to allow minval and maxval to be outside of the
# range supported by Int by adding a type parameter for minval to IntegerRefarray)
if typemin(Int) < minval <= maxval < typemax(Int) &&
ngroups + 1 <= 2 * length(x) <= typemax(Int)
Member

It is interesting that 2 * length(x) can overflow on a 32-bit machine, but the result is then negative, so the condition is still correct (we simply will not use the fast path). But if you want to be correct, maybe it is better to write Int64(2) * length(x)?

Member Author

Woops, that's indeed not super clean. I've switched to Int64 (including in two similar occurrences).
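
The 32-bit behavior can be reproduced on any machine by doing the arithmetic in `Int32` explicitly. This is only an illustration of the wraparound described above; the variable names are made up for the example.

```julia
n = Int32(2)^30              # a length that is representable on a 32-bit machine

wrapped = Int32(2) * n       # 2^31 does not fit in Int32, so this wraps to typemin(Int32)
exact = Int64(2) * n         # promoted to Int64, so the product is exact

ngroups = 10
# With the wrapped value the threshold test fails and the fast path is skipped.
fast_path_wrapped = ngroups + 1 <= wrapped
# With the Int64 product the same test gives the intended answer.
fast_path_exact = ngroups + 1 <= exact
```

The wrapped version happens to fail safe, but the `Int64` version states the intent directly and does not rely on the sign of an overflowed product.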

if refpool !== nothing
return refpool, refarray
elseif x isa AbstractArray{<:Union{Real, Missing}} &&
all(v -> ismissing(v) | isinteger(v), x) &&
Member

Have you checked that using | here is optimized by the compiler correctly? What I mean is: when eltype(x) <: Integer, is this check a no-op?

Member Author

Yes, that's a good point, but I had tested it and it works (which is really nice).
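
A short sketch of what the predicate accepts, using a hypothetical wrapper named `is_integer_like`. It relies on two Base behaviors: `isinteger(missing)` returns `missing`, and `true | missing` evaluates to `true` under three-valued logic, so the non-short-circuiting `|` always yields a `Bool` here.

```julia
# Hypothetical wrapper around the check quoted above.
is_integer_like(x) = all(v -> ismissing(v) | isinteger(v), x)

floats_ok = is_integer_like([1.0, 2.0, missing])   # integer-valued floats plus missing
floats_bad = is_integer_like([1.5, 2.0])           # 1.5 is not an integer value
ints_ok = is_integer_like([1, 2, 3])               # trivially true for Integer eltype
```

For an `Integer` element type both branches are constant, which is what makes it plausible for the compiler to remove the scan, as reported in the reply above.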

@bkamins
Member

bkamins commented Jan 31, 2021

Looks good. Thank you. Can you please just make sure it is clear what happens in the points I have just raised?

@bkamins
Member

bkamins commented Jan 31, 2021

Thank you. Looks good now. These overflow things are tricky.

@nalimilan nalimilan merged commit f05fc73 into main Jan 31, 2021
@nalimilan nalimilan deleted the nl/intgrouping branch January 31, 2021 20:06