-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use refpool optimized method for integer grouping #2610
Conversation
src/dataframerow/utils.jl
Outdated
minval, maxval = extrema(x) | ||
# Threshold chosen with the same rationale as the row_group_slots refpool method: | ||
# refpool approach is faster but we should not allocate too much memory either | ||
if maxval - minval + 1 <= 2 * length(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that even above this threshold the refpool
approach is faster. But I'm not sure how much memory we're willing to allocate for speed.
Last commit will fail until JuliaData/Missings.jl#126. |
Looks very nice. We only need to make sure we correctly disable trying this path in corner cases (like |
src/dataframerow/utils.jl
Outdated
@assert max < typemax(Int) - 1 | ||
@assert typemin(Int) <= widen(max) - widen(min) + 1 < typemax(Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small suggestions:
- check that
min
is less of equal thanmax
- then no need to check
typemin(Int)
- also
+2
is needed as we potentially haveMissing
to handle
@assert max < typemax(Int) - 1 | |
@assert typemin(Int) <= widen(max) - widen(min) + 1 < typemax(Int) | |
@assert min <= max < typemax(Int) - 1 | |
@assert widen(max) - widen(min) + 2 < typemax(Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to refactor this part but I've tried to reuse your suggestion to simplify checks. Let me know what you think.
src/dataframerow/utils.jl
Outdated
# We also have to avoid overflow | ||
if typemin(Int) <= maxval + 1 < typemax(Int) && | ||
typemin(Int) <= minval <= typemax(Int) && | ||
widen(maxval) - widen(minval) + 1 <= 2 * length(x) < typemax(Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above you have to use +2
to allow to store missing
widen(maxval) - widen(minval) + 1 <= 2 * length(x) < typemax(Int) | |
widen(maxval) - widen(minval) + 2 <= 2 * length(x) < typemax(Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. I've changed the code to store this in an ngroups
variable to make things clearer.
I would also add a test that on 32-bit machine uses slow path and on 64-bit machine uses fast path (as |
I've added checks and tests for overflows. I've also bumped into an already existing bug with |
Have you seen my comments I left today in the morning? |
I've added tests that the fast path is used just below |
if eltype(x.x) >: Missing && v === missing | ||
return x.replacement | ||
else | ||
return Int(v - x.offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we sure v - x.offset
will not overflow here?
ngroups + 1 <= 2 * length(x) <= typemax(Int) | ||
T = eltype(x) >: Missing ? Union{Int, Missing} : Int | ||
refpool′ = IntegerRefpool{T}(Int(ngroups)) | ||
refarray′ = IntegerRefarray(x, Int(minval) - 1, Int(ngroups) + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I guess this line answers my question above. If we initialize offset
to what you propose then we are safe. Maybe just add a comment above what offset
means for IntegerRefarray
type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment.
src/dataframerow/utils.jl
Outdated
# (note that it would be possible to allow minval and maxval to be outside of the | ||
# range supported by Int by adding a type parameter for minval to IntegerRefarray) | ||
if typemin(Int) < minval <= maxval < typemax(Int) && | ||
ngroups + 1 <= 2 * length(x) <= typemax(Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is interesting here that 2*length(x)
can overflow if we are on 32 bit machine, but then it will be negative so the condition is correct (simply we will not use a fast path then). But if you want to be correct maybe better write Int64(2) * length(x)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woops, that's indeed not super clean. I've switched to Int64
(including in two similar occurrences).
if refpool !== nothing | ||
return refpool, refarray | ||
elseif x isa AbstractArray{<:Union{Real, Missing}} && | ||
all(v -> ismissing(v) | isinteger(v), x) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you checked that using |
here is is optimized by the compiler correctly? What I mean is that when eltype(x) <: Integer
is this check a no-op?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a good point, but I had tested it and it works (which is really nice).
Looks good. Thank you. Can you just please make sure that in the points I have just raised we are clear what happens? |
Thank you. Looks good now. These overflow things are tricky. |
main:
PR (updated to latest commits):