implement faster floating-point isless
#39090
Conversation
I really can't think of anything to say other than wow.
@stev47 Can you provide the before/after benchmark times? Just for curious onlookers who may not check out the branch and run the script themselves.
(force-pushed from e8eeba7 to 107668f)
I've updated the benchmarks. The times got stripped out by accident, since …
Is the only reason we are not using the following?

```julia
@inline function isless(a::T, b::T) where T<:Union{Float16,Float32,Float64}
    (isnan(a) || isnan(b)) && return !isnan(a)
    return a < b
end
```

On my computer, timings of this change in the only case where the timings are significantly different (i.e. arrays containing mixed NaNs):

```julia
a = [rand((rand(), NaN)) for _ in 1:1000000];
@btime sort($a, lt=(a,b)->isless(a,b));
# 42% faster (this PR)
# 31% faster (using isless with < directly)
```

Otherwise the changes are within 0-3%. I'm not necessarily advocating to change the PR to this, but just food for thought.
(force-pushed from 107668f to 569ca03)
NaNs are not the only difference: e.g. …
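One concrete case where `isless` and a plain `<`-based definition disagree besides NaN (possibly what the truncated comment above alludes to) is signed zeros; a quick check:

```julia
# isless implements a total order, so the two IEEE zeros are
# distinguished; < follows IEEE equality and treats them as equal.
@show isless(-0.0, 0.0)   # true: -0.0 orders strictly before 0.0
@show -0.0 < 0.0          # false: IEEE comparison says the zeros are equal
@show isless(0.0, -0.0)   # false
```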
LGTM. @StefanKarpinski, any chance you also want to take a look, since you added these originally?
base/float.jl (outdated)

```julia
# interpret as sign-magnitude integer
@inline function _fpint(x)
    IntT = signed(uinttype(typeof(x)))
```
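Only the first line of `_fpint` is visible above; a hypothetical completion of the idea (a sketch, not necessarily the PR's exact code) maps the raw bits to a two's-complement integer whose ordering matches the float total order, e.g. for `Float64`:

```julia
# Hypothetical sketch of the sign-magnitude-to-integer mapping for Float64.
# Non-negative floats already order correctly by their raw bit pattern;
# for negative floats we flip the magnitude bits (xor with typemax) so
# that more-negative floats map to smaller integers.
function fpint_sketch(x::Float64)
    ix = reinterpret(Int64, x)
    return ix < 0 ? ix ⊻ typemax(Int64) : ix
end

# Plain integer comparison now reproduces the float total order,
# including the -0.0 < 0.0 distinction:
fpint_sketch(-1.0) < fpint_sketch(-0.5) < fpint_sketch(-0.0) <
    fpint_sketch(0.0) < fpint_sketch(0.5)
```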
I just noticed that #36526 defined inttype (just a note).
Will merge in 48 hours sans objections.
Any particular knowledge of why LLVM doesn't already emit this for AMD64? I have a mild concern that this could be more expensive on other platforms, since moving values from fp to int registers can be slow (while this benchmark simply avoids ever having them in fp registers), and that this requires doing 64-bit operations without the benefit of the 64-bit double hardware.
There shouldn't be anything special about amd64 here; LLVM just doesn't seem to generate efficient code for the current suboptimal C implementation of `fpislt`.
The current C implementation already does that using union access (technically undefined behaviour in C).
I'm not entirely convinced this is true, since the NaN check always seems to be done in fp registers.

```julia
a = rand(1000000);
@btime sort($a, lt=@noinline (a,b)->isless(a,b));
# before: 188.884 ms (2 allocations: 7.63 MiB)
# after:  162.796 ms (2 allocations: 7.63 MiB)
```

Alternatively check

```julia
isless($(Ref(0.))[], $(Ref(0.))[])
# before: 3.399 ns (0 allocations: 0 bytes)
# after:  2.671 ns (0 allocations: 0 bytes)
```

Generated code by …, before:

```asm
	.text
	vmovq	%xmm0, %rax
	vmovq	%xmm1, %rcx
	testq	%rax, %rax
	sets	%dl
	setns	%sil
	cmpq	%rcx, %rax
	seta	%al
	setl	%cl
	andb	%dl, %al
	andb	%sil, %cl
	orb	%al, %cl
	vucomisd	%xmm1, %xmm0
	setnp	%dl
	andb	%cl, %dl
	vucomisd	%xmm1, %xmm1
	setp	%cl
	vucomisd	%xmm0, %xmm0
	setnp	%al
	andb	%cl, %al
	orb	%dl, %al
	retq
```

after:

```asm
	.text
	vucomisd	%xmm1, %xmm0
	jp	L56
	vmovq	%xmm0, %rax
	movabsq	$9223372036854775807, %rcx  # imm = 0x7FFFFFFFFFFFFFFF
	movq	%rax, %rdx
	xorq	%rcx, %rdx
	testq	%rax, %rax
	cmovnsq	%rax, %rdx
	vmovq	%xmm1, %rax
	xorq	%rax, %rcx
	testq	%rax, %rax
	cmovnsq	%rax, %rcx
	cmpq	%rcx, %rdx
	setl	%al
	retq
L56:
	vucomisd	%xmm0, %xmm0
	setnp	%al
	retq
```
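For anyone wanting to reproduce listings like the above, `@code_native` (or the functional `code_native` with an explicit IO) dumps the machine code generated for a given call; the exact instructions will of course vary by CPU and Julia version:

```julia
using InteractiveUtils  # provides @code_native / code_native in scripts

# Print the native code generated for isless on two Float64 arguments.
@code_native isless(1.0, 2.0)

# The same via the functional form, capturing the listing as a string
# (useful for diffing before/after a change):
asm = sprint(io -> code_native(io, isless, (Float64, Float64)))
```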
I shared some uncertainty about the choice of task benchmarked; here are their asms.
Note that we don't actually call that implementation but emit `fpislt` (lines 1168 to 1186 in c487dd0).
See https://llvm.org/docs/LangRef.html#fcmp-instruction for the meaning behind the `fcmp` condition codes.
I have an alternative (duplicating the sorting benchmarks).
Have you tested to make sure this handles all the weird edge cases properly? Specifically negative zeros, NaNs, and Infs (not doubting, just want to make sure).
Funny you should ask -- I thought so, then I retried corner cases. Arrgh. `signbit(NaN)` should be false for any normal NaN that Julia generates, because Julia only generates one NaN, and indeed. I was unaware that we introduced the use of the signed NaN to indicate that the NaN was produced by an arithmetic relationship of non-NaN values...
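The one-NaN assumption mentioned above can be checked directly: Julia's `NaN` constant is the positive quiet NaN, while negation flips the sign bit (these are standard IEEE-754 bit-pattern facts, not specific to this PR):

```julia
@show signbit(NaN)              # false: the canonical NaN is positive
@show signbit(-NaN)             # true: negation flips the sign bit
@show reinterpret(UInt64, NaN)  # 0x7ff8000000000000, the quiet-NaN pattern
@show isless(Inf, NaN)          # true: isless sorts NaN after everything
```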
Do you mean type punning using unions? AFAIU, that's perfectly fine in C.
Some context: colleagues and I dove deeply into this quite a while ago; we had the benefit of apropos counsel. We concluded something similar to the following inexpert SO quotes.
(from Stack Overflow)
Thanks, there the int-cast is actually explicit. Why is this duplicated, and can we remove the C implementation then?
Yes, though I should correct myself: accessing a non-initialized union member is not "undefined behaviour" but at most "unspecified behaviour".
Is this ready for a merge?
Ready from my end.
No objection.
@StefanKarpinski To do a final review and push the button perhaps?
If I understand this correctly, it looks like our …
I prefer a Julia implementation, but would be fine either way.
LGTM. Are there any objections / negatives to replacing the runtime intrinsics? @vtjnash Is this OK with you?
We should delete the unused code, but yeah, I have no reason to keep the old version of this, since this is apparently faster.
Previously `isless` relied on the C intrinsic `fpislt` in `src/runtime_intrinsics.c`, while the new implementation in Julia arguably generates better code, namely:

1. The NaN check compiles to a single instruction + branch amenable to branch prediction in arguably most use cases (i.e. comparing non-NaN floats), thus speeding up execution.
2. The compiler now often manages to remove the NaN computation if the embedding code has already proven the arguments to be non-NaN.
3. The actual operation compares both arguments as sign-magnitude integers instead of doing case analysis based on the sign of one argument. This symmetric treatment may generate vectorized instructions for the sign-magnitude conversion depending on how the arguments are laid out.

The actual behaviour of `isless` did not change, and apart from the Julia-specific NaN handling (which may be up for debate) the resulting total order corresponds to the IEEE-754 specified `totalOrder`.

While the new implementation no longer generates fully branchless code, I did not manage to construct a use case where this was detrimental: the saved work seems to outweigh the potential cost of a branch misprediction in all of my tests with various NaN-polluted data. Also, auto-vectorization was not effective on the previous `fpislt` either.

Quick benchmarks (AMD A10-7860K) on `sort`, avoiding the specialized algorithm:

```julia
a = rand(1000);
@btime sort($a, lt=(a,b)->isless(a,b));
# before: 56.030 μs (1 allocation: 7.94 KiB)
# after:  40.853 μs (1 allocation: 7.94 KiB)

a = rand(1000000);
@btime sort($a, lt=(a,b)->isless(a,b));
# before: 159.499 ms (2 allocations: 7.63 MiB)
# after:  120.536 ms (2 allocations: 7.63 MiB)

a = [rand((rand(), NaN)) for _ in 1:1000000];
@btime sort($a, lt=(a,b)->isless(a,b));
# before: 111.925 ms (2 allocations: 7.63 MiB)
# after:  77.669 ms (2 allocations: 7.63 MiB)
```
Done. Do we also need to get rid of the line https://github.com/JuliaLang/julia/blob/master/base/compiler/tfuncs.jl#L196 ?
Yes, I'd suggest just searching for fpislt and deleting those.
Merging in a day sans objections.
Many thanks @stev47, sorry this took so long.
* implement faster floating-point `isless` (full commit message as quoted above)
* Remove old intrinsic fpslt code

Co-authored-by: Mustafa Mohamad <[email protected]>