
implement faster floating-point isless #39090

Merged · 3 commits · Apr 25, 2021

Conversation

stev47
Contributor

@stev47 stev47 commented Jan 4, 2021

Previously `isless` relied on the C intrinsic `fpislt` in
`src/runtime_intrinsics.c`; the new implementation in Julia
arguably generates better code, namely:

  1. The NaN-check compiles to a single instruction plus a branch amenable
    to branch prediction in arguably most use cases (i.e. comparing
    non-NaN floats), thus speeding up execution.
  2. The compiler now often manages to remove the NaN-check when the
    surrounding code has already proven the arguments to be non-NaN.
  3. The actual comparison treats both arguments as sign-magnitude
    integers instead of doing a case analysis on the sign of one
    argument. This symmetric treatment may generate vectorized
    instructions for the sign-magnitude conversion, depending on how the
    arguments are laid out.

The observable behaviour of `isless` did not change, and apart from the
Julia-specific NaN-handling (which may be up for debate) the resulting
total order corresponds to the IEEE-754 `totalOrder` predicate.

While the new implementation no longer generates fully branchless code, I
did not manage to construct a use case where this was detrimental: the
saved work seems to outweigh the potential cost of a branch
misprediction in all of my tests with various NaN-polluted data. Also,
auto-vectorization was not effective on the previous `fpislt` either.
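For context, the new implementation boils down to the following sketch (the helper `_fpint` appears verbatim later in this thread; `Base.uinttype`, `Base.signed` and `Base.IEEEFloat` are unexported Base helpers, and `_isless` here is just a local name to avoid shadowing `Base.isless`):

```julia
# Interpret the bits of a float as a sign-magnitude integer: for
# negative floats (sign bit set), flip all remaining bits so that
# plain signed-integer `<` agrees with the IEEE-754 total order.
@inline function _fpint(x)
    IntT = Base.signed(Base.uinttype(typeof(x)))
    ix = reinterpret(IntT, x)
    return ifelse(ix < zero(IntT), ix ⊻ typemax(IntT), ix)
end

@inline function _isless(a::T, b::T) where {T<:Base.IEEEFloat}
    # NaN handling first: any NaN sorts after every non-NaN value.
    (isnan(a) || isnan(b)) && return !isnan(a)
    return _fpint(a) < _fpint(b)
end
```

Under this order `-0.0` sorts strictly before `0.0` and every NaN sorts after all other values, matching the documented `isless` semantics.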

Quick benchmarks (AMD A10-7860K) on sort, avoiding the specialized
algorithm:

a = rand(1000);
@btime sort($a, lt=(a,b)->isless(a,b));
    # before: 56.030 μs (1 allocation: 7.94 KiB)
    #  after: 40.853 μs (1 allocation: 7.94 KiB)
a = rand(1000000);
@btime sort($a, lt=(a,b)->isless(a,b));
    # before: 159.499 ms (2 allocations: 7.63 MiB)
    #  after: 120.536 ms (2 allocations: 7.63 MiB)
a = [rand((rand(), NaN)) for _ in 1:1000000];
@btime sort($a, lt=(a,b)->isless(a,b));
    # before: 111.925 ms (2 allocations: 7.63 MiB)
    #  after:  77.669 ms (2 allocations: 7.63 MiB)

@stev47 stev47 requested a review from Keno January 4, 2021 18:15
@stev47 stev47 added the performance Must go faster label Jan 4, 2021
@oscardssmith
Member

I really can't think of anything to say other than wow.

@ViralBShah
Member

@stev47 Can you provide the before/after benchmark times? Just for curious onlookers who may not check out the branch and run the script themselves.

@stev47 stev47 force-pushed the feature/fp-isless branch from e8eeba7 to 107668f Compare January 5, 2021 08:00
@stev47
Contributor Author

stev47 commented Jan 5, 2021

@stev47 Can you provide the before/after benchmark times? Just for curious onlookers who may not check out the branch and run the script themselves.

I've updated the benchmarks. The times got stripped out by accident, since `git commit` removes lines starting with `#` in the commit message.

@musm
Contributor

musm commented Jan 6, 2021

Is the only reason we are not using `<` to satisfy the `totalOrder` ordering? Simply piggybacking off `<`, while accounting for the ordering, gets performance close to this PR (near identical for arrays without NaNs in most cases). The disadvantage is that it's not as fast in cases with mixed NaNs, but the advantage is that it's simpler to implement and requires fewer lines of code.

@inline function isless(a::T, b::T) where T<:Union{Float16,Float32,Float64}
    (isnan(a) || isnan(b)) && return !isnan(a)
    return a < b
end

On my computer, timings for this change in the only case where they differ significantly (i.e. arrays containing mixed NaNs):

a = [rand((rand(), NaN)) for _ in 1:1000000];
@btime sort($a, lt=(a,b)->isless(a,b));

# 42% faster  (this PR)
# 31% faster  (using isless with < directly)

Otherwise the changes are within 0-3%.

I'm not necessarily advocating changing the PR to this, but just food for thought.

@stev47 stev47 force-pushed the feature/fp-isless branch from 107668f to 569ca03 Compare January 6, 2021 10:24
@stev47
Contributor Author

stev47 commented Jan 6, 2021

Is the only reason we are not using <, is to satisfy the ordering totalOrder ? Simply piggy backing off of <, but accounting for the ordering, results in close performance to this PR (near identical for arrays without NaNs in most cases).

NaNs are not the only difference: e.g. `isless(-0., 0.)` is `true`, while `-0. < 0.` is not. Non-canonical encodings are also treated differently.
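To make the signed-zero difference concrete, a quick check:

```julia
# isless distinguishes signed zeros; < does not (IEEE-754: -0.0 == 0.0)
isless(-0.0, 0.0)  # true
-0.0 < 0.0         # false
```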

@musm musm requested a review from StefanKarpinski January 6, 2021 21:39
@musm
Contributor

musm commented Jan 6, 2021

LGTM

@StefanKarpinski any chance you also want to take a look since you added these originally?

base/float.jl Outdated

# interpret as sign-magnitude integer
@inline function _fpint(x)
IntT = signed(uinttype(typeof(x)))
Contributor

I just noticed that #36526 defined inttype (just a note)

@musm
Contributor

musm commented Jan 10, 2021

Will merge in 48 hours sans objections

@vtjnash
Member

vtjnash commented Jan 10, 2021

Any particular knowledge of why LLVM doesn't already emit this for AMD64? I have a mild concern that this could be more expensive for other platforms, since moving values from fp to int registers could be slow (while this benchmark simply avoids ever having them in fp registers), and that this requires doing 64-bit operations without the benefit of the 64-bit double hardware.

@stev47
Contributor Author

stev47 commented Jan 10, 2021

Any particular knowledge of why LLVM doesn't already emit this for AMD64?

There shouldn't be anything special about amd64 here; LLVM just doesn't seem to generate efficient code for the current suboptimal C implementation of `fpislt` in `src/runtime_intrinsics.c:838`: the equivalence to the signed-integer interpretation is not obvious, and LLVM also didn't manage to simplify that NaN-check.

since moving values from fp to int registers could be slow

The current C implementation already does that using union access (technically undefined behaviour in C).

while this benchmark simply avoids ever having them in fp registers

I'm not entirely convinced this is true, since the NaN check always seems to be done in fp registers.
But even if you force `@noinline` on the comparator (thereby forcing the arguments to be passed to the function in float registers), there is a benefit:

a = rand(1000000);
@btime sort($a, lt=@noinline (a,b)->isless(a,b));
# before: 188.884 ms (2 allocations: 7.63 MiB)
# after: 162.796 ms (2 allocations: 7.63 MiB)

Alternatively check

isless($(Ref(0.))[], $(Ref(0.))[])
# before: 3.399 ns (0 allocations: 0 bytes)
# after: 2.671 ns (0 allocations: 0 bytes)

Generated code by @code_native isless(0., 0.), before:

	.text
	vmovq	%xmm0, %rax
	vmovq	%xmm1, %rcx
	testq	%rax, %rax
	sets	%dl
	setns	%sil
	cmpq	%rcx, %rax
	seta	%al
	setl	%cl
	andb	%dl, %al
	andb	%sil, %cl
	orb	%al, %cl
	vucomisd	%xmm1, %xmm0
	setnp	%dl
	andb	%cl, %dl
	vucomisd	%xmm1, %xmm1
	setp	%cl
	vucomisd	%xmm0, %xmm0
	setnp	%al
	andb	%cl, %al
	orb	%dl, %al
	retq

after:

	.text
	vucomisd	%xmm1, %xmm0
	jp	L56
	vmovq	%xmm0, %rax
	movabsq	$9223372036854775807, %rcx # imm = 0x7FFFFFFFFFFFFFFF
	movq	%rax, %rdx
	xorq	%rcx, %rdx
	testq	%rax, %rax
	cmovnsq	%rax, %rdx
	vmovq	%xmm1, %rax
	xorq	%rax, %rcx
	testq	%rax, %rax
	cmovnsq	%rax, %rcx
	cmpq	%rcx, %rdx
	setl	%al
	retq
L56:
	vucomisd	%xmm0, %xmm0
	setnp	%al
	retq

@JeffreySarnoff
Contributor

I share some uncertainty about the choice of benchmarked task.
Earlier, I ran a different benchmark (on an Intel Xeon E-2176M CPU).
The take-away: YMMV.

function test(avec, islessfn)
  result = 0
  for idx in 1:length(avec)-1
    result += islessfn(avec[idx], avec[idx+1])
  end
  return result
end

with

@inline function _fpint1(x::Float64)
   ix = reinterpret(Int64, x)
   return ifelse(ix < zero(Int64), ix ⊻ typemax(Int64), ix)
 end

@inline isless0(a::Float64, b::Float64) = Base.isless(a, b)

@inline function isless1(a::Float64, b::Float64)
   (isnan(a) || isnan(b)) && return !isnan(a)
   return _fpint1(a) < _fpint1(b)
end

@inline function isless2(a::Float64, b::Float64)
    (isnan(a) || isnan(b)) && return !isnan(a)
    signbit(a) && !signbit(b) && return true
    return a < b
end
a10 = rand(10); a100 = rand(100); a1000 = rand(1000); 
a10000 = rand(10000); a100000 = rand(100000);

julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10; isless_fn=isless0;);
  35.146 ns (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10; isless_fn=isless1;);
  30.754 ns (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10; isless_fn=isless2;);
  29.246 ns (0 allocations: 0 bytes)

julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100; isless_fn=isless0;);
  243.970 ns (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100; isless_fn=isless1;);
  146.761 ns (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100; isless_fn=isless2;);
  207.833 ns (0 allocations: 0 bytes)

julia> @btime test(a_vec, isless_fn) setup=(a_vec=a1000; isless_fn=isless0;);
  2.333 μs (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a1000; isless_fn=isless1;);
  1.920 μs (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a1000; isless_fn=isless2;);
  1.490 μs (0 allocations: 0 bytes)

julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10000; isless_fn=isless0;);
  23.200 μs (1 allocation: 16 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10000; isless_fn=isless1;);
  12.800 μs (1 allocation: 16 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10000; isless_fn=isless2;);
  19.100 μs (1 allocation: 16 bytes)

julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100000; isless_fn=isless0;);
  227.800 μs (1 allocation: 16 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100000; isless_fn=isless1;);
  191.800 μs (1 allocation: 16 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100000; isless_fn=isless2;);
  148.700 μs (1 allocation: 16 bytes)
randsign(n) = rand((-1,1), n)
b10 = a10 .* randsign(10);
b100 = a100 .* randsign(100);
b1000 = a1000 .* randsign(1000);
b10000 = a10000 .* randsign(10000);
b100000 = a100000 .* randsign(100000);

julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10; isless_fn=isless0;);
  35.045 ns (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10; isless_fn=isless1;);
  26.479 ns (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10; isless_fn=isless2;);
  30.251 ns (0 allocations: 0 bytes)

julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100; isless_fn=isless0;);
  243.584 ns (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100; isless_fn=isless1;);
  207.833 ns (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100; isless_fn=isless2;);
  161.923 ns (0 allocations: 0 bytes)

julia> @btime test(b_vec, isless_fn) setup=(b_vec=b1000; isless_fn=isless0;);
  2.267 μs (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b1000; isless_fn=isless1;);
  1.290 μs (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b1000; isless_fn=isless2;);
  1.920 μs (0 allocations: 0 bytes)

julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10000; isless_fn=isless0;);
  22.600 μs (1 allocation: 16 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10000; isless_fn=isless1;);
  19.100 μs (1 allocation: 16 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10000; isless_fn=isless2;);
  14.800 μs (1 allocation: 16 bytes)

julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100000; isless_fn=isless0;);
  227.700 μs (1 allocation: 16 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100000; isless_fn=isless1;);
  129.400 μs (1 allocation: 16 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100000; isless_fn=isless2;);
  191.700 μs (1 allocation: 16 bytes)

@JeffreySarnoff
Contributor

Here are their asm listings:

julia> @code_native(isless0(x,y))
        .text
        pushq   %rbp
        movq    %rsp, %rbp
        vmovq   %xmm0, %rax
        vmovq   %xmm1, %rcx
        testq   %rax, %rax
        sets    %dl
        setns   %r8b
        cmpq    %rcx, %rax
        seta    %al
        setl    %cl
        andb    %dl, %al
        andb    %r8b, %cl
        orb     %al, %cl
        vucomisd        %xmm1, %xmm0
        setnp   %dl
        andb    %cl, %dl
        vucomisd        %xmm1, %xmm1
        setp    %cl
        vucomisd        %xmm0, %xmm0
        setnp   %al
        andb    %cl, %al
        orb     %dl, %al
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)

julia> @code_native(isless1(x,y))
        .text
        vucomisd        %xmm1, %xmm0
        jp      L62
        pushq   %rbp
        movq    %rsp, %rbp
        vunpcklpd       %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[0]
        vpcmpeqd        %xmm1, %xmm1, %xmm1
        vpcmpgtq        %xmm1, %xmm0, %xmm1
        movabsq $.rodata.cst16, %rax
        vxorpd  (%rax), %xmm0, %xmm2
        vblendvpd       %xmm1, %xmm0, %xmm2, %xmm0
        vmovq   %xmm0, %rax
        vpextrq $1, %xmm0, %rcx
        cmpq    %rcx, %rax
        setl    %al
        popq    %rbp
        retq
L62:    vucomisd        %xmm0, %xmm0
        setnp   %al
        retq
        nopw    %cs:(%rax,%rax)

julia> @code_native(isless2(x,y))
        .text
        pushq   %rbp
        movq    %rsp, %rbp
        vucomisd        %xmm1, %xmm0
        jp      L43
        vmovq   %xmm0, %rax
        testq   %rax, %rax
        jns     L34
        vmovq   %xmm1, %rax
        testq   %rax, %rax
        js      L34
        movb    $1, %al
        popq    %rbp
        retq
L34:    vucomisd        %xmm0, %xmm1
        seta    %al
        popq    %rbp
        retq
L43:    vucomisd        %xmm0, %xmm0
        setnp   %al
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)

@vchuravy
Member

There shouldn't be anything special about amd64 here; LLVM just doesn't seem to generate efficient code for the current suboptimal C implementation of `fpislt` in `src/runtime_intrinsics.c:838`: the equivalence to the signed-integer interpretation is not obvious, and LLVM also didn't manage to simplify that NaN-check.

Note that we don't actually call that implementation but emit

julia/src/intrinsics.cpp

Lines 1168 to 1186 in c487dd0

case fpislt: {
*newtyp = jl_bool_type;
Type *it = INTT(t);
Value *xi = ctx.builder.CreateBitCast(x, it);
Value *yi = ctx.builder.CreateBitCast(y, it);
return ctx.builder.CreateOr(
ctx.builder.CreateAnd(
ctx.builder.CreateFCmpORD(x, x),
ctx.builder.CreateFCmpUNO(y, y)),
ctx.builder.CreateAnd(
ctx.builder.CreateFCmpORD(x, y),
ctx.builder.CreateOr(
ctx.builder.CreateAnd(
ctx.builder.CreateICmpSGE(xi, ConstantInt::get(it, 0)),
ctx.builder.CreateICmpSLT(xi, yi)),
ctx.builder.CreateAnd(
ctx.builder.CreateICmpSLT(xi, ConstantInt::get(it, 0)),
ctx.builder.CreateICmpUGT(xi, yi)))));
}
on fpislt.

See https://llvm.org/docs/LangRef.html#fcmp-instruction for the meaning of `ord` and `uno`.

@JeffreySarnoff
Contributor

I have an alternative.

@inline function ieee_isless(x::T, y::T) where {T<:Base.IEEEFloat}
    signbit(x-y) || ifelse(isnan(y), !isnan(x), false)
end
using BenchmarkTools
const BT=BenchmarkTools.DEFAULT_PARAMETERS;
BT.samples=80_000; BT.time_tolerance=0.0005;

@inline function base_isless(x::T, y::T) where {T<:Base.IEEEFloat}
    Base.isless(x, y)
end

@inline function contrib_isless(a::T, b::T) where {T<:Base.IEEEFloat}
    (isnan(a) || isnan(b)) && return !isnan(a)
    return contrib_fpint(a) < contrib_fpint(b)
end
# interpret as sign-magnitude integer
@inline function contrib_fpint(x)
    IntT = Base.signed(Base.uinttype(typeof(x)))
    ix = reinterpret(IntT, x)
    return ifelse(ix < zero(IntT), ix ⊻ typemax(IntT), ix)
end

randsign(n) = rand((-1,1), n)
r10 = rand(10) .* randsign(10);
r1_000 = rand(1_000) .* randsign(1_000);
r100_000 = rand(100_000) .* randsign(100_000);

relspeedup(x, y) = abs(x-y) / x
pctspeedup(x, y) = round(relspeedup(x, y) * 100, digits=1)

function test(avec, islessfn)
    result = 0
    for idx in 1:length(avec)-1
        result += islessfn(avec[idx], avec[idx+1])
    end
    return result
end

base_isless_10 = @belapsed test(avec, base_isless) setup=(avec=r10;);
contrib_isless_10 = @belapsed test(avec, contrib_isless) setup=(avec=r10;);
ieee_isless_10 = @belapsed test(avec, ieee_isless) setup=(avec=r10;);

base_isless_1000 = @belapsed test(avec, base_isless) setup=(avec=r1_000;);
contrib_isless_1000 = @belapsed test(avec, contrib_isless) setup=(avec=r1_000;);
ieee_isless_1000 = @belapsed test(avec, ieee_isless) setup=(avec=r1_000;);

base_isless_100000 = @belapsed test(avec, base_isless) setup=(avec=r100_000;);
contrib_isless_100000 = @belapsed test(avec, contrib_isless) setup=(avec=r100_000;);
ieee_isless_100000 = @belapsed test(avec, ieee_isless) setup=(avec=r100_000;);

pctspeedup_ieee2contrib_10 = pctspeedup(ieee_isless_10, contrib_isless_10 )
2.8
pctspeedup_ieee2contrib_1000 = pctspeedup(ieee_isless_1000, contrib_isless_1000)
0.0
pctspeedup_ieee2contrib_100000 = pctspeedup(ieee_isless_100000, contrib_isless_100000)
13.8

duplicating the sorting benchmarks

@btime sort($r10, lt=(a,b)->base_isless(a,b));
#  110.730 ns (1 allocation: 160 bytes)
@btime sort($r10, lt=(a,b)->contrib_isless(a,b));
#  82.510 ns (1 allocation: 160 bytes)
@btime sort($r10, lt=(a,b)->ieee_isless(a,b));
#  66.904 ns (1 allocation: 160 bytes)

@btime sort($r1_000, lt=(a,b)->base_isless(a,b));
#  30.400 μs (1 allocation: 7.94 KiB)
@btime sort($r1_000, lt=(a,b)->contrib_isless(a,b));
#  28.400 μs (1 allocation: 7.94 KiB)
@btime sort($r1_000, lt=(a,b)->ieee_isless(a,b));
#  21.200 μs (1 allocation: 7.94 KiB)

@btime sort($r100_000, lt=(a,b)->base_isless(a,b));
#  8.654 ms (2 allocations: 781.33 KiB)
@btime sort($r100_000, lt=(a,b)->contrib_isless(a,b));
#  6.828 ms (2 allocations: 781.33 KiB)
@btime sort($r100_000, lt=(a,b)->ieee_isless(a,b));
#  6.198 ms (2 allocations: 781.33 KiB)

@oscardssmith
Member

Have you tested to make sure this handles all the weird edge cases properly? Specifically negative zeros, NaNs and Infs (not doubting, just want to make sure).

@JeffreySarnoff
Contributor

Funny you should ask -- I thought so, then I retried the corner cases. Arrgh.

`signbit(NaN)` should be `false` for any normal NaN that Julia generates, because Julia only generates one NaN, and indeed `signbit(NaN) === false`. Or so it used to be.

I was unaware that we had introduced the use of a signed NaN to indicate that the NaN was produced by an arithmetic relationship of non-NaN values...

julia> reinterpret(UInt64,(Inf-Inf))
0xfff8000000000000

julia> reinterpret(UInt64,(Inf/Inf))
0xfff8000000000000

julia> reinterpret(UInt64,(0.0/0.0))
0xfff8000000000000

julia> reinterpret(UInt64,(NaN+NaN))
0x7ff8000000000000

julia> reinterpret(UInt64,(NaN/NaN))
0x7ff8000000000000

julia> reinterpret(UInt64,(NaN+0.0))
0x7ff8000000000000

julia> reinterpret(UInt64,(0.0+NaN))
0x7ff8000000000000
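This is exactly what breaks the `signbit(x - y)` shortcut in `ieee_isless` above: comparing a value with itself can produce a negatively-signed NaN from the subtraction (on x86-64 at least, per the `reinterpret` results), so the predicate stops being irreflexive:

```julia
ieee_isless(x, y) = signbit(x - y) || ifelse(isnan(y), !isnan(x), false)

# On hardware where Inf - Inf yields a NaN with the sign bit set,
# this wrongly reports a value as less than itself, while the correct
# answer, isless(Inf, Inf), is false.
ieee_isless(Inf, Inf)
```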

@KristofferC
Member

The current C implementation already does that using union access (technically undefined behaviour in C).

Do you mean type punning using unions? AFAIU, that's perfectly fine in C.

@JeffreySarnoff
Contributor

JeffreySarnoff commented Jan 11, 2021

Some context: colleagues and I dove deeply into this quite a while ago; we had the benefit of apropos counsel. We concluded something similar to the following inexpert SO quotes.

The behavior of type punning with union changed from C89 to C99. The behavior in C99 is the same as C11. ... type punning is allowed in C99 / C11. An unspecified value that could be a trap is read when the union members are of different size.

Finally, one of the changes from C90 to C99 was to remove any restriction on accessing one member of a union when the last store was to a different one. The rationale was that the behaviour would then depend on the representations of the values. Since this point is often misunderstood, it might well be worth making it clear in the Standard.

(from Stack Overflow)

@stev47
Contributor Author

stev47 commented Jan 11, 2021

Note that we don't actually call that implementation but emit

Thanks, there the int-cast is actually explicit. Why is it duplicated, and can we remove the C implementation then?

Do you mean type punning using unions? AFAIU, that's perfectly fine in C.

Yes, though I should correct myself: accessing a non-initialized union member is not "undefined behaviour" but at most "unspecified behaviour".
Not sure if we should have the discussion here, but reading beyond the most popular Stack Overflow answer, the standard does not seem to be clear about it at the very least. Annex J of the standard also lists "the values of bytes that correspond to union members other than the one last stored into" as unspecified behaviour, though strictly speaking that too is a non-normative statement.

@ViralBShah
Member

Is this ready for a merge?

@stev47
Contributor Author

stev47 commented Jan 22, 2021

ready from my end

@JeffreySarnoff
Contributor

no objection

@ViralBShah
Member

@StefanKarpinski To do a final review and push the button perhaps?

@musm
Contributor

musm commented Jan 22, 2021

There shouldn't be anything special about amd64 here; LLVM just doesn't seem to generate efficient code for the current suboptimal C implementation of `fpislt` in `src/runtime_intrinsics.c:838`: the equivalence to the signed-integer interpretation is not obvious, and LLVM also didn't manage to simplify that NaN-check.

Note that we don't actually call that implementation but emit

julia/src/intrinsics.cpp

Lines 1168 to 1186 in c487dd0

(the `fpislt` case, quoted in full above)

on `fpislt`.
See https://llvm.org/docs/LangRef.html#fcmp-instruction for the meaning behind ord, and uno.

If I understand this correctly, it looks like our `fpislt` intrinsic implementation is doing too much work, and we could optimize that code instead of replacing it as this PR does. Is there any benefit to that? `Base.isless` is the only place `fpislt` is used, so it looks like we could either optimize the intrinsic or replace it with Julia code and delete the intrinsic.

@stev47
Contributor Author

stev47 commented Jan 23, 2021

I prefer a Julia implementation, but would be fine either way.
The intrinsic `fpislt` still has value, I believe, since it generates branch-free code if someone really wants that. It could be improved in a separate pull request.
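If branch-free code is the goal, the same trick can also be expressed in plain Julia by replacing the short-circuit with `ifelse` (an illustrative sketch, not part of this PR; `isless_bf` and `_fpint_bf` are made-up names):

```julia
@inline function _fpint_bf(x::Float64)
    ix = reinterpret(Int64, x)
    return ifelse(ix < 0, ix ⊻ typemax(Int64), ix)
end

# Both arms are evaluated unconditionally and `ifelse` selects the
# result, so no branch is required for the NaN case.
@inline function isless_bf(a::Float64, b::Float64)
    anan = isnan(a)
    bnan = isnan(b)
    return ifelse(anan | bnan, !anan, _fpint_bf(a) < _fpint_bf(b))
end
```

Whether this actually compiles branch-free depends on the target and the surrounding code, so it would need the same kind of benchmarking as above.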

@musm
Contributor

musm commented Apr 23, 2021

LGTM. Are there any objections / negatives to replacing the runtime intrinsics ? @vtjnash Is this OK with you?
Otherwise I think we should go ahead and merge this performance improvement.

@vtjnash
Member

vtjnash commented Apr 24, 2021

We should delete the unused code, but yeah, I have no reason to keep the old version of this, since this is apparently faster.

stev47 and others added 2 commits April 24, 2021 13:38
@musm musm force-pushed the feature/fp-isless branch from 569ca03 to a5822e7 Compare April 24, 2021 17:51
@musm
Contributor

musm commented Apr 24, 2021

We should delete the unused code, but yeah, I have no reason to keep the old version of this, since this is apparently faster.

Done. Do we also need to get rid of this line: https://github.com/JuliaLang/julia/blob/master/base/compiler/tfuncs.jl#L196 ?

@vtjnash
Member

vtjnash commented Apr 24, 2021

Yes, I'd suggest just searching for fpislt and deleting those:

$ git grep fpislt
.clang-format:  - fpislt_n
base/compiler/tfuncs.jl:add_tfunc(fpislt, 2, 2, cmp_tfunc, 1)
base/float.jl:isless( x::Float16, y::Float16) = fpislt(x, y)
base/float.jl:isless( x::Float32, y::Float32) = fpislt(x, y)
base/float.jl:isless( x::Float64, y::Float64) = fpislt(x, y)
src/intrinsics.cpp:    float_func[fpislt] = true;
src/intrinsics.cpp:    case fpislt: {
src/intrinsics.h:    ADD_I(fpislt, 2) \
src/julia_internal.h:JL_DLLEXPORT jl_value_t *jl_fpislt(jl_value_t *a, jl_value_t *b);
src/runtime_intrinsics.c:#define fpislt_n(c_type, nbits)                                         \
src/runtime_intrinsics.c:    static inline int fpislt##nbits(c_type a, c_type b) JL_NOTSAFEPOINT \
src/runtime_intrinsics.c:fpislt_n(float, 32)
src/runtime_intrinsics.c:fpislt_n(double, 64)
src/runtime_intrinsics.c:#define fpislt(a, b) \
src/runtime_intrinsics.c:    sizeof(a) == sizeof(float) ? fpislt32(a, b) : fpislt64(a, b)
src/runtime_intrinsics.c:bool_fintrinsic(fpislt,fpislt)

@musm
Contributor

musm commented Apr 24, 2021

Merging in a day sans objections

@musm
Contributor

musm commented Apr 25, 2021

Many thanks @stev47, sorry this took so long.

@musm musm merged commit 79920db into JuliaLang:master Apr 25, 2021
ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021
* implement faster floating-point `isless`

* Remove old intrinsic fpslt code

Co-authored-by: Mustafa Mohamad <[email protected]>
jarlebring pushed a commit to jarlebring/julia that referenced this pull request May 4, 2021
* implement faster floating-point `isless`

* Remove old intrinsic fpslt code

Co-authored-by: Mustafa Mohamad <[email protected]>
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request May 9, 2021
johanmon pushed a commit to johanmon/julia that referenced this pull request Jul 5, 2021
Labels
performance Must go faster