
implement faster floating-point isless #39090

Merged · 3 commits · Apr 25, 2021

Conversation

stev47
Contributor

@stev47 stev47 commented Jan 4, 2021

Previously `isless` relied on the C intrinsic `fpislt` in
`src/runtime_intrinsics.c`; the new implementation in Julia
arguably generates better code, namely:

  1. The NaN-check compiles to a single instruction plus a branch amenable
    to branch prediction in arguably most use cases (i.e. comparing
    non-NaN floats), thus speeding up execution.
  2. The compiler now often manages to remove the NaN-check when the
    surrounding code has already proven the arguments to be non-NaN.
  3. The actual comparison treats both arguments as sign-magnitude
    integers instead of doing a case analysis on the sign of one
    argument. This symmetric treatment may generate vectorized
    instructions for the sign-magnitude conversion, depending on how the
    arguments are laid out.

The observable behaviour of `isless` did not change, and apart from the
Julia-specific NaN-handling (which may be up for debate) the resulting
total order corresponds to the IEEE-754 `totalOrder` predicate.

While the new implementation no longer generates fully branchless code, I
did not manage to construct a use case where this was detrimental: the
saved work seems to outweigh the potential cost of a branch
misprediction in all of my tests with various NaN-polluted data. Also,
auto-vectorization was not effective on the previous `fpislt` either.
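For context, the new implementation boils down to the following sketch (the helper `_fpint` appears verbatim later in this thread; `Base.uinttype`, `Base.signed` and `Base.IEEEFloat` are unexported Base helpers, and `_isless` here is just a local name to avoid shadowing `Base.isless`):

```julia
# Interpret the bits of a float as a sign-magnitude integer: for
# negative floats (sign bit set), flip all remaining bits so that
# plain signed-integer `<` agrees with the IEEE-754 total order.
@inline function _fpint(x)
    IntT = Base.signed(Base.uinttype(typeof(x)))
    ix = reinterpret(IntT, x)
    return ifelse(ix < zero(IntT), ix ⊻ typemax(IntT), ix)
end

@inline function _isless(a::T, b::T) where {T<:Base.IEEEFloat}
    # NaN handling first: any NaN sorts after every non-NaN value.
    (isnan(a) || isnan(b)) && return !isnan(a)
    return _fpint(a) < _fpint(b)
end
```

Under this order `-0.0` sorts strictly before `0.0` and every NaN sorts after all other values, matching the documented `isless` semantics.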

Quick benchmarks (AMD A10-7860K) on sort, avoiding the specialized
algorithm:

a = rand(1000);
@btime sort($a, lt=(a,b)->isless(a,b));
    # before: 56.030 μs (1 allocation: 7.94 KiB)
    #  after: 40.853 μs (1 allocation: 7.94 KiB)
a = rand(1000000);
@btime sort($a, lt=(a,b)->isless(a,b));
    # before: 159.499 ms (2 allocations: 7.63 MiB)
    #  after: 120.536 ms (2 allocations: 7.63 MiB)
a = [rand((rand(), NaN)) for _ in 1:1000000];
@btime sort($a, lt=(a,b)->isless(a,b));
    # before: 111.925 ms (2 allocations: 7.63 MiB)
    #  after:  77.669 ms (2 allocations: 7.63 MiB)

@stev47 stev47 requested a review from Keno January 4, 2021 18:15
@stev47 stev47 added the performance Must go faster label Jan 4, 2021
@oscardssmith
Member

I really can't think of anything to say other than wow.

@ViralBShah
Member

@stev47 Can you provide the before/after benchmark times? Just for curious onlookers who may not check out the branch and run the script themselves.

@stev47 stev47 force-pushed the feature/fp-isless branch from e8eeba7 to 107668f Compare January 5, 2021 08:00
@stev47
Contributor Author

stev47 commented Jan 5, 2021

@stev47 Can you provide the before/after benchmark times? Just for curious onlookers who may not check out the branch and run the script themselves.

I've updated the benchmarks. The times got stripped out by accident, since `git commit` removes lines starting with `#` in the commit message.

@musm
Contributor

musm commented Jan 6, 2021

Is the only reason we are not using `<` to satisfy the `totalOrder` ordering? Simply piggybacking off `<`, while accounting for the ordering, gets performance close to this PR (near identical for arrays without NaNs in most cases). The disadvantage is that it's not as fast in cases with mixed NaNs, but the advantage is that it's simpler to implement and requires fewer lines of code.

@inline function isless(a::T, b::T) where T<:Union{Float16,Float32,Float64}
    (isnan(a) || isnan(b)) && return !isnan(a)
    return a < b
end

On my computer, timings for this change in the only case where they differ significantly (i.e. arrays containing mixed NaNs):

a = [rand((rand(), NaN)) for _ in 1:1000000];
@btime sort($a, lt=(a,b)->isless(a,b));

# 42% faster  (this PR)
# 31% faster  (using isless with < directly)

Otherwise the changes are within 0-3%.

I'm not necessarily advocating changing the PR to this, but just food for thought.

@stev47 stev47 force-pushed the feature/fp-isless branch from 107668f to 569ca03 Compare January 6, 2021 10:24
@stev47
Contributor Author

stev47 commented Jan 6, 2021

Is the only reason we are not using <, is to satisfy the ordering totalOrder ? Simply piggy backing off of <, but accounting for the ordering, results in close performance to this PR (near identical for arrays without NaNs in most cases).

NaNs are not the only difference: e.g. `isless(-0., 0.)` is `true`, while `-0. < 0.` is not. Non-canonical encodings are also treated differently.
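To make the signed-zero difference concrete, a quick check:

```julia
# isless distinguishes signed zeros; < does not (IEEE-754: -0.0 == 0.0)
isless(-0.0, 0.0)  # true
-0.0 < 0.0         # false
```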

@musm musm requested a review from StefanKarpinski January 6, 2021 21:39
@musm
Contributor

musm commented Jan 6, 2021

LGTM

@StefanKarpinski any chance you also want to take a look since you added these originally?

base/float.jl Outdated

# interpret as sign-magnitude integer
@inline function _fpint(x)
IntT = signed(uinttype(typeof(x)))
Contributor

I just noticed that #36526 defined inttype (just a note)

@musm
Contributor

musm commented Jan 10, 2021

Will merge in 48 hours sans objections

@vtjnash
Member

vtjnash commented Jan 10, 2021

Any particular knowledge of why LLVM doesn't already emit this for AMD64? I have a mild concern that this could be more expensive for other platforms, since moving values from fp to int registers could be slow (while this benchmark simply avoids ever having them in fp registers), and that this requires doing 64-bit operations without the benefit of the 64-bit double hardware.

@stev47
Contributor Author

stev47 commented Jan 10, 2021

Any particular knowledge of why LLVM doesn't already emit this for AMD64?

There shouldn't be anything special about amd64 here; LLVM just doesn't seem to generate efficient code for the current suboptimal C implementation of `fpislt` in `src/runtime_intrinsics.c:838`: the equivalence to the signed-integer interpretation is not obvious, and LLVM also didn't manage to simplify that NaN-check.

since moving values from fp to int registers could be slow

The current C implementation already does that using union access (technically undefined behaviour in C).

while this benchmark simply avoids ever having them in fp registers

I'm not entirely convinced this is true, since the NaN check always seems to be done in fp registers.
But even if you force `@noinline` on the comparator (thereby forcing the arguments to be passed to the function in float registers), there is a benefit:

a = rand(1000000);
@btime sort($a, lt=@noinline (a,b)->isless(a,b));
# before: 188.884 ms (2 allocations: 7.63 MiB)
# after: 162.796 ms (2 allocations: 7.63 MiB)

Alternatively check

isless($(Ref(0.))[], $(Ref(0.))[])
# before: 3.399 ns (0 allocations: 0 bytes)
# after: 2.671 ns (0 allocations: 0 bytes)

Generated code by @code_native isless(0., 0.), before:

	.text
	vmovq	%xmm0, %rax
	vmovq	%xmm1, %rcx
	testq	%rax, %rax
	sets	%dl
	setns	%sil
	cmpq	%rcx, %rax
	seta	%al
	setl	%cl
	andb	%dl, %al
	andb	%sil, %cl
	orb	%al, %cl
	vucomisd	%xmm1, %xmm0
	setnp	%dl
	andb	%cl, %dl
	vucomisd	%xmm1, %xmm1
	setp	%cl
	vucomisd	%xmm0, %xmm0
	setnp	%al
	andb	%cl, %al
	orb	%dl, %al
	retq

after:

	.text
	vucomisd	%xmm1, %xmm0
	jp	L56
	vmovq	%xmm0, %rax
	movabsq	$9223372036854775807, %rcx # imm = 0x7FFFFFFFFFFFFFFF
	movq	%rax, %rdx
	xorq	%rcx, %rdx
	testq	%rax, %rax
	cmovnsq	%rax, %rdx
	vmovq	%xmm1, %rax
	xorq	%rax, %rcx
	testq	%rax, %rax
	cmovnsq	%rax, %rcx
	cmpq	%rcx, %rdx
	setl	%al
	retq
L56:
	vucomisd	%xmm0, %xmm0
	setnp	%al
	retq

@JeffreySarnoff
Contributor

I share some uncertainty about the choice of benchmarked task.
Earlier, I ran a different benchmark (on an Intel Xeon E-2176M CPU).
The take-away: YMMV.

function test(avec, islessfn)
  result = 0
  for idx in 1:length(avec)-1
    result += islessfn(avec[idx], avec[idx+1])
  end
  return result
end

with

@inline function _fpint1(x::Float64)
   ix = reinterpret(Int64, x)
   return ifelse(ix < zero(Int64), ix ⊻ typemax(Int64), ix)
 end

@inline isless0(a::Float64, b::Float64) = Base.isless(a, b)

@inline function isless1(a::Float64, b::Float64)
   (isnan(a) || isnan(b)) && return !isnan(a)
   return _fpint1(a) < _fpint1(b)
end

@inline function isless2(a::Float64, b::Float64)
    (isnan(a) || isnan(b)) && return !isnan(a)
    signbit(a) && !signbit(b) && return true
    return a < b
end
a10 = rand(10); a100 = rand(100); a1000 = rand(1000); 
a10000 = rand(10000); a100000 = rand(100000);

julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10; isless_fn=isless0;);
  35.146 ns (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10; isless_fn=isless1;);
  30.754 ns (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10; isless_fn=isless2;);
  29.246 ns (0 allocations: 0 bytes)

julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100; isless_fn=isless0;);
  243.970 ns (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100; isless_fn=isless1;);
  146.761 ns (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100; isless_fn=isless2;);
  207.833 ns (0 allocations: 0 bytes)

julia> @btime test(a_vec, isless_fn) setup=(a_vec=a1000; isless_fn=isless0;);
  2.333 μs (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a1000; isless_fn=isless1;);
  1.920 μs (0 allocations: 0 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a1000; isless_fn=isless2;);
  1.490 μs (0 allocations: 0 bytes)

julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10000; isless_fn=isless0;);
  23.200 μs (1 allocation: 16 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10000; isless_fn=isless1;);
  12.800 μs (1 allocation: 16 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a10000; isless_fn=isless2;);
  19.100 μs (1 allocation: 16 bytes)

julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100000; isless_fn=isless0;);
  227.800 μs (1 allocation: 16 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100000; isless_fn=isless1;);
  191.800 μs (1 allocation: 16 bytes)
julia> @btime test(a_vec, isless_fn) setup=(a_vec=a100000; isless_fn=isless2;);
  148.700 μs (1 allocation: 16 bytes)
randsign(n) = rand((-1,1), n)
b10 = a10 .* randsign(10);
b100 = a100 .* randsign(100);
b1000 = a1000 .* randsign(1000);
b10000 = a10000 .* randsign(10000);
b100000 = a100000 .* randsign(100000);

julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10; isless_fn=isless0;);
  35.045 ns (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10; isless_fn=isless1;);
  26.479 ns (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10; isless_fn=isless2;);
  30.251 ns (0 allocations: 0 bytes)

julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100; isless_fn=isless0;);
  243.584 ns (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100; isless_fn=isless1;);
  207.833 ns (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100; isless_fn=isless2;);
  161.923 ns (0 allocations: 0 bytes)

julia> @btime test(b_vec, isless_fn) setup=(b_vec=b1000; isless_fn=isless0;);
  2.267 μs (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b1000; isless_fn=isless1;);
  1.290 μs (0 allocations: 0 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b1000; isless_fn=isless2;);
  1.920 μs (0 allocations: 0 bytes)

julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10000; isless_fn=isless0;);
  22.600 μs (1 allocation: 16 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10000; isless_fn=isless1;);
  19.100 μs (1 allocation: 16 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b10000; isless_fn=isless2;);
  14.800 μs (1 allocation: 16 bytes)

julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100000; isless_fn=isless0;);
  227.700 μs (1 allocation: 16 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100000; isless_fn=isless1;);
  129.400 μs (1 allocation: 16 bytes)
julia> @btime test(b_vec, isless_fn) setup=(b_vec=b100000; isless_fn=isless2;);
  191.700 μs (1 allocation: 16 bytes)

@JeffreySarnoff
Contributor

Here are their asm listings:

julia> @code_native(isless0(x,y))
        .text
        pushq   %rbp
        movq    %rsp, %rbp
        vmovq   %xmm0, %rax
        vmovq   %xmm1, %rcx
        testq   %rax, %rax
        sets    %dl
        setns   %r8b
        cmpq    %rcx, %rax
        seta    %al
        setl    %cl
        andb    %dl, %al
        andb    %r8b, %cl
        orb     %al, %cl
        vucomisd        %xmm1, %xmm0
        setnp   %dl
        andb    %cl, %dl
        vucomisd        %xmm1, %xmm1
        setp    %cl
        vucomisd        %xmm0, %xmm0
        setnp   %al
        andb    %cl, %al
        orb     %dl, %al
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)

julia> @code_native(isless1(x,y))
        .text
        vucomisd        %xmm1, %xmm0
        jp      L62
        pushq   %rbp
        movq    %rsp, %rbp
        vunpcklpd       %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[0]
        vpcmpeqd        %xmm1, %xmm1, %xmm1
        vpcmpgtq        %xmm1, %xmm0, %xmm1
        movabsq $.rodata.cst16, %rax
        vxorpd  (%rax), %xmm0, %xmm2
        vblendvpd       %xmm1, %xmm0, %xmm2, %xmm0
        vmovq   %xmm0, %rax
        vpextrq $1, %xmm0, %rcx
        cmpq    %rcx, %rax
        setl    %al
        popq    %rbp
        retq
L62:    vucomisd        %xmm0, %xmm0
        setnp   %al
        retq
        nopw    %cs:(%rax,%rax)

julia> @code_native(isless2(x,y))
        .text
        pushq   %rbp
        movq    %rsp, %rbp
        vucomisd        %xmm1, %xmm0
        jp      L43
        vmovq   %xmm0, %rax
        testq   %rax, %rax
        jns     L34
        vmovq   %xmm1, %rax
        testq   %rax, %rax
        js      L34
        movb    $1, %al
        popq    %rbp
        retq
L34:    vucomisd        %xmm0, %xmm1
        seta    %al
        popq    %rbp
        retq
L43:    vucomisd        %xmm0, %xmm0
        setnp   %al
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)

@vchuravy
Member

There shouldn't be anything special about amd64 here; LLVM just doesn't seem to generate efficient code for the current suboptimal C implementation of `fpislt` in `src/runtime_intrinsics.c:838`: the equivalence to the signed-integer interpretation is not obvious, and LLVM also didn't manage to simplify that NaN-check.

Note that we don't actually call that implementation but emit

julia/src/intrinsics.cpp

Lines 1168 to 1186 in c487dd0

case fpislt: {
*newtyp = jl_bool_type;
Type *it = INTT(t);
Value *xi = ctx.builder.CreateBitCast(x, it);
Value *yi = ctx.builder.CreateBitCast(y, it);
return ctx.builder.CreateOr(
ctx.builder.CreateAnd(
ctx.builder.CreateFCmpORD(x, x),
ctx.builder.CreateFCmpUNO(y, y)),
ctx.builder.CreateAnd(
ctx.builder.CreateFCmpORD(x, y),
ctx.builder.CreateOr(
ctx.builder.CreateAnd(
ctx.builder.CreateICmpSGE(xi, ConstantInt::get(it, 0)),
ctx.builder.CreateICmpSLT(xi, yi)),
ctx.builder.CreateAnd(
ctx.builder.CreateICmpSLT(xi, ConstantInt::get(it, 0)),
ctx.builder.CreateICmpUGT(xi, yi)))));
}
on fpislt.

See https://llvm.org/docs/LangRef.html#fcmp-instruction for the meaning of `ord` and `uno`.

@JeffreySarnoff
Contributor

I have an alternative.

@inline function ieee_isless(x::T, y::T) where {T<:Base.IEEEFloat}
    signbit(x-y) || ifelse(isnan(y), !isnan(x), false)
end
using BenchmarkTools
const BT=BenchmarkTools.DEFAULT_PARAMETERS;
BT.samples=80_000; BT.time_tolerance=0.0005;

@inline function base_isless(x::T, y::T) where {T<:Base.IEEEFloat}
    Base.isless(x, y)
end

@inline function contrib_isless(a::T, b::T) where {T<:Base.IEEEFloat}
    (isnan(a) || isnan(b)) && return !isnan(a)
    return contrib_fpint(a) < contrib_fpint(b)
end
# interpret as sign-magnitude integer
@inline function contrib_fpint(x)
    IntT = Base.signed(Base.uinttype(typeof(x)))
    ix = reinterpret(IntT, x)
    return ifelse(ix < zero(IntT), ix ⊻ typemax(IntT), ix)
end

randsign(n) = rand((-1,1), n)
r10 = rand(10) .* randsign(10);
r1_000 = rand(1_000) .* randsign(1_000);
r100_000 = rand(100_000) .* randsign(100_000);

relspeedup(x, y) = abs(x-y) / x
pctspeedup(x, y) = round(relspeedup(x, y) * 100, digits=1)

function test(avec, islessfn)
    result = 0
    for idx in 1:length(avec)-1
        result += islessfn(avec[idx], avec[idx+1])
    end
    return result
end

base_isless_10 = @belapsed test(avec, base_isless) setup=(avec=r10;);
contrib_isless_10 = @belapsed test(avec, contrib_isless) setup=(avec=r10;);
ieee_isless_10 = @belapsed test(avec, ieee_isless) setup=(avec=r10;);

base_isless_1000 = @belapsed test(avec, base_isless) setup=(avec=r1_000;);
contrib_isless_1000 = @belapsed test(avec, contrib_isless) setup=(avec=r1_000;);
ieee_isless_1000 = @belapsed test(avec, ieee_isless) setup=(avec=r1_000;);

base_isless_100000 = @belapsed test(avec, base_isless) setup=(avec=r100_000;);
contrib_isless_100000 = @belapsed test(avec, contrib_isless) setup=(avec=r100_000;);
ieee_isless_100000 = @belapsed test(avec, ieee_isless) setup=(avec=r100_000;);

pctspeedup_ieee2contrib_10 = pctspeedup(ieee_isless_10, contrib_isless_10 )
2.8
pctspeedup_ieee2contrib_1000 = pctspeedup(ieee_isless_1000, contrib_isless_1000)
0.0
pctspeedup_ieee2contrib_100000 = pctspeedup(ieee_isless_100000, contrib_isless_100000)
13.8

duplicating the sorting benchmarks

@btime sort($r10, lt=(a,b)->base_isless(a,b));
#  110.730 ns (1 allocation: 160 bytes)
@btime sort($r10, lt=(a,b)->contrib_isless(a,b));
#  82.510 ns (1 allocation: 160 bytes)
@btime sort($r10, lt=(a,b)->ieee_isless(a,b));
#  66.904 ns (1 allocation: 160 bytes)

@btime sort($r1_000, lt=(a,b)->base_isless(a,b));
#  30.400 μs (1 allocation: 7.94 KiB)
@btime sort($r1_000, lt=(a,b)->contrib_isless(a,b));
#  28.400 μs (1 allocation: 7.94 KiB)
@btime sort($r1_000, lt=(a,b)->ieee_isless(a,b));
#  21.200 μs (1 allocation: 7.94 KiB)

@btime sort($r100_000, lt=(a,b)->base_isless(a,b));
#  8.654 ms (2 allocations: 781.33 KiB)
@btime sort($r100_000, lt=(a,b)->contrib_isless(a,b));
#  6.828 ms (2 allocations: 781.33 KiB)
@btime sort($r100_000, lt=(a,b)->ieee_isless(a,b));
#  6.198 ms (2 allocations: 781.33 KiB)

@oscardssmith
Member

Have you tested to make sure this handles all the weird edge cases properly? Specifically negative zeros, NaNs and Infs (not doubting, just want to make sure).

@JeffreySarnoff
Contributor

Funny you should ask -- I thought so, then I retried the corner cases. Arrgh.

`signbit(NaN)` should be `false` for any normal NaN that Julia generates, because Julia only generates one NaN, and indeed `signbit(NaN) === false`. Or so it used to be.

I was unaware that we had introduced the use of a signed NaN to indicate that the NaN was produced by an arithmetic relationship of non-NaN values...

julia> reinterpret(UInt64,(Inf-Inf))
0xfff8000000000000

julia> reinterpret(UInt64,(Inf/Inf))
0xfff8000000000000

julia> reinterpret(UInt64,(0.0/0.0))
0xfff8000000000000

julia> reinterpret(UInt64,(NaN+NaN))
0x7ff8000000000000

julia> reinterpret(UInt64,(NaN/NaN))
0x7ff8000000000000

julia> reinterpret(UInt64,(NaN+0.0))
0x7ff8000000000000

julia> reinterpret(UInt64,(0.0+NaN))
0x7ff8000000000000
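This is exactly what breaks the `signbit(x - y)` shortcut in `ieee_isless` above: comparing a value with itself can produce a negatively-signed NaN from the subtraction (on x86-64 at least, per the `reinterpret` results), so the predicate stops being irreflexive:

```julia
ieee_isless(x, y) = signbit(x - y) || ifelse(isnan(y), !isnan(x), false)

# On hardware where Inf - Inf yields a NaN with the sign bit set,
# this wrongly reports a value as less than itself, while the correct
# answer, isless(Inf, Inf), is false.
ieee_isless(Inf, Inf)
```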

@KristofferC
Member

The current C implementation already does that using union access (technically undefined behaviour in C).

Do you mean type punning using unions? AFAIU, that's perfectly fine in C.

@JeffreySarnoff
Contributor

JeffreySarnoff commented Jan 11, 2021

Some context: colleagues and I dove deeply into this quite a while ago; we had the benefit of apropos counsel. We concluded something similar to the following inexpert SO quotes.

The behavior of type punning with union changed from C89 to C99. The behavior in C99 is the same as C11. ... type punning is allowed in C99 / C11. An unspecified value that could be a trap is read when the union members are of different size.

Finally, one of the changes from C90 to C99 was to remove any restriction on accessing one member of a union when the last store was to a different one. The rationale was that the behaviour would then depend on the representations of the values. Since this point is often misunderstood, it might well be worth making it clear in the Standard.

(from Stack Overflow)

@stev47
Contributor Author

stev47 commented Jan 11, 2021

Note that we don't actually call that implementation but emit

Thanks, there the int-cast is actually explicit. Why is it duplicated, and can we remove the C implementation then?

Do you mean type punning using unions? AFAIU, that's perfectly fine in C.

Yes, though I should correct myself: accessing a non-initialized union member is not "undefined behaviour" but at most "unspecified behaviour".
Not sure if we should have the discussion here, but reading beyond the most popular Stack Overflow answer, the standard does not seem to be clear about it at the very least. Annex J of the standard also lists "the values of bytes that correspond to union members other than the one last stored into" as unspecified behaviour, though strictly speaking that too is a non-normative statement.

@ViralBShah
Member

Is this ready for a merge?

@stev47
Contributor Author

stev47 commented Jan 22, 2021

ready from my end

@JeffreySarnoff
Contributor

no objection

@ViralBShah
Member

@StefanKarpinski To do a final review and push the button perhaps?

@musm
Contributor

musm commented Jan 22, 2021

There shouldn't be anything special about amd64 here; LLVM just doesn't seem to generate efficient code for the current suboptimal C implementation of `fpislt` in `src/runtime_intrinsics.c:838`: the equivalence to the signed-integer interpretation is not obvious, and LLVM also didn't manage to simplify that NaN-check.

Note that we don't actually call that implementation but emit

julia/src/intrinsics.cpp

Lines 1168 to 1186 in c487dd0

(the `fpislt` case, quoted in full above)

on `fpislt`.
See https://llvm.org/docs/LangRef.html#fcmp-instruction for the meaning behind ord, and uno.

If I understand this correctly, it looks like our `fpislt` intrinsic implementation is doing too much work, and we could optimize that code instead of replacing it as this PR does. Is there any benefit to that? `Base.isless` is the only place `fpislt` is used, so it looks like we could either optimize the intrinsic or replace it with Julia code and delete the intrinsic.

@stev47
Contributor Author

stev47 commented Jan 23, 2021

I prefer a Julia implementation, but would be fine either way.
The intrinsic `fpislt` still has value, I believe, since it generates branch-free code if someone really wants that. It could be improved in a separate pull request.
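If branch-free code is the goal, the same trick can also be expressed in plain Julia by replacing the short-circuit with `ifelse` (an illustrative sketch, not part of this PR; `isless_bf` and `_fpint_bf` are made-up names):

```julia
@inline function _fpint_bf(x::Float64)
    ix = reinterpret(Int64, x)
    return ifelse(ix < 0, ix ⊻ typemax(Int64), ix)
end

# Both arms are evaluated unconditionally and `ifelse` selects the
# result, so no branch is required for the NaN case.
@inline function isless_bf(a::Float64, b::Float64)
    anan = isnan(a)
    bnan = isnan(b)
    return ifelse(anan | bnan, !anan, _fpint_bf(a) < _fpint_bf(b))
end
```

Whether this actually compiles branch-free depends on the target and the surrounding code, so it would need the same kind of benchmarking as above.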

@musm
Contributor

musm commented Apr 23, 2021

LGTM. Are there any objections / negatives to replacing the runtime intrinsics ? @vtjnash Is this OK with you?
Otherwise I think we should go ahead and merge this performance improvement.

@vtjnash
Member

vtjnash commented Apr 24, 2021

We should delete the unused code, but yeah, I have no reason to keep the old version of this, since this is apparently faster.

stev47 and others added 2 commits April 24, 2021 13:38
@musm musm force-pushed the feature/fp-isless branch from 569ca03 to a5822e7 Compare April 24, 2021 17:51
@musm
Contributor

musm commented Apr 24, 2021

We should delete the unused code, but yeah, I have no reason to keep the old version of this, since this is apparently faster.

Done. Do we also need to get rid of this line: https://github.com/JuliaLang/julia/blob/master/base/compiler/tfuncs.jl#L196 ?

@vtjnash
Member

vtjnash commented Apr 24, 2021

Yes, I'd suggest just searching for fpislt and deleting those:

$ git grep fpislt
.clang-format:  - fpislt_n
base/compiler/tfuncs.jl:add_tfunc(fpislt, 2, 2, cmp_tfunc, 1)
base/float.jl:isless( x::Float16, y::Float16) = fpislt(x, y)
base/float.jl:isless( x::Float32, y::Float32) = fpislt(x, y)
base/float.jl:isless( x::Float64, y::Float64) = fpislt(x, y)
src/intrinsics.cpp:    float_func[fpislt] = true;
src/intrinsics.cpp:    case fpislt: {
src/intrinsics.h:    ADD_I(fpislt, 2) \
src/julia_internal.h:JL_DLLEXPORT jl_value_t *jl_fpislt(jl_value_t *a, jl_value_t *b);
src/runtime_intrinsics.c:#define fpislt_n(c_type, nbits)                                         \
src/runtime_intrinsics.c:    static inline int fpislt##nbits(c_type a, c_type b) JL_NOTSAFEPOINT \
src/runtime_intrinsics.c:fpislt_n(float, 32)
src/runtime_intrinsics.c:fpislt_n(double, 64)
src/runtime_intrinsics.c:#define fpislt(a, b) \
src/runtime_intrinsics.c:    sizeof(a) == sizeof(float) ? fpislt32(a, b) : fpislt64(a, b)
src/runtime_intrinsics.c:bool_fintrinsic(fpislt,fpislt)

@musm
Contributor

musm commented Apr 24, 2021

Merging in a day sans objections

@musm
Contributor

musm commented Apr 25, 2021

Many thanks @stev47, sorry this took so long.

@musm musm merged commit 79920db into JuliaLang:master Apr 25, 2021
ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021
* implement faster floating-point `isless`

* Remove old intrinsic fpslt code

Co-authored-by: Mustafa Mohamad <[email protected]>
jarlebring pushed a commit to jarlebring/julia that referenced this pull request May 4, 2021
* implement faster floating-point `isless`

* Remove old intrinsic fpslt code

Co-authored-by: Mustafa Mohamad <[email protected]>
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request May 9, 2021
johanmon pushed a commit to johanmon/julia that referenced this pull request Jul 5, 2021
Labels
performance Must go faster