
inline exp functions #38726

Closed
wants to merge 2 commits
Conversation

oscardssmith
Member

Now that the bulk of my changes to exp have been merged, I want to know whether we should consider adding @inline to the code. The upside is faster performance and possible vectorization (although vectorizing currently requires @simd ivdep for Float64). The downsides are probably increased compile times and possible performance hits in some situations due to increased pressure on the instruction cache. This @inline was removed from the original PR because it added questions to a PR that was already hard to merge due to its scope, but I think it is probably worth it.
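
For a sense of what the vectorization win would look like, here is a minimal sketch (hypothetical helper, not part of this PR) of the kind of loop that could vectorize if exp were inlined:

# Sketch only: `ivdep` asserts the iterations are independent so LLVM may
# vectorize the loop body; without inlining, the call to `exp` blocks
# vectorization entirely.
function vexp!(out::Vector{Float64}, x::Vector{Float64})
    @inbounds @simd ivdep for i in eachindex(x)
        out[i] = exp(x[i])
    end
    return out
end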

@oscardssmith
Member Author

If anyone has exp-heavy real-world benchmarks, I would love to see results before and after this change. Synthetic benchmarking isn't especially useful here, as it masks the potential downsides of the PR.

@oscardssmith
Member Author

can this be tagged performance, math, and latency?

@DilumAluthge added the compiler:latency (Compiler latency), maths (Mathematical functions), and performance (Must go faster) labels on Dec 6, 2020
@KristofferC
Member

I personally don't think these should be inlined by default since I think a very small number of people will use @simd ivdep. But it could be structured something like

@inline function exp_inline(x)
    # exp body
end

@noinline exp(x) = exp_inline(x)

and then people can at least opt in to the inline version (with the caveat that maybe it is considered internals 🤷 ).
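
A caller could then opt in along these lines (sketch; exp_inline as named above, with the further caveat that it wouldn't be public API):

# Opt-in use of the hypothetical inlined variant inside a SIMD loop:
function vexp_opt_in!(out, x)
    @inbounds @simd ivdep for i in eachindex(x)
        out[i] = exp_inline(x[i])
    end
    return out
end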

@oscardssmith
Member Author

Float32 doesn't require ivdep. And Float64 shouldn't. The only reason it currently does is that LLVM for some reason isn't able to tell that the table of constants doesn't alias the output.

@musm
Contributor

musm commented Dec 6, 2020

I'm also of the opinion that these shouldn't be inlined by default. What's the criterion for a function to be inlined by default? It seems dangerous without having the compiler detect and do this automagically in cases where it makes sense and doesn't blow up the code; otherwise we can run into issues like #24117 again.

@oscardssmith
Member Author

Does @nanosoldier still work? If so, that might be useful to see what the effects are.

@vchuravy
Member

vchuravy commented Dec 6, 2020

@nanosoldier runbenchmarks(ALL, vs=":master")

@oscardssmith
Member Author

How long does @nanosoldier usually take? 8 hours seems like a lot.

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@oscardssmith
Member Author

oscardssmith commented Dec 7, 2020

That is far from enlightening. Hyperbolic trig functions look like they might have slightly regressed, but it mainly looks like noise given how many completely unrelated things changed.

@stevengj
Member

stevengj commented Dec 7, 2020

For an exp-heavy workload, maybe something from a BEM solver? Panel integrations for BEM matrix assembly for something like 3d Helmholtz equation probably calls exp a lot (for the Green's function ~ exp(iωr)/r); cc @krcools in case they have real-world data.
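
As a stand-in until real data shows up, a toy kernel along these lines (hypothetical, not actual BEM code) would at least be exp-dominated:

# Toy exp-heavy kernel: 3D Helmholtz Green's function exp(im*ω*r) / (4π*r)
# evaluated over all pairs of points.
function greens_matrix(points::Vector{NTuple{3,Float64}}, ω::Float64)
    n = length(points)
    G = zeros(ComplexF64, n, n)
    for j in 1:n, i in 1:n
        i == j && continue
        r = sqrt(sum(abs2, points[i] .- points[j]))
        G[i, j] = exp(im * ω * r) / (4π * r)
    end
    return G
end

# e.g. pts = [Tuple(randn(3)) for _ in 1:500]; greens_matrix(pts, 2.0)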

@oscardssmith
Member Author

Another thing to benchmark is a small NN with tanh activation layers, as tanh performance is largely bound by exp.
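
Something like this would do as a rough proxy (hypothetical sizes, plain matrices rather than an actual Flux model):

# Tiny two-layer "network" with tanh activations as a tanh-heavy workload.
tanh_mlp(x, W1, W2) = tanh.(W2 * tanh.(W1 * x))

# e.g. W1 = randn(64, 32); W2 = randn(8, 64); x = randn(32); tanh_mlp(x, W1, W2)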

@PallHaraldsson
Contributor

tanh in Julia doesn't use exp (it does, however, use expm1, so maybe inline that?).

In general, for at least sin, tanh and more, you may have a fast path, even one as simple as the identity function. Does it help to split some functions in two, so that the fast path can get inlined?

[I checked for tanh and see that it has a Taylor expansion that seems preferable to exp, but I guess something more advanced than Taylor is used anyway (by expm1); should it be?]

@oscardssmith
Member Author

As of https://github.com/JuliaLang/julia/pull/38382/files, tanh uses exp for big numbers and a minimax polynomial for small numbers. This helps because our current expm1 is pretty slow, so it's worth avoiding it where that's easy.
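
In rough, simplified form it looks like this (illustrative threshold and a truncated Taylor polynomial, not the actual minimax fit from that PR):

function tanh_sketch(x::Float64)
    ax = abs(x)
    if ax > 1.0                     # "big" arguments: go through exp
        e = exp(-2ax)
        t = (1 - e) / (1 + e)
    else                            # "small" arguments: polynomial in x
        x2 = ax * ax
        t = ax * (1 + x2 * (-1/3 + x2 * (2/15)))   # x - x^3/3 + 2x^5/15
    end
    return copysign(t, x)
end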

@PallHaraldsson
Contributor

PallHaraldsson commented Jan 27, 2021

[Off-topic (for exp, helpful for Julia's tanh?)]

K-TANH: EFFICIENT TANH FOR DEEP LEARNING
https://arxiv.org/abs/1909.07729

K-TanH consists of parameterized low-precision integer operations, such as, shift and add/subtract (no floating point operation needed) where parameters are stored in very small look-up tables that can fit in CPU registers. K-TanH can work on various numerical formats, such as, Float32 and BFloat16. High quality approximations to other activation functions, e.g., Sigmoid, Swish and GELU, can be derived from K-TanH. Our AVX512 implementation of K-TanH demonstrates >5× speed up over Intel SVML
[..]
E.g., to fit each table in a 512-bit register for Intel AVX512 SIMD instructions, we use 5-bit indexing (2 LSBs of exponent and 3 MSBs of mantissa) to create 32 entries (32 intervals of the input magnitude), each holding up to 16 bit integer values. Our parameter values are 8-bit only, so we can create 64 intervals to achieve more accurate approximation.
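
For concreteness, the 5-bit indexing they describe amounts to something like this (my sketch, not code from the paper):

# Build the 5-bit table index from the 2 low bits of a Float32's exponent
# and the 3 high bits of its mantissa (purely illustrative).
function ktanh_index(x::Float32)
    bits = reinterpret(UInt32, x)
    exp2lsb = (bits >> 23) & 0x3          # 2 LSBs of the 8-bit exponent
    man3msb = (bits >> 20) & 0x7          # 3 MSBs of the 23-bit mantissa
    return Int((exp2lsb << 3) | man3msb)  # index in 0:31
end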

@oscardssmith
Member Author

I don't think this is especially useful. This paper is about efficient approximations with ~1% error, while we are targeting 1.5 ULPs of precision, which is ~10^-7 relative error for Float32 and ~10^-16 for Float64. It probably would be a good idea to make a library with a group of "roughly right shape" functions for use in deep learning.

@PallHaraldsson
Contributor

[Not about exp, only (deep-learning) Tanh]

For some reason Float64 is not mentioned in that paper, while Float32 is (but note the error they show is for BFloat16):
"K-TanH is compatible with various input formats. However, we focus on optimizing BFloat16 inputs only."

Yes, it's an approximation scheme that for sure belongs in a non-Base library:
"is superior to existing competitive approximation schemes while achieving state-of-the-art results on a challenging DL workload"

I'm not sure if the error can then be improved (I recall Newton-Raphson doubling the number of correct bits, but I'm getting rusty and maybe it doesn't apply here?).

At least I found the paper intriguing. I also commented as a warning, in case you benchmark assuming Julia's tanh (and thus exp) is what actually gets called, when Julia's tanh may be substituted by some library (very likely, at least on GPUs?). I'm thinking of this comment: "Another thing to benchmark is a small NN with tanh activation layers".

@oscardssmith
Member Author

Pushing a new version that makes two separate versions of the exp functions: the regular ones, which are not inlined, and unsafe_exp, unsafe_exp2, and unsafe_exp10, which only work for Float32 and Float64, are inlined, and skip the bounds checks that return Inf or 0.0.
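
In sketch form the split looks like this (illustration of the calling structure only; the stand-in core below just forwards to the existing exp so the snippet runs, whereas the real unsafe_exp is the inlined polynomial/table kernel without the range checks):

@inline unsafe_exp_sketch(x::Float64) = Base.exp(x)    # stand-in for the inlined core

@noinline function exp_checked(x::Float64)             # stand-in for the public exp
    x > 710.0  && return Inf                            # conservative overflow bound
    x < -746.0 && return 0.0                            # conservative underflow bound
    return unsafe_exp_sketch(x)
end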

@KristofferC
Member

Are these really unsafe, or are they more akin to the fastmath versions?

@oscardssmith
Member Author

Good point. They are more fastmathy. The two differences are the lack of overflow checks (i.e. they return garbage if the result should be Inf) and automatic inlining.

@KristofferC
Member

Okay, then I think maybe they should not be called unsafe because that tends to indicate things like memory corruption / undefined behavior, etc on erroneous use.

@vtjnash
Member

vtjnash commented Apr 6, 2021

I don't feel I can fully comment on the code rearrangement here. I would recommend taking the approach KristofferC mentioned, with exp(x...) = inline_exp(x...), until we get the syntax feature to write the callsite as @inline exp(...) (assuming we need this now), and calling the internal version fastmath_exp(x...) instead of unsafe (or small_exp, or ninf, per https://llvm.org/docs/LangRef.html#fast-math-flags?).

@oscardssmith
Member Author

I think this PR is basically subsumed by the PR I wrote with actual fastmath versions of the exp functions.

@oscardssmith deleted the patch-2 branch on December 28, 2021 at 22:04