
inline exp functions #38726

Closed
wants to merge 2 commits
Conversation

oscardssmith
Member

Now that the bulk of my changes to exp have been merged, I want to know whether we should consider adding @inline to the code. The upside is faster performance and possible vectorization (although vectorizing currently requires @simd ivdep for Float64). The downsides are probably increased compile times and possible performance hits in some situations due to increased pressure on the instruction cache. This @inline was removed from the original PR because it added questions to a PR that was already hard to merge due to its scope, but I think it is probably worth it.
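
For a sense of what the vectorization win would look like, here is a minimal sketch (hypothetical helper, not part of this PR) of the kind of loop that could vectorize if exp were inlined:

# Sketch only: `ivdep` asserts the iterations are independent so LLVM may
# vectorize the loop body; without inlining, the call to `exp` blocks
# vectorization entirely.
function vexp!(out::Vector{Float64}, x::Vector{Float64})
    @inbounds @simd ivdep for i in eachindex(x)
        out[i] = exp(x[i])
    end
    return out
end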

@oscardssmith
Member Author

If anyone has exp-heavy real-world benchmarks, I would love to see results before and after this change. Synthetic benchmarking isn't especially useful here, as it masks the potential downsides of the PR.

@oscardssmith
Member Author

can this be tagged performance, math, and latency?

@DilumAluthge added the compiler:latency (Compiler latency), maths (Mathematical functions), and performance (Must go faster) labels on Dec 6, 2020
@KristofferC
Member

I personally don't think these should be inlined by default since I think a very small number of people will use @simd ivdep. But it could be structured something like

@inline function exp_inline(x)
    # exp body
end

@noinline exp(x) = exp_inline(x)

and then people can at least opt in to the inline version (with the caveat that maybe it is considered internals 🤷 ).
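
A caller could then opt in along these lines (sketch; exp_inline as named above, with the further caveat that it wouldn't be public API):

# Opt-in use of the hypothetical inlined variant inside a SIMD loop:
function vexp_opt_in!(out, x)
    @inbounds @simd ivdep for i in eachindex(x)
        out[i] = exp_inline(x[i])
    end
    return out
end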

@oscardssmith
Member Author

Float32 doesn't require ivdep. And Float64 shouldn't. The only reason it currently does is that LLVM for some reason isn't able to tell that the table of constants doesn't alias the output.

@musm
Contributor

musm commented Dec 6, 2020

I'm also of the opinion that these shouldn't be inlined by default. What's the criterion for a function to be inlined by default? It seems dangerous without having the compiler detect and do this automagically in cases where it makes sense and doesn't blow up the code; otherwise we can run into issues like #24117 again.

@oscardssmith
Member Author

Does @nanosoldier still work? If so, that might be useful to see what the effects are.

@vchuravy
Member

vchuravy commented Dec 6, 2020

@nanosoldier runbenchmarks(ALL, vs=":master")

@oscardssmith
Member Author

How long does @nanosoldier usually take? 8 hours seems like a lot.

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@oscardssmith
Member Author

oscardssmith commented Dec 7, 2020

That is far from enlightening. Hyperbolic trig functions look like they might have slightly regressed, but it mainly looks like noise given how many completely unrelated things changed.

@stevengj
Member

stevengj commented Dec 7, 2020

For an exp-heavy workload, maybe something from a BEM solver? Panel integrations for BEM matrix assembly for something like 3d Helmholtz equation probably calls exp a lot (for the Green's function ~ exp(iωr)/r); cc @krcools in case they have real-world data.
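
As a stand-in until real data shows up, a toy kernel along these lines (hypothetical, not actual BEM code) would at least be exp-dominated:

# Toy exp-heavy kernel: 3D Helmholtz Green's function exp(im*ω*r) / (4π*r)
# evaluated over all pairs of points.
function greens_matrix(points::Vector{NTuple{3,Float64}}, ω::Float64)
    n = length(points)
    G = zeros(ComplexF64, n, n)
    for j in 1:n, i in 1:n
        i == j && continue
        r = sqrt(sum(abs2, points[i] .- points[j]))
        G[i, j] = exp(im * ω * r) / (4π * r)
    end
    return G
end

# e.g. pts = [Tuple(randn(3)) for _ in 1:500]; greens_matrix(pts, 2.0)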

@oscardssmith
Member Author

Another thing to benchmark is a small NN with tanh activation layers, as tanh performance is largely bound by exp.
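
Something like this would do as a rough proxy (hypothetical sizes, plain matrices rather than an actual Flux model):

# Tiny two-layer "network" with tanh activations as a tanh-heavy workload.
tanh_mlp(x, W1, W2) = tanh.(W2 * tanh.(W1 * x))

# e.g. W1 = randn(64, 32); W2 = randn(8, 64); x = randn(32); tanh_mlp(x, W1, W2)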

@PallHaraldsson
Contributor

tanh in Julia doesn't use exp (it does, however, use expm1, so maybe inline that?).

In general, for at least sin, tanh and more, you may have a fast path, even one as simple as the identity function. Does it help to split some functions in two, so that the fast path can get inlined?

[I checked for tanh and see that it has a Taylor expansion that seems preferable to exp, but I guess something more advanced than Taylor is used anyway (by expm1); should it be?]

@oscardssmith
Member Author

As of https://github.com/JuliaLang/julia/pull/38382/files, tanh uses exp for big numbers and a minimax polynomial for small numbers. This helps because our current expm1 is pretty slow, so it's worth avoiding it where that's easy.
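
In rough, simplified form it looks like this (illustrative threshold and a truncated Taylor polynomial, not the actual minimax fit from that PR):

function tanh_sketch(x::Float64)
    ax = abs(x)
    if ax > 1.0                     # "big" arguments: go through exp
        e = exp(-2ax)
        t = (1 - e) / (1 + e)
    else                            # "small" arguments: polynomial in x
        x2 = ax * ax
        t = ax * (1 + x2 * (-1/3 + x2 * (2/15)))   # x - x^3/3 + 2x^5/15
    end
    return copysign(t, x)
end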

@PallHaraldsson
Contributor

PallHaraldsson commented Jan 27, 2021

[Off-topic (for exp, helpful for Julia's tanh?)]

K-TANH: EFFICIENT TANH FOR DEEP LEARNING
https://arxiv.org/abs/1909.07729

K-TanH consists of parameterized low-precision integer operations, such as, shift and add/subtract (no floating point operation needed) where parameters are stored in very small look-up tables that can fit in CPU registers. K-TanH can work on various numerical formats, such as, Float32 and BFloat16. High quality approximations to other activation functions, e.g., Sigmoid, Swish and GELU, can be derived from K-TanH. Our AVX512 implementation of K-TanH demonstrates >5× speed up over Intel SVML
[..]
E.g., to fit each table in a 512-bit register for Intel AVX512 SIMD instructions, we use 5-bit indexing (2 LSBs of exponent and 3 MSBs of mantissa) to create 32 entries (32 intervals of the input magnitude), each holding up to 16 bit integer values. Our parameter values are 8-bit only, so we can create 64 intervals to achieve more accurate approximation.
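
For concreteness, the 5-bit indexing they describe amounts to something like this (my sketch, not code from the paper):

# Build the 5-bit table index from the 2 low bits of a Float32's exponent
# and the 3 high bits of its mantissa (purely illustrative).
function ktanh_index(x::Float32)
    bits = reinterpret(UInt32, x)
    exp2lsb = (bits >> 23) & 0x3          # 2 LSBs of the 8-bit exponent
    man3msb = (bits >> 20) & 0x7          # 3 MSBs of the 23-bit mantissa
    return Int((exp2lsb << 3) | man3msb)  # index in 0:31
end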

@oscardssmith
Member Author

I don't think this is especially useful. This paper is about efficient approximations with ~1% error, while we are targeting 1.5 ULPs of precision, which is ~10^-7 relative error for Float32 and ~10^-16 for Float64. It probably would be a good idea to make a library with a group of "roughly right shape" functions for use in deep learning.

@PallHaraldsson
Contributor

[Not about exp, only (deep-learning) Tanh]

For some reason Float64 is not mentioned in that paper, while Float32 is (but note the error they show is for BFloat16):
"K-TanH is compatible with various input formats. However, we focus on optimizing BFloat16 inputs only."

Yes, it's an approximation scheme that for sure belongs in a non-Base library:
"is superior to existing competitive approximation schemes while achieving state-of-the-art results on a challenging DL workload"

I'm not sure if the error can then be improved (I recall Newton-Raphson doubling the number of correct bits, but I'm getting rusty and maybe it doesn't apply here?).

At least I found the paper intriguing. I also commented as a warning, in case you benchmark assuming Julia's tanh (and thus exp) is what actually gets called, when Julia's tanh may be substituted by some library (very likely, at least on GPUs?). I'm thinking of this comment: "Another thing to benchmark is a small NN with tanh activation layers".

@oscardssmith
Member Author

Pushing a new version that makes two separate versions of the exp functions: the regular ones, which are not inlined, and unsafe_exp, unsafe_exp2, and unsafe_exp10, which only work for Float32 and Float64, are inlined, and skip the bounds checks that return Inf or 0.0.
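
In sketch form the split looks like this (illustration of the calling structure only; the stand-in core below just forwards to the existing exp so the snippet runs, whereas the real unsafe_exp is the inlined polynomial/table kernel without the range checks):

@inline unsafe_exp_sketch(x::Float64) = Base.exp(x)    # stand-in for the inlined core

@noinline function exp_checked(x::Float64)             # stand-in for the public exp
    x > 710.0  && return Inf                            # conservative overflow bound
    x < -746.0 && return 0.0                            # conservative underflow bound
    return unsafe_exp_sketch(x)
end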

@KristofferC
Member

Are these really unsafe, or are they more akin to the fastmath versions?

@oscardssmith
Member Author

Good point. They are more fastmathy. The two differences are the lack of overflow checks (i.e. they return garbage if the result should be Inf) and automatic inlining.

@KristofferC
Member

Okay, then I think maybe they should not be called unsafe because that tends to indicate things like memory corruption / undefined behavior, etc on erroneous use.

@vtjnash
Member

vtjnash commented Apr 6, 2021

I don't feel I can fully comment on the code rearrangement here. I would recommend taking the approach KristofferC mentioned, with exp(x...) = inline_exp(x...), until we get the syntax feature to write the callsite as @inline exp(...) (assuming we need this now), and calling the internal version fastmath_exp(x...) instead of unsafe (or small_exp, or ninf, per https://llvm.org/docs/LangRef.html#fast-math-flags?).

@oscardssmith
Member Author

I think this PR is basically subsumed by the PR I wrote with actual fastmath versions of the exp functions.

@oscardssmith deleted the patch-2 branch on December 28, 2021 at 22:04