Faster softmax? #459

Closed
mcabbott wants to merge 1 commit

Conversation

@mcabbott (Member) commented Jan 8, 2023

This defines a fast_softmax which uses a low-accuracy fast_exp. It's about 5x faster on CPU.
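
For illustration, a minimal sketch of the idea, assuming a Schraudolph-style bit-trick exponential; the names fast_exp and fast_softmax come from this PR, but the bodies below are my own approximation and not the PR's code (which may be more accurate):

```julia
# Hypothetical sketch, not the code in this PR: a Schraudolph-style bit-trick
# approximation of exp for Float32 (a few percent relative error), plugged into
# a plain softmax without the finiteness guard.
function fast_exp(x::Float32)
    x = clamp(x, -87f0, 88f0)  # keep the Int32 arithmetic below in range
    # round(x * 2^23 / log(2)) plus an adjusted exponent bias, reinterpreted as Float32
    reinterpret(Float32, round(Int32, 12102203f0 * x) + 1064866805)
end

function fast_softmax(x::AbstractArray{Float32}; dims = 1)
    max_ = maximum(x; dims = dims)   # subtract the maximum for stability
    y = fast_exp.(x .- max_)         # low-accuracy exponential
    y ./ sum(y; dims = dims)
end
```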

On a GPU, the low-accuracy exp isn't faster at all. For small arrays, fast_softmax is still faster, because it skips the all(isfinite, max_) check and thus avoids a synchronisation. So FluxML/NNlibCUDA.jl#63 should capture all of that benefit.
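
For context, a tiny sketch (assuming CUDA.jl) of the synchronisation point being avoided; only the all(isfinite, max_) check is quoted from above, the rest is illustrative:

```julia
using CUDA  # illustrative only

x = CUDA.randn(Float32, 128, 128)
max_ = maximum(x; dims = 1)   # kernel launch; the host does not wait
all(isfinite, max_)           # reduces to a host Bool, so the host must block
                              # until the GPU work finishes (a synchronisation)
```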

The alternative on CPU is to write an Array specialisation using LoopVectorization. That's not as quick as this fast_exp (about 2x slower for me), but it gives several more digits of precision. This fast_exp is roughly Float16 precision; do we want that?
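
For comparison, one way such an Array specialisation might look (my sketch, not code from this PR), assuming LoopVectorization's @turbo for the SIMD exp:

```julia
# Hypothetical sketch: a column-wise (dims = 1) softmax for Float32 matrices,
# vectorising the exp calls with LoopVectorization at close to full Float32 accuracy.
using LoopVectorization

function turbo_softmax!(out::Matrix{Float32}, x::Matrix{Float32})
    for j in axes(x, 2)
        m = -Inf32
        @turbo for i in axes(x, 1)        # column maximum
            m = max(m, x[i, j])
        end
        s = 0f0
        @turbo for i in axes(x, 1)        # vectorised exp and running sum
            e = exp(x[i, j] - m)
            out[i, j] = e
            s += e
        end
        invs = 1 / s
        @turbo for i in axes(x, 1)        # normalise the column
            out[i, j] *= invs
        end
    end
    return out
end

turbo_softmax(x) = turbo_softmax!(similar(x), x)
```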

@ToucheSir (Member)
It looks like there is a whole body of literature on fast softmax approximations. I only read through https://arxiv.org/abs/2111.10770v1, but it has a nice list of prior art. Existing CPU-optimized softmax implementations such as oneDNN's may also be of interest.

@mcabbott (Member, Author) commented Jan 8, 2023

I had not looked, but I'm not surprised there's a literature by now! IIRC we dropped the NVIDIA one because it was slower than NNlib's.

The immediate goal, though, is to make this part small compared to the matmul & permutations. Going by my timings here, that takes the softmax from roughly 50% of the time down to 10%.

@mcabbott closed this Apr 30, 2023