Faster softmax? #459

Closed
mcabbott wants to merge 1 commit

Conversation

@mcabbott (Member) commented Jan 8, 2023

This defines a fast_softmax which uses a low-accuracy fast_exp. It's about 5x faster on CPU.
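
For illustration, a minimal sketch of the idea, assuming a Schraudolph-style bit-trick exponential; the names fast_exp and fast_softmax come from this PR, but the bodies below are my own approximation and not the PR's code (which may be more accurate):

```julia
# Hypothetical sketch, not the code in this PR: a Schraudolph-style bit-trick
# approximation of exp for Float32 (a few percent relative error), plugged into
# a plain softmax without the finiteness guard.
function fast_exp(x::Float32)
    x = clamp(x, -87f0, 88f0)  # keep the Int32 arithmetic below in range
    # round(x * 2^23 / log(2)) plus an adjusted exponent bias, reinterpreted as Float32
    reinterpret(Float32, round(Int32, 12102203f0 * x) + 1064866805)
end

function fast_softmax(x::AbstractArray{Float32}; dims = 1)
    max_ = maximum(x; dims = dims)   # subtract the maximum for stability
    y = fast_exp.(x .- max_)         # low-accuracy exponential
    y ./ sum(y; dims = dims)
end
```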

On a GPU, the low-accuracy exp isn't faster at all. For small arrays, fast_softmax is still faster, because it skips the all(isfinite, max_) check and thus avoids a synchronisation. So FluxML/NNlibCUDA.jl#63 should capture all of that benefit.
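
For context, a tiny sketch (assuming CUDA.jl) of the synchronisation point being avoided; only the all(isfinite, max_) check is quoted from above, the rest is illustrative:

```julia
using CUDA  # illustrative only

x = CUDA.randn(Float32, 128, 128)
max_ = maximum(x; dims = 1)   # kernel launch; the host does not wait
all(isfinite, max_)           # reduces to a host Bool, so the host must block
                              # until the GPU work finishes (a synchronisation)
```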

The alternative on CPU is to write an Array specialisation using LoopVectorization. That's not as quick as this fast_exp (about 2x slower for me), but it gives several more digits of precision. This fast_exp is roughly Float16 precision; do we want that?
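
For comparison, one way such an Array specialisation might look (my sketch, not code from this PR), assuming LoopVectorization's @turbo for the SIMD exp:

```julia
# Hypothetical sketch: a column-wise (dims = 1) softmax for Float32 matrices,
# vectorising the exp calls with LoopVectorization at close to full Float32 accuracy.
using LoopVectorization

function turbo_softmax!(out::Matrix{Float32}, x::Matrix{Float32})
    for j in axes(x, 2)
        m = -Inf32
        @turbo for i in axes(x, 1)        # column maximum
            m = max(m, x[i, j])
        end
        s = 0f0
        @turbo for i in axes(x, 1)        # vectorised exp and running sum
            e = exp(x[i, j] - m)
            out[i, j] = e
            s += e
        end
        invs = 1 / s
        @turbo for i in axes(x, 1)        # normalise the column
            out[i, j] *= invs
        end
    end
    return out
end

turbo_softmax(x) = turbo_softmax!(similar(x), x)
```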

@ToucheSir (Member)
It looks like there is a whole body of literature on fast softmax approximations. I only read through https://arxiv.org/abs/2111.10770v1, but it has a nice list of prior art. Existing CPU-optimized softmax implementations such as oneDNN's may also be of interest.

@mcabbott (Member, Author) commented Jan 8, 2023

I had not looked, but I'm not surprised there's a literature by now! IIRC we dropped the NVIDIA one because it was slower than NNlib's.

The immediate goal, though, is to make this part small compared to the matmul & permutations. Going by my timings here, that takes the softmax from roughly 50% of the time down to 10%.

@mcabbott closed this Apr 30, 2023