Some speedup with SSE 4.1 #340
Nice one! With runtime dispatch, I think we can target interesting optimizations like this. Do you want me to benchmark it on Intel/AMD? I was thinking of working on ARM over the holidays, if I have some time 🙂
btw the
Considering the simde version you proposed, is this speedup obsolete? We could maybe have a runtime dispatcher.
It's by no means obsolete, but it would be desirable to have the CPU dispatcher. From experimenting with the strings effect, I found that loop unrolling can bring large speed benefits, more so when coupled with some inlining (in some cases greater than 4x on SSE, which might be explained by the latency of memory accesses or of individual instructions).
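To illustrate the unrolling idea mentioned above, here is a minimal scalar sketch. It is not the code from the strings effect; the function names `dot_simple` and `dot_unrolled4` are hypothetical. The unrolled version keeps four independent accumulators, so the additions within one iteration do not depend on each other, which is the kind of latency hiding the comment refers to.

```cpp
#include <cstddef>

// Baseline: a single accumulator, so each addition waits on the previous one.
inline float dot_simple(const float* a, const float* b, std::size_t n)
{
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

// Unrolled by 4 with independent accumulators (hypothetical helper).
// The four partial sums form separate dependency chains that the CPU
// can overlap; they are combined once at the end.
inline float dot_unrolled4(const float* a, const float* b, std::size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    float acc = (acc0 + acc1) + (acc2 + acc3);
    // Handle the remaining 0 to 3 elements.
    for (; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}
```

Because the summation order changes, the two functions can differ in the last bits for general floating-point input. Whether the unrolled form is actually faster depends on the compiler and the target CPU, so it should be benchmarked rather than assumed.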
Sure, I think the simde PR is fine now.
Dec 26, 2020 15:25:18 JP Cimalando <[email protected]>:
I'd like the same experiment to be tried with the resampler, but the simde PR should be dealt with first.
This speeds up `sample_quality=2` by 15 to 20%, using the SSE4.1 dot-product primitive and avoiding a bit of instruction latency. This is just for illustration; the optimization should be made CPU-dispatched.
Possibly `strings` can benefit from a similar optimization.