softgpu: Use SIMD more for dot products #17571
Small gains add up! |
Is there any particular need for explicit zeroing in
__m128 Dot33SSE4(__m128 a, __m128 b) {
    __m128 multiplied = _mm_mul_ps(a, b);                // [m0, m1, m2, m3]
    __m128 lanes3311 = _mm_movehdup_ps(multiplied);      // [m1, m1, m3, m3]
    __m128 partial = _mm_add_ps(multiplied, lanes3311);  // lane 0 = m0 + m1
    // movehl puts m2 in lane 0, so lane 0 of the result is m0 + m1 + m2.
    return _mm_add_ss(partial, _mm_movehl_ps(partial, multiplied));
}
There might well be, since x86 SIMD perf is quirky (and it does seem to introduce an extra move), but I do not see it offhand. It seems to be a slight improvement when measured on godbolt.
The pure SSE2 version
__m128 Dot33SSE2(__m128 a, __m128 b)
{
    __m128 v = _mm_mul_ps(a, b);                                  // [v0, v1, v2, v3]
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 0, 1));  // [v1, v0, v2, v3]
    __m128 sums = _mm_add_ps(v, shuf);                            // lane 0 = v0 + v1
    shuf = _mm_movehl_ps(shuf, shuf);                             // [v2, v3, v2, v3]
    return _mm_add_ss(sums, shuf);                                // lane 0 = v0 + v1 + v2
}
might be a bit slower.
Note: the code is inspired by https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026 . |
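A quick way to sanity-check a shuffle-based reduction like the SSE2 variant above is to compare it with a scalar reference. The following is only a minimal sketch, not code from this PR: `Dot3Scalar`, `Dot3Simd` and the `HAVE_SSE2_SKETCH` macro are hypothetical names introduced here, and the scalar fallback exists only so the snippet still builds on targets without SSE2.

```cpp
#include <cassert>
#include <cmath>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE ones)
#define HAVE_SSE2_SKETCH 1  // hypothetical macro, only used by this sketch
#endif

// Scalar reference: only x, y, z contribute; the fourth lane is ignored.
static float Dot3Scalar(const float a[4], const float b[4]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

#ifdef HAVE_SSE2_SKETCH
// Same shuffle sequence as the SSE2 variant quoted in the comment above.
static __m128 Dot33SSE2(__m128 a, __m128 b) {
    __m128 v = _mm_mul_ps(a, b);                                  // [v0, v1, v2, v3]
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 0, 1));  // [v1, v0, v2, v3]
    __m128 sums = _mm_add_ps(v, shuf);                            // lane 0 = v0 + v1
    shuf = _mm_movehl_ps(shuf, shuf);                             // [v2, v3, v2, v3]
    return _mm_add_ss(sums, shuf);                                // lane 0 = v0 + v1 + v2
}
#endif

// Hypothetical wrapper: loads two 4-float arrays and returns lane 0 of the result.
static float Dot3Simd(const float a[4], const float b[4]) {
#ifdef HAVE_SSE2_SKETCH
    return _mm_cvtss_f32(Dot33SSE2(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    return Dot3Scalar(a, b);  // fallback when SSE2 is not available
#endif
}
```

Calling `Dot3Simd` with a large value in the fourth lane of both inputs is a cheap check that the w lane never leaks into lane 0 of the result.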
Given that we probably don't match the hardware dot products to the bit level here, we might even be able to use |
Hm, |
Mostly I avoid dpps because I've found it to be slow in the past. Just checked on godbolt and I'm seeing it much slower (just casually replacing Dot33SSE4 with it and trying clang 16 as well). Maybe it depends on what runner it hits, because one time I did see it about the same speed. This latest runner seems to prefer the SSE2 code:
dpps 2.555 2.587 2.506 2.563 2.465
Was mostly trying to avoid the awful codegen in Dot(). Feel free to PR one of the other versions. The template stuff was more to convince MSVC to inline and use SSE regs more consistently.
There aren't processors without SSE4.1 that will be able to run softgpu at decent performance for most games AFAIK, so I mostly just want to keep it running there and am not worrying about perf on SSE2. But if we can avoid the annoying hoops, all the better.
-[Unknown] |
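For readers unfamiliar with the instruction, a dpps-based three-component dot product would presumably look something like the sketch below. This is an assumption about the variant being timed, not code from the PR: `Dot3Dpps` and `HAVE_DPPS_SKETCH` are hypothetical names, and the scalar fallback is there only so the snippet builds when SSE4.1 is not enabled (e.g. without `-msse4.1`).

```cpp
#include <cassert>

#if defined(__SSE4_1__) || defined(__AVX__)
#include <smmintrin.h>  // SSE4.1: _mm_dp_ps
#define HAVE_DPPS_SKETCH 1  // hypothetical macro, only used by this sketch
#endif

// Hypothetical helper: dot product of the first three lanes of a and b.
static float Dot3Dpps(const float a[4], const float b[4]) {
#ifdef HAVE_DPPS_SKETCH
    // Mask 0x71: the high nibble (0x7) selects lanes 0..2 for the multiply,
    // the low nibble (0x1) writes the sum to lane 0 and zeroes the others.
    __m128 r = _mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0x71);
    return _mm_cvtss_f32(r);
#else
    // Fallback so the sketch still compiles without SSE4.1 enabled.
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
#endif
}
```

The single instruction is shorter to write than the shuffle sequences, which is the appeal; whether it is faster depends on the CPU, as the timings in this comment suggest.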
Yes, I also seen I also don't understand the reason for
Well, Valkyria Chronicles II runs at about 8 FPS on my SSE4-less machine in SW mode, so yeah, sounds about right (somewhat usable for debug purposes, but not for normal play). Dithering has its charm though. |
Oh right, forgot that that's why we haven't used dpps so far. |
My understanding is that |
Simpler, lower requirements, and doesn't seem to hurt speed. See hrydgard#17571.
-[Unknown] |
Had this in a stash from before v1.15.x was released. It's a small gain, but it helps when there's a lot of vertex processing.
-[Unknown]