Convert Dot33 to SSE2 #17584

Merged · 1 commit · Jun 15, 2023
30 changes: 10 additions & 20 deletions GPU/Software/Lighting.cpp
@@ -255,23 +255,13 @@ static inline void LightColorSum(Vec4<int> &sum, const Vec4<int> &src) {
#endif
}

- #if defined(_M_SSE)
- #if defined(__GNUC__) || defined(__clang__) || defined(__INTEL_COMPILER)
- [[gnu::target("sse4.1")]]
- #endif
- static inline __m128 Dot33SSE4(__m128 a, __m128 b) {
- __m128 multiplied = _mm_insert_ps(_mm_mul_ps(a, b), _mm_setzero_ps(), 0x30);
- __m128 lanes3311 = _mm_movehdup_ps(multiplied);
- __m128 partial = _mm_add_ps(multiplied, lanes3311);
- return _mm_add_ss(partial, _mm_movehl_ps(lanes3311, partial));
- }
- #endif
-
- template <bool useSSE4>
static inline float Dot33(const Vec3f &a, const Vec3f &b) {
- #if defined(_M_SSE) && !PPSSPP_ARCH(X86)
- if (useSSE4)
- return _mm_cvtss_f32(Dot33SSE4(a.vec, b.vec));
+ #if defined(_M_SSE)
+ __m128 v = _mm_mul_ps(a.vec, b.vec); // [X, Y, Z, W]
Collaborator

This might crash MSVC builds on x86, especially Debug. Not really a huge issue, but vec is not guaranteed to be aligned if passed on the stack (the stack is only aligned to 16 on x86_64). I guess we could add a loadu and hope the compiler optimizes it out, but otherwise our "solve" has been to avoid intrinsics for x86_32 when they access XMM args directly.

-[Unknown]
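
A minimal sketch of the "add a loadu" workaround mentioned above (not something the PR does), assuming Vec3f keeps its data in the __m128 member vec, as shown in the diff:

static inline float Dot33(const Vec3f &a, const Vec3f &b) {
#if defined(_M_SSE)
	// Explicit unaligned loads: a Vec3f passed on the stack isn't guaranteed
	// 16-byte alignment on x86_32, and movups costs the same as movaps on
	// aligned data from Nehalem onwards.
	__m128 av = _mm_loadu_ps(reinterpret_cast<const float *>(&a.vec));
	__m128 bv = _mm_loadu_ps(reinterpret_cast<const float *>(&b.vec));
	__m128 v = _mm_mul_ps(av, bv); // [X, Y, Z, W]
	// ... rest of the SSE2 reduction as in the diff ...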

Owner

@hrydgard · Jun 16, 2023

Yeah, I forgot about it too when merging this. Funnily enough, it might not actually even be noticeable on recent CPUs, since Intel relaxed the alignment requirement for SSE memory operands a while back... but it will affect old ones, which are the more likely ones to run in 32-bit.

Owner

I'll revert just the ifdef change.

Collaborator

> Intel relaxed the alignment requirement for SSE memory operands a while back... but it will affect old ones, which are the more likely ones to run in 32-bit.

I could be wrong, as I've heard other people say this same thing recently. But I was pretty sure the relaxation applies ONLY when a VEX prefix is used (i.e. the instruction is encoded as AVX/AVX2, regardless of whether YMM registers are involved), and none of the non-jit code in PPSSPP is encoded that way, at least on Windows. At least that's the case on Coffee Lake for sure, which isn't that old.

-[Unknown]

Owner

Huh, OK. I guess I remembered that one wrong, then. I've avoided those accesses whenever I can anyway, of course.

Contributor Author

@fp64 · Jun 17, 2023

> I guess we could add a loadu and hope the compiler optimizes it out

I'll just mention that movups is as fast as movaps for aligned data starting with Nehalem (Core i7). In fact, compilers often use movups even when implementing _mm_load_ps or equivalent (copying, dereferencing). I think the guarantee is even stronger: you would need to cross a cache line boundary to get a slowdown.
See e.g.:
https://stackoverflow.com/questions/52147378/choice-between-aligned-vs-unaligned-x86-simd-instructions
https://stackoverflow.com/questions/42697118/visual-studio-2017-mm-load-ps-often-compiled-to-movups

Adding _mm_loadu_ps can also change addps reg,mem into movups reg,mem + addps reg,reg, which... might not actually be slower. MSVC (and GCC) seems rather decent at eliminating the extra movups reg,reg, judging from the simple examples I tried.

So the 'pepper everything with _mm_loadu_ps' approach doesn't seem to have significant drawbacks, other than visual litter.
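
A small illustration of that point (a sketch; the codegen claim follows the links above rather than anything verified here): with optimizations on, MSVC often emits the same movups + mulps for both of these functions, and on Nehalem or newer the movups is no slower than movaps when the data happens to be aligned.

#include <immintrin.h>

// Aligned load; per the links above, MSVC often emits movups for this anyway
// when it can't prove alignment.
static __m128 MulAligned(const float *p, __m128 x) {
	return _mm_mul_ps(_mm_load_ps(p), x);
}

// Unaligned load; per the discussion above, this typically ends up as the same
// movups + mulps sequence.
static __m128 MulUnaligned(const float *p, __m128 x) {
	return _mm_mul_ps(_mm_loadu_ps(p), x);
}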

Collaborator

Definitely compilers often optimize out loadu, but I have trust issues.

For example, here: on x86_64, ideally I don't want this coming from the stack/memory, I want it from a register. An example is Dot33<useSSE4>(H.NormalizedOr001(useSSE4), worldnormal). Realistically, worldnormal is probably spilled, but at least the normalized H (and H itself) could have been in XMM registers directly. I've seen cases where a loadu convinces MSVC that it ought to spill the normalized H (albeit to aligned memory) just to load it again later. I mean, the code DID say to load it from memory, so I guess it's not "wrong".

But some of that may have improved. I just feel like every time I look at a hot SIMD function that MSVC is optimizing poorly, it's because it spills things it should NOT spill like crazy, or even unvectorizes things nonsensically. So I don't want to give it excuses to do that to me.

-[Unknown]

Contributor Author

Makes sense.

I suppose it's possible to

#if PPSSPP_ARCH(X86) // Make sure MSVC loads aren't borked.
#define MAGIC_LOAD(v) _mm_loadu_ps(reinterpret_cast<const float*>(&(v)))
#else // Works on non-x86 too!
#define MAGIC_LOAD(v) (v)
#endif

assuming we care enough.
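
A hypothetical use at the Dot33 call site, if the macro sketched above were adopted:

	__m128 v = _mm_mul_ps(MAGIC_LOAD(a.vec), MAGIC_LOAD(b.vec)); // [X, Y, Z, W]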

+ __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 0, 1)); // [Y, X, Z, W]
+ __m128 sums = _mm_add_ps(v, shuf); // [X + Y, X + Y, Z + Z, W + W]
+ shuf = _mm_movehl_ps(shuf, shuf); // [Z, W, Z, W]
+ return _mm_cvtss_f32(_mm_add_ss(sums, shuf)); // X + Y + Z
#elif PPSSPP_ARCH(ARM64_NEON)
float32x4_t multipled = vsetq_lane_f32(0.0f, vmulq_f32(a.vec, b.vec), 3);
float32x2_t add1 = vget_low_f32(vpaddq_f32(multipled, multipled));
@@ -311,7 +301,7 @@ static void ProcessSIMD(VertexData &vertex, const WorldCoords &worldpos, const W
// TODO: Should this normalize (0, 0, 0) to (0, 0, 1)?
float d = L.NormalizeOr001();

- att = 1.0f / Dot33<useSSE4>(lstate.att, Vec3f(1.0f, d, d * d));
+ att = 1.0f / Dot33(lstate.att, Vec3f(1.0f, d, d * d));
if (!(att > 0.0f))
att = 0.0f;
else if (att > 1.0f)
@@ -320,7 +310,7 @@ static void ProcessSIMD(VertexData &vertex, const WorldCoords &worldpos, const W

float spot = 1.0f;
if (lstate.spot) {
- float rawSpot = Dot33<useSSE4>(lstate.spotDir, L);
+ float rawSpot = Dot33(lstate.spotDir, L);
if (std::isnan(rawSpot))
rawSpot = std::signbit(rawSpot) ? 0.0f : 1.0f;

@@ -345,7 +335,7 @@ static void ProcessSIMD(VertexData &vertex, const WorldCoords &worldpos, const W
// diffuse lighting
float diffuse_factor;
if (lstate.diffuse || lstate.specular) {
- diffuse_factor = Dot33<useSSE4>(L, worldnormal);
+ diffuse_factor = Dot33(L, worldnormal);
if (lstate.poweredDiffuse) {
diffuse_factor = pspLightPow(diffuse_factor, state.specularExp);
}
@@ -363,7 +353,7 @@ static void ProcessSIMD(VertexData &vertex, const WorldCoords &worldpos, const W
if (lstate.specular && diffuse_factor >= 0.0f) {
Vec3<float> H = L + Vec3<float>(0.f, 0.f, 1.f);

- float specular_factor = Dot33<useSSE4>(H.NormalizedOr001(useSSE4), worldnormal);
+ float specular_factor = Dot33(H.NormalizedOr001(useSSE4), worldnormal);
specular_factor = pspLightPow(specular_factor, state.specularExp);

if (specular_factor > 0.0f) {