
getAlignedMutexWithInit is extremely slow #522

Closed
alyssarosenzweig opened this issue Feb 21, 2023 · 5 comments

@alyssarosenzweig
Contributor

On a workload I'm testing, Box64 (with #521) is slow. Profiling shows an inordinate amount of time spent inside getAlignedMutexWithInit, seemingly because the application constantly creates and destroys pthread mutexes. That's a silly thing for the application to do, of course, but if I had the application source code to fix, I would just recompile for arm64 😉

The following patch massively improves performance -- 10x the fps in some cases -- taking the application from a slideshow to a smooth 60 fps.

 pthread_mutex_t* getAlignedMutex(pthread_mutex_t* m)
 {
-       return getAlignedMutexWithInit(m, 1);
+      return m;
 }
Of course, this patch is probably wrong, since now we're potentially doing operations on unaligned mutexes, which I assume could blow up at any moment. Though it does seem to work fine for this application, somehow.

I see potentially two issues here:

  1. box64 might be wrapping mutexes to ensure ABI compatibility in places where it isn't actually necessary (otherwise I would expect the above patch to blow up spectacularly; maybe I'm just getting lucky).
  2. box64's wrapped mutex code itself is inefficient. I don't really understand why the code is doing what it is, so I can't suggest good fixes. I see that NewMutex open-codes an arena allocator based on a global unrolled linked list; I don't know whether that's supposed to be an optimization over a plain calloc or whether there's some subtle correctness issue here. If this really is the way to go, then mutexes should live in thread-local storage to eliminate the mutex_mutexes locking, which is the heaviest-weight operation here, and taken should be replaced with a bitset so that a mutex can be allocated within a block in constant time instead of with a linear search. Neither change should be difficult, but I don't want to attempt them before I understand the bigger picture of what this file is for.

Bug report aside, it was pretty magical to see this old application come to life on my M1 Linux, with full GPU acceleration and everything. If this is where we are now, I can't wait to see where we'll be once we have 4K pages + TSO sorted in the kernel 😄

@ptitSeb
Owner

ptitSeb commented Feb 21, 2023

The issue with mutexes is the sizeof(pthread_mutex_t): it's 48 on aarch64 Linux, but only 40 on x86_64. If you leave the pthread_mutex functions unwrapped, they might (will) overwrite some important stuff at some point.

That's why I wrapped the pthread_mutex_t structure, in an attempt at fast wrapping. It seems it's not fast. I'm unsure what the best solution is, because a program can create a temporary pthread_mutex_t directly on the stack that is never destroyed, leaking resources, since I create an actual mutex elsewhere for each wrapped mutex.

After further analysis, it seems the underlying struct __pthread_mutex_s is exactly the same on both architectures, but for some reason __SIZEOF_PTHREAD_MUTEX_T is defined to 48 instead of 40.
In that case, only the init (and maybe the destroy?) of a mutex needs to be taken care of... I'll work on an alternate way to wrap mutexes (I don't remember ever seeing an unaligned mutex in the x86_64 world; in i386, yes, for sure, but in x86_64, probably not).

(Ah yes, Mac M1 :) that's some nice hardware, glad to see some linux+mesa+box64 working fine there 👍 I need to recheck how things run there)

@alyssarosenzweig
Contributor Author

The issue with mutexes is the sizeof(pthread_mutex_t): it's 48 on aarch64 Linux, but only 40 on x86_64. If you leave the pthread_mutex functions unwrapped, they might (will) overwrite some important stuff at some point... the underlying struct __pthread_mutex_s is exactly the same on both architectures, but for some reason __SIZEOF_PTHREAD_MUTEX_T is defined to 48 instead of 40.

I'm not convinced of this: if the alignment difference doesn't matter, and nothing on the arm side writes (or reads) the last 8 bytes of the structure (because it is purely padding that exists for alignment and nothing else), then it should be safe, no?

@ptitSeb
Owner

ptitSeb commented Feb 21, 2023

Yes, exactly. Only the init writes the last 8 bytes. So I have a test build where I removed all the heavy machinery I wrote previously and just use the "naked" mutex as-is, except on init, where I take care of the trailing 8 bytes. And indeed, it seems to work fine. All the Unity3D games I tried worked, and Java also seems happy with it. I'll check a few more games with custom engines and some Windows stuff, then push that soon.

@ptitSeb
Owner

ptitSeb commented Feb 21, 2023

Here it is. As you can see, it's much simpler now, just taking care of the init part. You should get your speed back.

I'm curious: what application is it that creates so many mutexes?

@alyssarosenzweig
Contributor Author

Yep, that solves the issue for me. Thank you! 👍
