
getAlignedMutexWithInit is extremely slow #522

Closed
alyssarosenzweig opened this issue Feb 21, 2023 · 5 comments

@alyssarosenzweig
Contributor

On a workload I'm testing, Box64 (with #521) is slow. Profiling shows an inordinate amount of time spent inside getAlignedMutexWithInit, seemingly because the application constantly creates and destroys pthread mutexes. That's a silly thing for the application to do, of course, but if I had the application source code to fix, I would just recompile for arm64 😉

The following patch massively improves performance -- 10x the fps in some cases -- taking the application from a slideshow to a smooth 60 fps.

 pthread_mutex_t* getAlignedMutex(pthread_mutex_t* m)
 {
-       return getAlignedMutexWithInit(m, 1);
+      return m;
 }
Of course, this patch is probably wrong, since now we're potentially doing operations on unaligned mutexes, which I assume could blow up at any moment. Though it does seem to work fine for this application, somehow.

I see potentially two issues here:

  1. box64 might be wrapping mutexes to ensure ABI compatibility in places where it isn't actually necessary (otherwise I would expect the above patch to blow up spectacularly; maybe I'm just getting lucky).
  2. box64's wrapped mutex code itself is inefficient. I don't really understand why the code is doing what it is, so I can't suggest good fixes. I see that NewMutex open-codes an arena allocator based on a global unrolled linked list; I don't know whether that's supposed to be an optimization over a plain calloc or whether there's some subtle correctness issue here. If this really is the way to go, then mutexes should live in thread-local storage to eliminate the mutex_mutexes locking, which is the heaviest-weight operation here, and taken should be replaced with a bitset so that a mutex can be allocated within a block in constant time instead of with a linear search. Neither change should be difficult, but I don't want to attempt them before I understand the bigger picture of what this file is for.

Bug report aside, it was pretty magical to see this old application come to life on my M1 Linux, with full GPU acceleration and everything. If this is where we are now, I can't wait to see where we'll be once we have 4K pages + TSO sorted in the kernel 😄

@ptitSeb
Owner

ptitSeb commented Feb 21, 2023

The issue with mutexes is the sizeof(pthread_mutex_t): it's 48 on aarch64 Linux, but only 40 on x86_64. If you leave the pthread_mutex functions unwrapped, they might (will) overwrite some important stuff at some point.

That's why I wrapped the pthread_mutex_t structure, in an attempt at fast wrapping. It seems it's not fast. I'm unsure what the best solution is, because a program can create a temporary pthread_mutex_t directly on the stack that is never destroyed, leaking resources, since I create an actual mutex elsewhere for each wrapped mutex.

After further analysis, it seems the underlying struct __pthread_mutex_s is exactly the same on both architectures, but for some reason __SIZEOF_PTHREAD_MUTEX_T is defined to 48 instead of 40.
In that case, only the init (and maybe the destroy?) of a mutex needs to be taken care of... I'll work on an alternate way to wrap mutexes (I don't remember ever seeing an unaligned mutex in the x86_64 world; in i386, yes, for sure, but in x86_64, probably not).

(Ah yes, Mac M1 :) that's some nice hardware, glad to see some linux+mesa+box64 working fine there 👍 I need to recheck how things run there)

@alyssarosenzweig
Contributor Author

The issue with mutexes is the sizeof(pthread_mutex_t): it's 48 on aarch64 Linux, but only 40 on x86_64. If you leave the pthread_mutex functions unwrapped, they might (will) overwrite some important stuff at some point... the underlying struct __pthread_mutex_s is exactly the same on both architectures, but for some reason __SIZEOF_PTHREAD_MUTEX_T is defined to 48 instead of 40.

I'm not convinced of this: if the alignment difference doesn't matter, and nothing on the arm side writes (or reads) the last 8 bytes of the structure (because it is purely padding that exists for alignment and nothing else), then it should be safe, no?

@ptitSeb
Owner

ptitSeb commented Feb 21, 2023

Yes, exactly. Only the init writes the last 8 bytes. So I have a test build where I removed all the heavy machinery I wrote previously and just use the "naked" mutex as-is, except on init, where I take care of the trailing 8 bytes. And indeed, it seems to work fine. All the Unity3D games I tried worked, and Java also seems happy with it. I'll check a few more games with custom engines and some Windows stuff, then push that soon.

@ptitSeb
Owner

ptitSeb commented Feb 21, 2023

Here it is. As you can see, it's much simpler now, just taking care of the init part. You should get your speed back.

I'm curious: what application is it that creates so many mutexes?

@alyssarosenzweig
Contributor Author

Yep, that solves the issue for me. Thank you! 👍
