getAlignedMutexWithInit is extremely slow #522
Comments
The issue with mutex is the That why I wrapped the After further analysis, it seems the underlying (Ah yes, Mac M1 :) that's some nice hardware, glad to see some linux+mesa+box64 working fine there 👍 I need to recheck how things run there)
Not convinced of this, if the alignment difference doesn't matter, and nothing on the arm side writes (reads) the last 8 bytes of the structure (because it is purely padding that only exists for alignment and nothing else), then it should be safe, no?
Yes, exactly. Only the init writes the last 8 bytes. So, I have a test build where I removed all the heavy stuff I wrote previously and just use the "naked" mutex as-is, except on init, where I take care of the trailing 8 bytes. And indeed it seems to work fine. All the Unity3D games I tried worked, and Java also seems happy with it. I'll check a few more games with custom engines and some Windows stuff and push that soon.
Here it is. As you can see, it's much simpler now: it just takes care of the init part. You should get your speed back now. I'm curious, what application is it that creates so many mutexes?
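For readers following along, here is a minimal sketch of the init-only handling described in the comments above. It is not the actual box64 change. The sizes (`GUEST_MUTEX_SIZE`, `HOST_MUTEX_SIZE`) and the function names `host_mutex_init` and `guest_mutex_init` are assumptions made up for illustration, and a plain `memset` stands in for the host's real init routine. The idea is only that the init call may write 8 bytes past the guest-sized structure, so the wrapper saves and restores them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed sizes, for illustration only: the guest (x86_64) mutex is taken
 * to be 8 bytes shorter than the host (arm64) one. */
enum { GUEST_MUTEX_SIZE = 40, HOST_MUTEX_SIZE = 48 };
enum { TAIL_SIZE = HOST_MUTEX_SIZE - GUEST_MUTEX_SIZE };

/* Hypothetical stand-in for the host init: it writes the full host size,
 * including the 8 bytes that lie beyond the guest structure. */
static void host_mutex_init(void *m)
{
    memset(m, 0, HOST_MUTEX_SIZE);
}

/* Init wrapper: save the bytes that follow the guest-sized mutex, let the
 * host init run on the mutex in place, then restore the saved bytes. */
static void guest_mutex_init(void *m)
{
    unsigned char saved[TAIL_SIZE];
    unsigned char *tail;

    assert(m != NULL);
    tail = (unsigned char *)m + GUEST_MUTEX_SIZE;

    memcpy(saved, tail, TAIL_SIZE);   /* whatever the guest stored there */
    host_mutex_init(m);               /* may clobber the trailing bytes  */
    memcpy(tail, saved, TAIL_SIZE);   /* put the guest data back         */
}
```

Every other operation (lock, unlock, destroy) would then act on the guest mutex directly, which is why the per-call lookup cost disappears. This assumes, as discussed above, that nothing other than init touches the trailing bytes.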
Yep, that solves the issue for me. Thank you! 👍
On a workload I'm testing, Box64 (with #521) is slow. Profiling shows that an inordinate amount of time is spent within getAlignedMutexWithInit, seemingly due to the application constantly creating and destroying pthread mutexes. That's a silly thing for the application to do, of course, but if I had the application source code to fix, I would just recompile for arm64 😉
The following patch massively improves performance -- 10x fps in some cases -- taking the application from slideshow to a smooth 60fps.
Of course, this patch is probably wrong, since now we're potentially doing operations on unaligned mutexes, which I assume could blow up at any moment. Though it does seem to work fine for this application, somehow.
I see potentially two issues here:
- `mutexes` should be in thread-local storage to eliminate the `mutex_mutexes` locking, which is the heaviest weight operation here, and
- `taken` should be replaced with a bitset to allocate mutexes within the block in constant-time instead of the linear-time search.

Neither of these changes should be difficult but I don't want to attempt them before I understand the bigger picture of what this file is for.

Bug report aside, it was pretty magical to see this old application come to life on my M1 Linux, with full GPU acceleration and everything. If this is where we are now, I can't wait to see where we'll be once we have 4K pages + TSO sorted in the kernel 😄
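To make the bitset suggestion above concrete, here is a rough sketch of constant-time slot allocation within a block. It is not code from box64: `slot_block_t`, `block_reset`, `block_alloc` and `block_free` are hypothetical names, the block is assumed to hold 64 slots so that one `uint64_t` covers it, and `__builtin_ctzll` is a GCC/Clang builtin rather than standard C.

```c
#include <assert.h>
#include <stdint.h>

enum { SLOTS_PER_BLOCK = 64 };

/* Hypothetical block descriptor: bit i set means slot i is free. */
typedef struct {
    uint64_t free_mask;
} slot_block_t;

/* Mark every slot in the block as free. */
static void block_reset(slot_block_t *b)
{
    b->free_mask = ~UINT64_C(0);
}

/* Take the lowest free slot in constant time.
 * Returns the slot index, or -1 when the block is full. */
static int block_alloc(slot_block_t *b)
{
    int slot;

    if (b->free_mask == 0)
        return -1;
    slot = __builtin_ctzll(b->free_mask);  /* index of lowest set bit */
    b->free_mask &= b->free_mask - 1;      /* clear that bit          */
    return slot;
}

/* Return a slot to the block. */
static void block_free(slot_block_t *b, int slot)
{
    assert(slot >= 0 && slot < SLOTS_PER_BLOCK);
    b->free_mask |= UINT64_C(1) << slot;
}
```

A caller would run `block_reset` once, then pair each `block_alloc` with a later `block_free`. The sketch leaves out the thread-local part of the suggestion, which would also need to handle a mutex being destroyed on a different thread than the one that created it.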