Vulkan: Parallel pipeline creation #16802

Merged
merged 5 commits into from
Feb 1, 2023

Conversation

hrydgard
Owner

@hrydgard hrydgard commented Jan 12, 2023

Resurrected this old code in my efforts to make things comfortably playable on the A21 - see #16567.

I previously added parallel shader compilation, but that only took care of the GLSL generation and compilation to SPIR-V. This calls the actual pipeline creation in parallel on the threadpool.

I didn't expect it, but unlike Mali, the PowerVR driver benefits immensely from parallel pipeline creation. With this, shader stutter on the device is finally almost kinda bearable. Almost.

However, I can't get this in just yet: there's a weird deadlock to debug, which happens mostly when a lot of shaders are created at once. I haven't gotten it to happen on PC yet. Some kind of notify thing maybe?

Hm, since fixing #16804 and rebasing, I haven't seen the hang again, I think. We should probably just get this in and see what happens...

It might be a good idea to disable this on GPUs where we know it doesn't help, too.

Also, at first I thought about deleting the "compiler thread" as well, but it now performs the pretty important job of "smartly" distributing the compile jobs after implicitly collecting bunches of them. Though I'm not sure the benefit of that is very great...
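
Editor's sketch of what "implicitly collecting bunches" can look like: a queue whose consumer takes everything that piled up while it was busy, then splits that batch into a few chunks for worker tasks. This is a minimal illustration under assumptions, not PPSSPP's actual compile thread. `BatchQueue`, `TakeBatch`, and `SplitBatch` are hypothetical names, and jobs are plain `int`s standing in for pipeline compile requests.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical job queue; jobs are ints standing in for compile requests.
class BatchQueue {
public:
    void Push(int job) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            pending_.push_back(job);
        }
        cond_.notify_one();
    }

    void Close() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            closed_ = true;
        }
        cond_.notify_one();
    }

    // Blocks until something is pending (or the queue is closed), then takes
    // everything queued so far in one go. An empty result means closed and drained.
    std::vector<int> TakeBatch() {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return closed_ || !pending_.empty(); });
        std::vector<int> batch;
        batch.swap(pending_);
        return batch;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::vector<int> pending_;
    bool closed_ = false;
};

// Splits a batch into at most maxChunks contiguous chunks of near-equal size,
// one chunk per worker task.
std::vector<std::vector<int>> SplitBatch(const std::vector<int> &batch, std::size_t maxChunks) {
    assert(maxChunks > 0);
    std::vector<std::vector<int>> chunks;
    if (batch.empty())
        return chunks;
    const std::size_t count = std::min(maxChunks, batch.size());
    const std::size_t base = batch.size() / count;
    const std::size_t extra = batch.size() % count;
    std::size_t pos = 0;
    for (std::size_t i = 0; i < count; i++) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        chunks.emplace_back(batch.begin() + pos, batch.begin() + pos + len);
        pos += len;
    }
    return chunks;
}
```

The batching is "implicit" because nothing decides on a batch size: the longer the consumer is busy, the more jobs it finds waiting on its next `TakeBatch()` call.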

@hrydgard hrydgard added this to the v1.15.0 milestone Jan 12, 2023
@hrydgard hrydgard marked this pull request as draft January 12, 2023 14:34
@hrydgard hrydgard force-pushed the parallel-pipeline-creation branch from 535caa3 to 0237606 Compare January 13, 2023 09:46
@hrydgard hrydgard marked this pull request as ready for review January 13, 2023 15:00
@hrydgard
Owner Author

Oh, I managed to reproduce the hang on Mac! The callstacks were a bit unexpected:

0   libsystem_kernel.dylib        	       0x193041564 __psynch_cvwait + 8
1   libsystem_pthread.dylib       	       0x19307d638 _pthread_cond_wait + 1232
2   libc++.1.dylib                	       0x192fcaac4 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 28
3   PPSSPPSDL                     	       0x1045c530c WaitableCounter::Wait() + 64
4   PPSSPPSDL                     	       0x1045c4f44 ParallelRangeLoop(ThreadManager*, std::__1::function<void (int, int)> const&, int, int, int) + 132
5   PPSSPPSDL                     	       0x104308c3c ElfReader::LoadRelocations(Elf32_Rel const*, int) + 156
6   PPSSPPSDL                     	       0x104309cb0 ElfReader::LoadInto(unsigned int, bool) + 2444
7   PPSSPPSDL                     	       0x104381014 __KernelLoadELFFromPtr(unsigned char const*, unsigned long, unsigned int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, unsigned int*, unsigned int&) + 2420
8   PPSSPPSDL                     	       0x104384284 sceKernelLoadModule(char const*, unsigned int, unsigned int) + 1704
9   PPSSPPSDL                     	       0x104387ae4 void WrapU_CUU<&(sceKernelLoadModule(char const*, unsigned int, unsigned int))>() + 56
10  PPSSPPSDL                     	       0x104324080 CallSyscallWithoutFlags(HLEFunction const*) + 56
11  ???                           	       0x115e3ce4c ???
12  PPSSPPSDL                     	       0x104419548 MIPSState::RunLoopUntil(unsigned long long) + 300
13  PPSSPPSDL                     	       0x10444aba0 PSP_RunLoopWhileState() + 96
14  PPSSPPSDL                     	       0x1041c41f8 EmuScreen::render() + 24
Thread 3:: PoolWorker 1
0   libsystem_kernel.dylib        	       0x193041564 __psynch_cvwait + 8
1   libsystem_pthread.dylib       	       0x19307d638 _pthread_cond_wait + 1232
2   libc++.1.dylib                	       0x192fcaac4 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 28
3   PPSSPPSDL                     	       0x10448730c Promise<VkShaderModule_T*>::BlockUntilReady() + 100
4   PPSSPPSDL                     	       0x1045a0710 VKRGraphicsPipeline::Create(VulkanContext*, VkRenderPass_T*, RenderPassType, VkSampleCountFlagBits, double) + 180
5   PPSSPPSDL                     	       0x1045a653c CreateMultiPipelinesTask::Run() + 52
6   PPSSPPSDL                     	       0x1045c6250 WorkerThreadFunc(GlobalThreadContext*, ThreadContext*) + 292
7   PPSSPPSDL                     	       0x1045c7228 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(GlobalThreadContext*, ThreadContext*), GlobalThreadContext*, ThreadContext*> >(void*) + 48
Thread 6:: PoolWorker 4
0   libsystem_kernel.dylib        	       0x193040a1c __psynch_mutexwait + 8
1   libsystem_pthread.dylib       	       0x19307a144 _pthread_mutex_firstfit_lock_wait + 84
2   libsystem_pthread.dylib       	       0x193077a9c _pthread_mutex_firstfit_lock_slow + 248
3   libc++.1.dylib                	       0x192fcc9e8 std::__1::mutex::lock() + 16
4   PPSSPPSDL                     	       0x1044872cc Promise<VkShaderModule_T*>::BlockUntilReady() + 36
5   PPSSPPSDL                     	       0x1045a0710 VKRGraphicsPipeline::Create(VulkanContext*, VkRenderPass_T*, RenderPassType, VkSampleCountFlagBits, double) + 180
6   PPSSPPSDL                     	       0x1045a653c CreateMultiPipelinesTask::Run() + 52
7   PPSSPPSDL                     	       0x1045c6250 WorkerThreadFunc(GlobalThreadContext*, ThreadContext*) + 292
8   PPSSPPSDL                     	       0x1045c7228 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(GlobalThreadContext*, ThreadContext*), GlobalThreadContext*, ThreadContext*> >(void*) + 48
Thread 8:: PoolWorker 6
0   libsystem_kernel.dylib        	       0x193041564 __psynch_cvwait + 8
1   libsystem_pthread.dylib       	       0x19307d638 _pthread_cond_wait + 1232
2   libc++.1.dylib                	       0x192fcaac4 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 28
3   PPSSPPSDL                     	       0x10448730c Promise<VkShaderModule_T*>::BlockUntilReady() + 100
4   PPSSPPSDL                     	       0x1045a0710 VKRGraphicsPipeline::Create(VulkanContext*, VkRenderPass_T*, RenderPassType, VkSampleCountFlagBits, double) + 180
5   PPSSPPSDL                     	       0x1045a653c CreateMultiPipelinesTask::Run() + 52
6   PPSSPPSDL                     	       0x1045c6250 WorkerThreadFunc(GlobalThreadContext*, ThreadContext*) + 292
7   PPSSPPSDL                     	       0x1045c7228 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(GlobalThreadContext*, ThreadContext*), GlobalThreadContext*, ThreadContext*> >(void*) + 48
8   libsystem_pthread.dylib       	       0x19307d06c _pthread_start + 148
9   libsystem_pthread.dylib       	       0x193077e2c thread_start + 8

Thread 9:: PoolWorker 7
0   libsystem_kernel.dylib        	       0x193041564 __psynch_cvwait + 8
1   libsystem_pthread.dylib       	       0x19307d638 _pthread_cond_wait + 1232
2   libc++.1.dylib                	       0x192fcaac4 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 28
3   PPSSPPSDL                     	       0x10448730c Promise<VkShaderModule_T*>::BlockUntilReady() + 100
4   PPSSPPSDL                     	       0x1045a0710 VKRGraphicsPipeline::Create(VulkanContext*, VkRenderPass_T*, RenderPassType, VkSampleCountFlagBits, double) + 180
5   PPSSPPSDL                     	       0x1045a653c CreateMultiPipelinesTask::Run() + 52
6   PPSSPPSDL                     	       0x1045c6250 WorkerThreadFunc(GlobalThreadContext*, ThreadContext*) + 292
7   PPSSPPSDL                     	       0x1045c7228 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(GlobalThreadContext*, ThreadContext*), GlobalThreadContext*, ThreadContext*> >(void*) + 48
8   libsystem_pthread.dylib       	       0x19307d06c _pthread_start + 148
9   libsystem_pthread.dylib       	       0x193077e2c thread_start + 8
Thread 24:: RenderMan
0   libsystem_kernel.dylib        	       0x193041564 __psynch_cvwait + 8
1   libsystem_pthread.dylib       	       0x19307d638 _pthread_cond_wait + 1232
2   libc++.1.dylib                	       0x192fcaac4 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 28
3   PPSSPPSDL                     	       0x1045a1b24 VulkanRenderManager::ThreadFunc() + 100
4   PPSSPPSDL                     	       0x1045a6be0 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (VulkanRenderManager::*)(), VulkanRenderManager*> >(void*) + 64
5   libsystem_pthread.dylib       	       0x19307d06c _pthread_start + 148
6   libsystem_pthread.dylib       	       0x193077e2c thread_start + 8

Thread 25:: ShaderCompile
0   libsystem_kernel.dylib        	       0x193041564 __psynch_cvwait + 8
1   libsystem_pthread.dylib       	       0x19307d638 _pthread_cond_wait + 1232
2   libc++.1.dylib                	       0x192fcaac4 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 28
3   PPSSPPSDL                     	       0x1045a1d5c VulkanRenderManager::CompileThreadFunc() + 192
4   PPSSPPSDL                     	       0x1045a6be0 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (VulkanRenderManager::*)(), VulkanRenderManager*> >(void*) + 64
5   libsystem_pthread.dylib       	       0x19307d06c _pthread_start + 148
6   libsystem_pthread.dylib       	       0x193077e2c thread_start + 8

(skipped all the threads that didn't seem to do anything interesting)

Very odd.

@hrydgard
Owner Author

I think this might be an issue where the thread pool is already full of blocked tasks, all waiting for a task that's queued on the same thread pool and so never gets a free worker. Ugh, might need some kind of priority scheme?

@anr2me
Collaborator

anr2me commented Jan 17, 2023

So threads that are waiting aren't suspended or rescheduled to let other threads in the threadpool (which might be the ones being waited for) run?

@unknownbrackets
Collaborator

Correct, they aren't. That would require a multiplexing system like libuv or something (which is how Node.js works). PPSSPP's thread manager simply schedules tasks, and each worker thread waits for its task to complete before it can run more. If the task waits for something else, that worker stays blocked until the wait is done.

-[Unknown]

hrydgard added a commit that referenced this pull request Jan 31, 2023
…r interface.

Useful for things that should be run ASAP even if the threadpool is full,
at a small extra cost. (Not recommended for very small tasks).

Considering using this to resolve the deadlocks in #16802.
@hrydgard hrydgard force-pushed the parallel-pipeline-creation branch from 0237606 to d772b7b Compare January 31, 2023 11:26
@hrydgard
Owner Author

I did a bit of a brute force solution, where the tasks that might be depended on by other tasks (the VkShaderModule creation) are now run on dedicated threads instead of on the pool. Overhead should be pretty negligible compared to the shader build costs.

This seems fairly solid now, will test it a little more though.
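
Editor's sketch of the dedicated-thread idea: the producer of a value that other tasks wait on gets its own thread, so it can always make progress even when every pool worker is blocked. This is a minimal illustration under assumptions, not the code from this PR. `DedicatedPromise` is a hypothetical name, and only `BlockUntilReady` echoes a name seen in the callstacks above.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical promise whose producer always runs on its own thread.
template <class T>
class DedicatedPromise {
public:
    explicit DedicatedPromise(std::function<T()> fn)
        : thread_([this, fn = std::move(fn)] {
              T value = fn();  // The expensive work, e.g. building a shader module.
              {
                  std::lock_guard<std::mutex> guard(mutex_);
                  value_ = std::move(value);
                  ready_ = true;
              }
              cond_.notify_all();
          }) {}

    // Joining here guarantees the producer thread never outlives the promise.
    ~DedicatedPromise() { thread_.join(); }

    DedicatedPromise(const DedicatedPromise &) = delete;
    DedicatedPromise &operator=(const DedicatedPromise &) = delete;

    // Safe to call from a pool worker: the producer does not need a pool slot.
    T BlockUntilReady() {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return ready_; });
        assert(ready_);
        return value_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    bool ready_ = false;
    T value_{};
    std::thread thread_;  // Declared last so the other members exist before the thread starts.
};
```

The cost is one thread creation per task, which matches the trade-off described above: small next to a shader build, but not worth paying for very small tasks.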

@hrydgard hrydgard force-pushed the parallel-pipeline-creation branch from d772b7b to a67604d Compare February 1, 2023 10:43
@hrydgard
Owner Author

hrydgard commented Feb 1, 2023

This seems solid now, and it's really night and day on PowerVR devices. ARM Mali devices aren't expected to benefit much until the very latest drivers unfortunately. Will test a little more and then go for it.

@ghost

ghost commented Feb 1, 2023

This seems solid now, and it's really night and day on PowerVR devices. ARM Mali devices aren't expected to benefit much until the very latest drivers unfortunately. Will test a little more and then go for it.

How about Adreno GPUs?

@hrydgard
Owner Author

hrydgard commented Feb 1, 2023

Just tried, seems to be noticeably better on Adreno as well.

On my Poco F1, shader stutter isn't entirely gone, but there's definitely less of it.

@hrydgard hrydgard merged commit 2ed88a8 into master Feb 1, 2023
@hrydgard hrydgard deleted the parallel-pipeline-creation branch February 1, 2023 11:23
@ghost ghost mentioned this pull request Mar 14, 2023