multithreading in vulkan rendererdevice #7163
-
Incidentally, you can see the issue with only allowing one draw list per device by turning on multithreaded rendering and trying anything with draw lists - bad things happen any time you call draw_list_begin or draw_list_end on the main render device (in my case it usually crashes after about 3 or 4 frames).
-
Okay, current state of this now that I've had a good play with / look at the internals of the Vulkan RenderingDevice / context.

Current state of things

CPU side

The mutex locking works like this: draw_list_begin() takes the lock and draw_list_end() releases it. This means only one thread can be in between draw_list_begin and draw_list_end at a time.

GPU side

For single pass / non-split draw lists, every draw_list command goes into one of two Vulkan command buffers - either the rendering or the setup command buffer. This is then chucked into a single Vulkan queue. What this means is that if you submit a bunch of things to the global RenderingDevice, they will run in order of submission. This is inefficient. It also means that things which don't depend on each other may hold each other up until completed. Oh, and long-running compute tasks will hold up even rendering work on the same device that doesn't necessarily depend on them. This is a proper big performance issue for using the low-level rendering commands, even if you're not submitting from multiple threads.

Solutions

For performance purposes, the key improvement that could be made here is to submit each draw list as a separate Vulkan command buffer. There are two things to think about here - the Vulkan command pool and the command buffer. For anyone who doesn't know, the command pool is roughly the memory that command buffers are allocated out of. Both command pools and the command buffers allocated from them are externally synchronised, i.e. they can only be messed with from one thread at a time, and the application (us) is responsible for ensuring that. There are three options here:
Possible implementations

So, personally I think 1 or 3 are the best options. Implementation would be as follows:

For (1), make draw_list_begin create a new command buffer (from the existing command pool). draw_list_end would submit it with a fence, so we can tell when the command buffer is no longer in use. We could keep a set of command buffers to reuse, with submitted ones being checked to see if they can be reused when draw_list_begin is called. Mutex locking remains as it is currently.

For (3), we'd need a structure to keep track of command buffers and their associated pools, fences, mutexes etc. Something like the one below. Then we'd keep hold of a set of them, so as not to have to recreate everything on every call to draw_list_begin, only making new ones if there are no free ones left.
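A rough sketch of that tracking structure (illustrative only - the names, e.g. DrawListCommandBuffer and in_flight, are made up rather than actual Godot internals):

```cpp
#include <vulkan/vulkan.h>
#include <mutex>

// Hypothetical per-draw-list tracking structure (option 3 sketch).
// Each instance owns its own pool, so recording can happen on whichever
// single thread holds it without touching other draw lists' command memory.
struct DrawListCommandBuffer {
	VkCommandPool pool = VK_NULL_HANDLE;      // pool this buffer was allocated from
	VkCommandBuffer command_buffer = VK_NULL_HANDLE;
	VkFence fence = VK_NULL_HANDLE;           // signalled when the GPU has finished with it
	std::mutex mutex;                         // guards recording from the owning thread
	bool in_flight = false;                   // true between submit and fence signal

	// True if the GPU is done with this entry and it can be recycled
	// by the next draw_list_begin call.
	bool is_reusable(VkDevice device) const {
		return !in_flight || vkGetFenceStatus(device, fence) == VK_SUCCESS;
	}
};
```

draw_list_begin would then look through the set for an entry whose is_reusable() returns true (resetting it before handing it out), and only create a new one if none are free, matching the reuse scheme described above.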
@clayjohn - any comments on this welcome - I might have some time this weekend or next week to fiddle around if it makes sense.
-
Oh, one other thought on this - draw_list_end, draw_list_next_render_pass etc. are all missing a DrawListID parameter. I need to change that for multithreaded rendering to work, which makes this a GDScript API change. It will need doing anyway, however this goes.
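As a rough sketch of what that change looks like (assumed signatures for illustration only; other parameters are omitted and final names may differ):

```cpp
// Today (roughly): these operate on the single active draw list, so they
// take no DrawListID and cannot distinguish lists begun on different threads.
void draw_list_end();
void draw_list_next_render_pass();

// With the proposed change: every call names the draw list it operates on,
// so multiple threads can each drive their own list.
void draw_list_end(DrawListID p_draw_list);
void draw_list_next_render_pass(DrawListID p_draw_list);
```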
-
Some thoughts here.
So, to me, for the reasons mentioned above, this proposal does not add any real value and only makes the implementation significantly more complex for synchronization and validation.
-
There is also another problem with using a modern rendering API like Vulkan in a multi-threaded fashion for recording command buffers, one that is not obvious (I tried to do this at some point and missed it too): barrier layout changes.

If you use the same resource in multiple threads while recording command buffers and this requires a layout change, the layout change in Vulkan requires both the previous and the next layout in the command (because Vulkan itself is stateless). As such, you have to transition from the right layout to the right layout. If you do this in multiple threads, you have no idea what layout you are transitioning from, because another thread may have done a layout change before you, and then your barrier will be wrong. The only way to do this is to not record the actual command buffers in the threads, but to record commands, and then create the command buffers when joining the commands and insert the right layout changes at that point. Of course, this defeats the purpose of recording command buffers in threads. Sure, you can add even more command buffers in the middle as transitional command buffers, but at some point many Vulkan implementations start disliking it when you have too many command buffers submitted, because it breaks driver optimizations.

But even then, there is a further problem with this approach: memory allocation. Command buffers use worst-case allocation because clearing/allocating the memory is expensive. If you don't have any kind of deterministic usage for them, all of them can potentially become worst case, leading to a massive amount of memory allocation.

In short, this is something an AAA game engine can do, where the environment of what is happening is very controlled and deterministic. It's not possible to do efficiently with a general purpose API that gives you a lot of freedom. And finally, rendering is moving to bindless to do the heavy lifting, which means the pressure on the rendering API is diminishing every year, making this sort of threaded access to it less and less needed. Betting on this type of approach these days is a dead end.
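To make the barrier problem concrete, here is a minimal sketch (illustrative, not Godot code) of the transition each thread would have to record - oldLayout has to name the layout the image is actually in at that point in the command stream, which a recording thread cannot know if another thread may have transitioned the image first:

```cpp
#include <vulkan/vulkan.h>

// Record a layout transition for an image. The caller MUST know the image's
// current layout (old_layout) - Vulkan does not track it for you.
// Access masks and precise stage flags are omitted for brevity.
void transition_image_layout(VkCommandBuffer cmd, VkImage image,
		VkImageLayout old_layout, VkImageLayout new_layout) {
	VkImageMemoryBarrier barrier = {};
	barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
	barrier.oldLayout = old_layout; // wrong value here = invalid barrier / undefined behaviour
	barrier.newLayout = new_layout;
	barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
	barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
	barrier.image = image;
	barrier.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
	barrier.subresourceRange.levelCount = 1;
	barrier.subresourceRange.layerCount = 1;

	vkCmdPipelineBarrier(cmd,
			VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
			0, 0, nullptr, 0, nullptr, 1, &barrier);
}
```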
-
I'm looking through renderingdevice_vulkan with a thought to sorting out thread safety there and making it so that it is easier to write efficient multithreaded rendering/compute operations.
As I understand it from chat, the aim is for:
So the current situation is:
So as I see it there are two options here:
1. Leave RenderingDevices as they are (i.e. each one has its own queue, but you can pretty strictly only submit one draw / compute list at a time), then implement a way of transferring textures between different (local/global) rendering devices, so that you can for example do compute on a separate thread in parallel with rendering, then efficiently transfer the texture over for rendering (while computing another one on the compute thread).
2. Make command buffers/pools be per thread. So e.g. you could have one thread calling draw_list_begin, queueing commands and calling draw_list_end, and another thread doing the same, or doing compute, and as long as the corresponding draw list IDs were only accessed on the thread they were created from, it would mostly work nicely.
There are advantages and disadvantages of the two approaches -
1. Has the advantage that each RenderingDevice gets its own queue, which for at least my typical use case (one queue for rendering, one queue for compute, loosely synchronised) means things can overall run together better. If you make loads of devices, though, there is going to be way more overhead.
2. Means keeping track of a bit more state, but does mean you don't hit hardware limits, as you're running everything off the same queue, whilst you can still multitask nicely while building and submitting command buffers. Compute tasks still won't run as nicely in parallel with rendering as they would if split off onto another device.
I think probably what would be nice would be to build option 2 (single queue, command buffer per thread), but then to add a minimal way of transferring a texture from local to global renderingdevices efficiently, so that you can use local renderingdevices to do offscreen rendering / compute, whilst having an efficient way to get these things back into the main godot renderer. That way people don't accidentally create tons of vulkan device queues, but if you need them, you can use them.
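As a very rough sketch of the per-thread command pool idea in option 2 (illustrative only - names like ThreadCommandState are made up, not actual Godot internals): each thread would own its own VkCommandPool, record its draw lists into buffers allocated from it, and only the final submission to the shared queue would need locking.

```cpp
#include <vulkan/vulkan.h>

// Hypothetical per-thread recording state for option 2: one pool per thread,
// all submissions still going to the single shared queue.
struct ThreadCommandState {
	VkCommandPool pool = VK_NULL_HANDLE;
	VkCommandBuffer current = VK_NULL_HANDLE;
};

// Called once, lazily, the first time a thread begins a draw list.
bool init_thread_command_state(VkDevice device, uint32_t queue_family, ThreadCommandState &state) {
	VkCommandPoolCreateInfo pool_info = {};
	pool_info.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
	pool_info.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
	pool_info.queueFamilyIndex = queue_family; // same family as the shared queue
	if (vkCreateCommandPool(device, &pool_info, nullptr, &state.pool) != VK_SUCCESS) {
		return false;
	}

	VkCommandBufferAllocateInfo alloc_info = {};
	alloc_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
	alloc_info.commandPool = state.pool;
	alloc_info.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
	alloc_info.commandBufferCount = 1;
	return vkAllocateCommandBuffers(device, &alloc_info, &state.current) == VK_SUCCESS;
}

// draw_list_begin on a given thread would then do something like:
//   thread_local ThreadCommandState tls_state;  // created lazily
//   vkBeginCommandBuffer(tls_state.current, ...);
// and draw_list_end would end the buffer and hand it to the (mutex-guarded)
// shared queue for submission.
```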