multithreading in vulkan rendererdevice #7163
-
Incidentally, you can see the issue with only allowing one draw list per device by turning on multithreaded rendering and trying anything with draw lists - bad things happen any time you call draw_list_begin or draw_list_end on the main render device (in my case it usually crashes after about 3 or 4 frames).
-
Okay, current state of this now that I've had a good play with / look at the internals of the Vulkan RenderingDevice / context.

Current state of things

CPU side

The mutex locking works like this: draw_list_begin() takes the lock and draw_list_end() releases it. This means only one thread can be in between draw_list_begin and draw_list_end at a time.

GPU side

For single pass / non-split draw lists, every draw_list command goes into one of two Vulkan command buffers - either the rendering or the setup command buffer. This is then chucked into a single Vulkan queue. What this means is that if you submit a bunch of things to the global RenderingDevice, they will run in order of submission. This is inefficient. It also means that things which don't depend on each other may hold each other up until completed. Oh, and long-running compute tasks will hold up even rendering work on the same device that doesn't necessarily depend on them. This is a proper big performance issue for using the low-level rendering commands, even if you're not submitting from multiple threads.

Solutions

For performance purposes, the key improvement that could be made here is to submit each draw list as a separate Vulkan command buffer. There are two things to think about here - the Vulkan command pool and the command buffer. For anyone who doesn't know, the command pool is roughly the memory that command buffers are allocated out of. Both command pools and the command buffers allocated from them are externally synchronised, i.e. they can only be messed with from one thread at a time, and the application (us) is responsible for ensuring that. There are three options here:
Possible implementations

So, personally I think 1 or 3 are the best options. Implementation would be as follows:

For (1), make draw_list_begin create a new command buffer (from the existing command pool). draw_list_end would submit it with a fence, so we can tell when the command buffer is no longer in use. We could keep a set of command buffers to reuse, with submitted ones being checked to see if they can be reused when draw_list_begin is called. Mutex locking remains as it is currently.

For (3), we'd need a structure to keep track of command buffers and their associated pools, fences, mutexes etc. Something like the one below. Then we'd keep hold of a set of them, so as not to have to recreate everything on every call to draw_list_begin, only making new ones if there are no free ones left.
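A rough sketch of that tracking structure (illustrative only - the names, e.g. DrawListCommandBuffer and in_flight, are made up rather than actual Godot internals):

```cpp
#include <vulkan/vulkan.h>
#include <mutex>

// Hypothetical per-draw-list tracking structure (option 3 sketch).
// Each instance owns its own pool, so recording can happen on whichever
// single thread holds it without touching other draw lists' command memory.
struct DrawListCommandBuffer {
	VkCommandPool pool = VK_NULL_HANDLE;      // pool this buffer was allocated from
	VkCommandBuffer command_buffer = VK_NULL_HANDLE;
	VkFence fence = VK_NULL_HANDLE;           // signalled when the GPU has finished with it
	std::mutex mutex;                         // guards recording from the owning thread
	bool in_flight = false;                   // true between submit and fence signal

	// True if the GPU is done with this entry and it can be recycled
	// by the next draw_list_begin call.
	bool is_reusable(VkDevice device) const {
		return !in_flight || vkGetFenceStatus(device, fence) == VK_SUCCESS;
	}
};
```

draw_list_begin would then look through the set for an entry whose is_reusable() returns true (resetting it before handing it out), and only create a new one if none are free, matching the reuse scheme described above.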
@clayjohn - any comments on this welcome - I might have some time this weekend or next week to fiddle around if it makes sense.
-
Oh, one other thought on this - draw_list_end, draw_list_next_render_pass etc. are all missing a DrawListID parameter. I need to change that for multithreaded rendering to work, which makes this a GDScript API change. It will need doing anyway, however this goes.
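As a rough sketch of what that change looks like (assumed signatures for illustration only; other parameters are omitted and final names may differ):

```cpp
// Today (roughly): these operate on the single active draw list, so they
// take no DrawListID and cannot distinguish lists begun on different threads.
void draw_list_end();
void draw_list_next_render_pass();

// With the proposed change: every call names the draw list it operates on,
// so multiple threads can each drive their own list.
void draw_list_end(DrawListID p_draw_list);
void draw_list_next_render_pass(DrawListID p_draw_list);
```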
-
Some thoughts here.
So, to me, for the reasons mentioned above, this proposal does not add any real value and only makes the implementation significantly more complex for synchronization and validation.
-
There is also another problem with using a modern rendering API like Vulkan in a multi-threaded fashion for recording command buffers, one that is not obvious (I tried to do this at some point and missed it too): barrier layout changes.

If you use the same resource in multiple threads while recording command buffers and this requires a layout change, the layout change in Vulkan requires both the previous and the next layout in the command (because Vulkan itself is stateless). As such, you have to transition from the right layout to the right layout. If you do this in multiple threads, you have no idea what layout you are transitioning from, because another thread may have done a layout change before you, and then your barrier will be wrong. The only way to do this is to not record the actual command buffers in the threads, but to record commands, and then create the command buffers when joining the commands and insert the right layout changes at that point. Of course, this defeats the purpose of recording command buffers in threads. Sure, you can add even more command buffers in the middle as transitional command buffers, but at some point many Vulkan implementations start disliking it when you have too many command buffers submitted, because it breaks driver optimizations.

But even then, there is a further problem with this approach: memory allocation. Command buffers use worst-case allocation because clearing/allocating the memory is expensive. If you don't have any kind of deterministic usage for them, all of them can potentially become worst case, leading to a massive amount of memory allocation.

In short, this is something an AAA game engine can do, where the environment of what is happening is very controlled and deterministic. It's not possible to do efficiently with a general purpose API that gives you a lot of freedom. And finally, rendering is moving to bindless to do the heavy lifting, which means the pressure on the rendering API is diminishing every year, making this sort of threaded access to it less and less needed. Betting on this type of approach these days is a dead end.
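To make the barrier problem concrete, here is a minimal sketch (illustrative, not Godot code) of the transition each thread would have to record - oldLayout has to name the layout the image is actually in at that point in the command stream, which a recording thread cannot know if another thread may have transitioned the image first:

```cpp
#include <vulkan/vulkan.h>

// Record a layout transition for an image. The caller MUST know the image's
// current layout (old_layout) - Vulkan does not track it for you.
// Access masks and precise stage flags are omitted for brevity.
void transition_image_layout(VkCommandBuffer cmd, VkImage image,
		VkImageLayout old_layout, VkImageLayout new_layout) {
	VkImageMemoryBarrier barrier = {};
	barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
	barrier.oldLayout = old_layout; // wrong value here = invalid barrier / undefined behaviour
	barrier.newLayout = new_layout;
	barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
	barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
	barrier.image = image;
	barrier.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
	barrier.subresourceRange.levelCount = 1;
	barrier.subresourceRange.layerCount = 1;

	vkCmdPipelineBarrier(cmd,
			VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
			0, 0, nullptr, 0, nullptr, 1, &barrier);
}
```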
-
I'm looking through renderingdevice_vulkan with a thought to sorting out thread safety there and making it so that it is easier to write efficient multithreaded rendering/compute operations.
As I understand it from chat, the aim is for:
So the current situation is:
So as I see it there are two options here:
1. Leave RenderingDevices as they are (i.e. each one has its own queue, but you can pretty strictly only submit one draw / compute list at a time), then implement a way of transferring textures between different (local/global) rendering devices, so that you can for example do compute on a separate thread in parallel with rendering, then efficiently transfer the texture over for rendering (while computing another one on the compute thread).
2. Make command buffers/pools be per thread. So e.g. you could have one thread calling draw_list_begin, queueing commands and calling draw_list_end, and another thread doing the same, or doing compute, and as long as the corresponding draw list IDs were only accessed on the thread they were created from, it would mostly work nicely.
There are advantages and disadvantages of the two approaches -
1. Has the advantage that each RenderingDevice gets its own queue, which for at least my typical use case (one queue for rendering, one queue for compute, loosely synchronised) means things can overall run together better. If you make loads of devices, though, there is going to be way more overhead.
2. Means keeping track of a bit more state, but does mean you don't hit hardware limits, as you're running everything off the same queue, whilst you can still multitask nicely while building and submitting command buffers. Compute tasks still won't run as nicely in parallel with rendering as they would if split off onto another device.
I think probably what would be nice would be to build option 2 (single queue, command buffer per thread), but then to add a minimal way of transferring a texture from local to global renderingdevices efficiently, so that you can use local renderingdevices to do offscreen rendering / compute, whilst having an efficient way to get these things back into the main godot renderer. That way people don't accidentally create tons of vulkan device queues, but if you need them, you can use them.
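As a very rough sketch of the per-thread command pool idea in option 2 (illustrative only - names like ThreadCommandState are made up, not actual Godot internals): each thread would own its own VkCommandPool, record its draw lists into buffers allocated from it, and only the final submission to the shared queue would need locking.

```cpp
#include <vulkan/vulkan.h>

// Hypothetical per-thread recording state for option 2: one pool per thread,
// all submissions still going to the single shared queue.
struct ThreadCommandState {
	VkCommandPool pool = VK_NULL_HANDLE;
	VkCommandBuffer current = VK_NULL_HANDLE;
};

// Called once, lazily, the first time a thread begins a draw list.
bool init_thread_command_state(VkDevice device, uint32_t queue_family, ThreadCommandState &state) {
	VkCommandPoolCreateInfo pool_info = {};
	pool_info.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
	pool_info.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
	pool_info.queueFamilyIndex = queue_family; // same family as the shared queue
	if (vkCreateCommandPool(device, &pool_info, nullptr, &state.pool) != VK_SUCCESS) {
		return false;
	}

	VkCommandBufferAllocateInfo alloc_info = {};
	alloc_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
	alloc_info.commandPool = state.pool;
	alloc_info.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
	alloc_info.commandBufferCount = 1;
	return vkAllocateCommandBuffers(device, &alloc_info, &state.current) == VK_SUCCESS;
}

// draw_list_begin on a given thread would then do something like:
//   thread_local ThreadCommandState tls_state;  // created lazily
//   vkBeginCommandBuffer(tls_state.current, ...);
// and draw_list_end would end the buffer and hand it to the (mutex-guarded)
// shared queue for submission.
```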