Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RenderingDevice - texture functions are no longer allowed in background threads due to thread guards. #99750

Closed
RPicster opened this issue Nov 27, 2024 · 22 comments

Comments

@RPicster
Copy link
Contributor

RPicster commented Nov 27, 2024

Tested versions

  • Reproducible in 4.4 (all dev versions)
  • Works reliable in 4.3

System information

Godot v4.4.dev5 - Windows 10.0.19045 - Single-window, 2 monitors - Vulkan (Forward+) - dedicated NVIDIA GeForce RTX 4080 (NVIDIA; 32.0.15.6094) - AMD Ryzen 7 3700X 8-Core Processor (16 threads)

Issue description

A small intro: I tried talking to @clayjohn and @BastiaanOlij about this behaviour, but so far I couldn't find out anything that enables me to move forward with it.

The setup in the MRP is what I essentially use in Engraving in different parts of the game to create an Image from a viewport texture in a asynchronous way, without stalling the game like a viewport.get_texture().get_image() would. For a lot of things it's not critical that the image is instantly available but it's more important to have a smooth game experience.

With the release of 4.4 this approach does not work at all anymore and keeps generating error messages, it works reliable in 4.3 stable.

My game is heavily depening on having a smooth way of creating an image from the players drawings to further process this image in real-time, so this change keeps me from updating the version and doing further testing on 4.4 (or even release in 4.4+).

If it is not possible to continue with this approach it would be greatly appreciated to find an alternative way of achieving async image generation from a viewport - which is also useful in a lot of other scenarios (vfx using viewports as textures/buffers).

Steps to reproduce

Checking out the MRP would be the easiest :)

Minimal reproduction project (MRP)

rd_copy_texture.zip

EDIT

After the discussions in this issue, the main challenge appeared to be getting texture / image data from the GPU to the CPU without a stall. It's not a regression per se - how I used to do what I needed to seemed to be a very flawed approach.
The best solution would be a new function, like suggested here by Dario or here by myself.

@BastiaanOlij
Copy link
Contributor

cc @DarioSamo , you might have some insight to this? I haven't had time to fully dive into what is going on here but I imagine we plugged some holes here especially when rendering runs on a separate thread.

@DarioSamo
Copy link
Contributor

DarioSamo commented Nov 27, 2024

cc @DarioSamo , you might have some insight to this? I haven't had time to fully dive into what is going on here but I imagine we plugged some holes here especially when rendering runs on a separate thread.

Yes, the main change that has been made, for thread-safety and correctness reasons, is that a lot of RenderingDevice calls are no longer allowed. The main reason is that RenderingDevice used to do a ton of concurrent work by just doing a Mutex basically everywhere and hoping for the best. This also meant that even if you had what looked like it was asynchronous generation working, it was in fact not asynchronous at all and merely slotting in the recording of the current frame wherever it could and getting done in the current frame, making your frame actually take longer than it should.

True asynchronous GPU work takes advantage of the fact that the hardware offers multiple queues to execute the work and has a much lesser impact on the work produced on the current frame. This is why 4.4 introduces the use of transfer queues that actually run in parallel to the game, which gave a huge improvement to loading speeds in V-Sync'd scenarios.

With the release of 4.4 this approach does not work at all anymore and keeps generating error messages, it works reliable in 4.3 stable.

The problem here is this gives the false appearance that it was working reliably, while in fact it was doing it in a way that introduces a ton of possible validation errors that it is practically impossible for us to assess. GPU Validation is a bad area to risk having problems because the problems are not reliably reproduced and might show up differently on different hardware.

A major change that 4.4 brings is the possibility to actually use different command queues that the driver offers, but RenderingDevice has no API to expose this at the moment. The closest we have is creating a local rendering device, but sadly that does not allow you to share resources with the main one as it does create an entirely new different driver in the background.

I realize this might be perceived as a regression in the eyes of users, but rolling back to the old design would keep hiding away the problem instead of addressing it properly. It was an explicit request by @reduz to introduce thread guards, so I think he should weigh in suggesting the API we should use to provide some functionality like this again to do it properly.

As for unblocking your approach in 4.4 in some way even if it's not the best solution, my suggestion would be to use the mechanism GDScript provides to defer calls to the main render thread. As I said, it was never actually processing the images in an asynchronous way anyway, as all it did was throw the processing that you requested into the main render queue for the current frame (which introduced several race conditions), so you can very much operate with a similar logic that just does deferred calls to the render thread and checks on the data on the next frame. This is what I mean by the fact that it wasn't truly asynchronous anyway, and it was pretty evident because assets loaded in background threads were never actually loaded and had to wait for V-Sync all the time as they'd just get thrown in whatever current frame was active and had to wait it out. The same problem was probably happening with the effect you mention.

@RPicster
Copy link
Contributor Author

RPicster commented Nov 27, 2024

Thanks for looking into it so quickly 👏
For me, the main difference was, that using the method in the MRP resulting in a much smoother experience vs using viewport.get_texture().get_image(). So, async or not - the result for me was the same, trading "getting the image immediately" for "having to stall something".

Could you give me a quick example on how I would implement "defer calls to the main render thread" or a working approach using mechanisms you describe? I can also reach out in the godot chat if that is preferred, but maybe it would be better to have it here in case someone else comes across this problem?

@DarioSamo
Copy link
Contributor

DarioSamo commented Nov 27, 2024

Could you give me a quick example on how I would implement "defer calls to the main render thread" or a working approach using mechanisms you describe? I can also reach out in the godot chat if that is preferred, but maybe it would be better to have it here in case someone else comes across this problem?

I think one big issue is that deferred calls might not actually help much if the use case is 'texture_get_data' because that function in particular has a bit of a design problem, where it has an extremely heavy amount of CPU work from copying the data back to RAM while the GPU itself has to do very little work. So if that function is now locked to being called in the render thread, it'll cause a big stall regardless of us moving to the more correct approach. One approach we discussed with @clayjohn is that functions such as get_data() would benefit from having asynchronous versions that you can call and query at a later point, allowing the data copy to happen independently from the rendering process once the GPU guarantees the texture or buffer is ready for it.

But to give you an example of something that was even worse, by using this function on a background thread, calling any function like get_data() basically requires flushing all of the queued frames of the game, pretty much demolishing any performance whatsoever the render thread was doing in being able to queue frames and keep the GPU busy! These are the kind of problems that not being explicit about what should be kept on the render thread causes, and it's completely hidden away from the user in the older approach.

Perhaps you can explain your problem space a bit better so we can avoid having to do this on the CPU and keep it entirely on the GPU instead. Is there a particular reason you need to download the image to the CPU and manipulate it there? Could it be perhaps achieved through compute shaders? And at what frequency do you need to fetch the viewport image. is it per frame or in an interval of time?

@RPicster
Copy link
Contributor Author

Perhaps you can explain your problem space a bit better so we can avoid having to do this on the CPU and keep it entirely on the GPU instead. Is there a particular reason you need to download the image to the CPU and manipulate it there? Could it be perhaps achieved through compute shaders? And at what frequency do you need to fetch the viewport image. is it per frame or in an interval of time?

Here is one example:

  • The player can draw the map inside a SubViewport
  • An image is created of this viewports texture
  • The image is converted to a bitmap.
  • grow_mask and opaque_to_polygons is then used to create a texture of filled spaces.
  • This texture is used in a shader to create the "filling" for the drawing.
    This happens while the player can draw and is saved / loaded when changing the sectors of the map (each sector having it's own map).
Godot_v4.4-dev5_win64_tLvo061xXo.mp4

In another example, visibility splats are generated when the player walks around which uncover the map.

  • When the player moves /rotates distance x, a new "splat" is generated, this is a "field of view" capture of their current view.
  • This "splat" is then converted to a "static" texture (viewport texture -> image -> texture) and added to another viewport as a sprite.
  • The splats will slowly fade out as new splats are added.

@DarioSamo
Copy link
Contributor

@RPicster I can see why you'd go the route of needing CPU manipulation if it's a gameplay mechanic.

Is the entire reason you went with the asynchronous approach because the get_data() routines were causing stalls? In that case it seems like the discussion should be less centered around being able to do rendering commands in a separate thread but rather, being able to request the data asynchronously. As you're clearly able to hide the latency for gameplay reasons, it sounds like you'd be better served by a scheme where you can do this sort of work in the main thread.

request_id = texture_get_data_async_request(); <- this gives you some sort of ID which you can use to query if the data is ready at a later point
... then during process on each later frame ...
texture_get_data_async_query(request_id, image); <- this allows you to query whether the texture's been copied to a data block that you can now copy off and process in a background thread

The main benefit of this approach would be that you no longer need to stall the GPU and wait for frames to be flushed compared to 4.3, which would give you higher performance.

@RPicster
Copy link
Contributor Author

Yes, you absolutely hit the nail on the head. In the end this was just a solution I found for the problem that is stalling.
something like:

func update_texture():
   viewport.get_texture().get_data_async(image_callback)

func image_callback(image:Image):
   ...

@DarioSamo
Copy link
Contributor

Alright, I'm not super aware about the details on how to implement something like this, but the idea has floated around long enough in our heads that it'd be useful to extend these methods to support this type of workflow. It seems 4.4's changes warrant they're basically a necessity at this point. CC @clayjohn

I don't think it'd take much work to get this done in time for 4.4 if it'd provide a solution to users who relied on this functionality to work around the lack of the function in the first place.

@RPicster
Copy link
Contributor Author

That would be a amazing. Thanks for all the feedback, I guess until anything new is ready I will get back to use get_image() and hope the experience is not too bad 👏

Should I close this issue then?

@DarioSamo
Copy link
Contributor

No, I'd suggest however making the title a bit clearer and update it with the solution we've discussed for any other maintainers who might see it and be unaware.

@DarioSamo DarioSamo changed the title RenderingDevice - texture_copy not working anymore RenderingDevice - texture functions are no longer allowed in background threads due to thread guards. Nov 27, 2024
@DarioSamo DarioSamo removed the bug label Nov 27, 2024
@darksylinc
Copy link
Contributor

darksylinc commented Nov 27, 2024

This would indeed need a new API.

In OgreNext we have AsyncTextureTicket which is conceptually similar to a Javascript promise: the user asks the engine it wants to download GPU data, and the engine returns a ticket (i.e. a handle).

The user now with that ticket can either chose to do a blocking download (which will stall), or get notified when the data is ready, receiving a callback, or poll the ticket until it returns ready.

Edit: Probably the only issue is how to determine that the ticket is no longer needed and can be thrown away (reference counted? user calls a free() function? the ticket becomes invalid after data has been obtained?).

Edit 2: In OgreNext tickets have 2 accuracy modes: inaccurate and accurate. Inaccurate means you have no problem always waiting a few frames (thus the engine can use the per frame fences to know if the download is ready). Accurate means we issue a fence to know when the download is ready. Useful when you need the data ASAP (but the user may still end up getting the data a few frames later if the GPU is too far behind the CPU, which increases the variance of waiting based on GPU load).

@clayjohn
Copy link
Member

@darksylinc I think we can make things quite simple for Godot because the current API just simply returns a PackedByteArray (which is problematic already because it requires two texture copies instead of one).

Right now you call texture_get_data(RID) and Godot will block until the data is ready, then copy it into a PackedByteArray and return it.

Instead we can add texture_get_data_async(RID, callback) which returns immediately, but calls the callback once the data is ready. The callback would always be a function with the structure func some_name(PackedByteArray) where the user can then do what they will with that array, the same as if they had called texture_get_data().

This implementation should be familiar to Godot users and very simple for us to implement. Later, we could even add an option to issue a fence or simple wait frame_count frames (could be an optional argument to texture_get_data_async()).

The downside of this approach is that it maintains our current approach of texture_get_data() copying into an intermediate buffer, then handing that buffer to the user to (potentially) copy into yet another buffer.

@clayjohn
Copy link
Member

@DarioSamo @RPicster With respect to calling things on the main thread. We actually have a mechanism for this built in already RenderingServer.call_on_render_thread which allows you to call any function on the rendering thread. All RD API functions that throw the thread guard warning should be called from inside a function wrapped in call_on_render_thread, then they should behave almost the same as before.

For an example of how that looks in practice, you can see my cloud demo: https://github.com/clayjohn/godot-volumetric-cloud-demo-v2/blob/c65cb2b6e5aa05205852bcf30844a8e600d683ba/cloud_sky/cloud_sky.gd#L117

@darksylinc
Copy link
Contributor

darksylinc commented Nov 27, 2024

@clayjohn On the risk of going slightly off topic: Since Dario's change broke multiple projects*; is it worth naming next version 4.4? But rather 5.0?

* The breaking change is not an API or ABI change but rather observed behavior in the details. But ultimately backwards compatibility is broken. As Dario mentioned, the way Godot allowed some users to do things in <= 4.3 would lead to crashes, device losts; and in a way that was very hard to track down. So it's not something that can easily be reverted.

@DarioSamo
Copy link
Contributor

DarioSamo commented Nov 27, 2024

@clayjohn On the risk of going slightly off topic: Since Dario's change broke multiple projects*; is it worth naming next version 4.4? But rather 5.0?

The amount of users who are direct users of RenderingDevice along with background threads is far too low to consider such a version change and we've done a lot worse breaking changes API-wise over the 4.x versions, so I'd say we're fine. Even on this particular case we've found the answer to the problem wasn't the fact that RenderingDevice has this new restriction but rather that the API was lacking a solution to download the image in the proper way. The OP's methodology has a ton of performance implications we wouldn't want to support in the long term anyway, as it effectively nullifies any amount of work that goes into optimizing our GPU throughput.

@clayjohn
Copy link
Member

@darksylinc We don't follow Semvar too strictly. We allow some amount of breakage between minor versions. Godot 5.0 will be when we want to majorly rework the architecture again and drop support for a lot of older and lower-end devices.

Also, backwards compatibility is only broken for certain projects that were abusing workarounds that we had in place due to our unfinished architecture. The original design of the RD wasn't thread safe. As users tried to call it from different threads, we added a ton of mutexes so their game wouldn't crash. But that wasn't our original design plan, that was a workaround to avoid crashes. Then, users began to rely on the mutexes and do more stuff from separate threads. Projects that are written to use our intended design (like my cloud project listed above) don't require any changes. To be clear, this is totally our fault, but I think it is an acceptable change for a minor version

@DarioSamo
Copy link
Contributor

There's now a work in progress PR to address this issue.

#100110

@clayjohn
Copy link
Member

clayjohn commented Dec 11, 2024

I've tested the MRP on 2 devices now to figure out what the issues are. I have run into a number of problems. But before that, I should point out that the MRP doesn't use multiple threads. So it doesn't even engage much of the problems discussed in the OP.

First, on all devices I get

E 0:00:00:0743   copy.gd:45 @ create_texture_copy(): Unable to convert array index 0 from "int" to "PackedByteArray".
  <C++ Error>    Method/function failed.
  <C++ Source>   core/variant/array.cpp:261 @ assign()
  <Stack Trace>  copy.gd:45 @ create_texture_copy()
                 copy.gd:21 @ _init()
                 test.gd:6 @ _ready()

This appears to be a regression in how TypedArrays are handled in GDScript. The PackedByte Array is getting converted into an array of integers instead of an Array of PackedByteArrays.

The error comes from this line of code:

rt_target_texture_rid  = rd.texture_create(format, view, rd.texture_get_data(rt_origin_texture_rid, 0))

This may never have actually been valid GDScript. At any rate the correct way to call this is with an array like so:

rt_target_texture_rid  = rd.texture_create(format, view, [rd.texture_get_data(rt_origin_texture_rid, 0)])

Next, on MacOS I get an error that depth textures can't use the TEXTURE_USAGE_CPU_READ_BIT. There is clearly a false positive in the metal backend since I double checked the format used and its not a depth texture.

E 0:00:00:0820   copy.gd:48 @ create_texture_copy(): Linear (TEXTURE_USAGE_CPU_READ_BIT) textures must not be a depth/stencil format
  <C++ Error>    Condition "ft != MTLFormatType::DepthStencil" is true. Returning: ERR_CANT_CREATE
  <C++ Source>   drivers/metal/rendering_device_driver_metal.mm:187 @ is_valid_linear()
  <Stack Trace>  copy.gd:48 @ create_texture_copy()
                 copy.gd:23 @ _init()
                 test.gd:8 @ _ready()

CC @stuartcarnie

Finally, on Linux in 4.4 I get the following:

E 0:00:00:0770   copy.gd:59 @ get_texture_data(): vkMapMemory failed with error -5.
  <C++ Error>    Condition "err" is true. Returning: nullptr
  <C++ Source>   drivers/vulkan/rendering_device_driver_vulkan.cpp:2030 @ texture_map()
  <Stack Trace>  copy.gd:59 @ get_texture_data()

This error goes away when removing the TEXTURE_USAGE_CPU_READ_BIT. I can reproduce this error as far back as 4.4 dev1 so it is most likely a regression from #90993 CC @darksylinc

@darksylinc
Copy link
Contributor

darksylinc commented Dec 11, 2024

This error goes away when removing the TEXTURE_USAGE_CPU_READ_BIT. I can reproduce this error as far back as 4.4 dev1 so it is most likely a regression from #90993 CC @darksylinc

Are you sure that's the right PR? That PR was very harmless as it only added debug routines for diagnostics.
The only thing that was controversial was the draw_list_begin() routine which was troublesome to keep API/ABI compatibility (and given you're having gdscript errors, I wouldn't rule it out this is the problem).

Next, on MacOS I get an error that depth textures can't use the TEXTURE_USAGE_CPU_READ_BIT. There is clearly a false positive in the metal backend since I double checked the format used and its not a depth texture.

That sounds like #98390, which remains unfixed as of today.

To check if it that's the problem, you can apply this patch (note: this is NOT a real fix. It's not a proper fix. But it will shut up validation):

diff --git a/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp b/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp
index 8367f6e29d..f073bf6676 100644
--- a/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp
+++ b/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp
@@ -409,7 +409,7 @@ RID RendererCanvasRenderRD::_create_base_uniform_set(RID p_to_render_target, boo
                RD::Uniform u;
                u.uniform_type = RD::UNIFORM_TYPE_TEXTURE;
                u.binding = 4;
-               u.append_id(state.shadow_texture);
+               u.append_id(state.shadow_depth_texture);
                uniforms.push_back(u);
        }
 
@@ -1898,10 +1898,10 @@ RendererCanvasRenderRD::RendererCanvasRenderRD() {
                tf.texture_type = RD::TEXTURE_TYPE_2D;
                tf.width = 4;
                tf.height = 4;
-               tf.usage_bits = RD::TEXTURE_USAGE_SAMPLING_BIT;
-               tf.format = RD::DATA_FORMAT_R32_SFLOAT;
+               tf.usage_bits = RD::TEXTURE_USAGE_SAMPLING_BIT|RD::TEXTURE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT;
+               tf.format = RD::DATA_FORMAT_D32_SFLOAT;
 
-               state.shadow_texture = RD::get_singleton()->texture_create(tf, RD::TextureView());
+               state.shadow_depth_texture = RD::get_singleton()->texture_create(tf, RD::TextureView());
        }
 
        {

@clayjohn
Copy link
Member

That sounds like #98390, which remains unfixed as of today.

I don't think so. That validation layer error has been around for a long time. It is present in 4.3.

The reason I think it is #90993 is because we switched from using vmaCreateImage to vkCreateImage with vmaAllocateMemoryForImage and vmaBindImageMemory2

Also, we switched texture_unmap() from using vkUnmapMemory() to vmaUnmapMemory(). Weirdly texture_map() uses both vkMapMemory() and vmaMapMemory(), does that seem right?

@RPicster
Copy link
Contributor Author

An udpate from my side: I tested #100110 and it absolutely solves the challenge I had in my project in a very graceful way.
I'm not sure if we should keep this issue open as there are still discussion going on.

@clayjohn clayjohn moved this from Very Bad to Not Critical in 4.x Release Blockers Dec 13, 2024
@clayjohn
Copy link
Member

We could probably split those other regressions I noticed into new bug reports so they can be tracked separately since the main reason for this report is now fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Not Critical
Development

No branches or pull requests

7 participants