Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional driver workaround to RenderingDevice for Adreno 6XX. #91514

Merged
merged 1 commit into from
May 14, 2024

Conversation

DarioSamo
Copy link
Contributor

@DarioSamo DarioSamo commented May 3, 2024

Fixes #90459. This was a perceived regression by the end user as #84976 was the one that introduced the crash. However, the reality turned out to be the graph exposed a driver bug we were successfully dodging in Mobile. This very well could explain why Forward+ was prone to crashing on this family of devices.

Quoting from the PR itself for the in-depth explanation.

Workaround for the Adreno 6XX family of devices.

There's a known issue with the Vulkan driver in this family of devices where it'll crash if a dynamic state for drawing is used in a command buffer before a dispatch call is issued. As both dynamic scissor and viewport are basic requirements for the engine to not bake this state into the PSO, the only known way to fix this issue is to reset the command buffer entirely.

As the render graph has no built in limitations of whether it'll issue compute work before anything needs to draw on the frame, and there's no guarantee that compute work will never be dependent on rasterization in the future, this workaround will end recording on the current command buffer any time a compute list is encountered after a draw list was executed. A new command buffer will be created afterwards and the appropriate synchronization primitives will be inserted.

Executing this workaround has the added cost of synchronization between all the command buffers that are created as well as all the individual submissions. This performance hit is accepted for the sake of being able to support these devices without limiting the design of the renderer.

TODO:

  • Implement the logic for enabling this workaround here only for the affected devices. Tagging @clayjohn as he has a device that can reproduce the issue.

@DarioSamo
Copy link
Contributor Author

Clay has added the device detection and confirmed this to be working.

@TCROC
Copy link
Contributor

TCROC commented May 3, 2024

I just tested and can confirm this PR fixes it! Great work everyone! Shoutout to @clayjohn @DarioSamo and @TokageItLab !! :)

@darksylinc
Copy link
Contributor

The device detection code has the same problem as checking for "Windows 9". It will falsely flag Adreno 6000 if it ever comes out.

The proper way to check is via vendor id + device id + name string. Like this:

if( physical_device_properties.vendorID == 0x5143 && // Qualcomm
     physical_device_properties.deviceID >= 0x6000000 && // Adreno 6xx
     physical_device_properties.deviceID < 0x7000000 &&
     strstr( physical_device_properties.deviceName, "Turnip" ) == nullptr
)
{
    // Enable workaround.
}

Of course there's still always the chance Qualcomm one day starts using unused deviceID entries in the range [0x6000000; 0x7000000) but it should narrow down the chance of false positives.

We still use string evaluation to rule out Turnip (i.e. Mesa's FOSS driver).

@DarioSamo
Copy link
Contributor Author

@darksylinc I agree with your comment, but I'll leave @clayjohn to make the decision on that as he coded the current detection. I have no problems with upgrading it to do that.

@clayjohn
Copy link
Member

clayjohn commented May 6, 2024

That sounds great to me!

I just copied what we do for OpenGL which comes with a TODO:

//Adreno 3xx Compatibility
const String rendering_device_name = String::utf8((const char *)glGetString(GL_RENDERER));
//TODO: Check the number between 300 and 399(?)
adreno_3xx_compatibility = (rendering_device_name.left(13) == "Adreno (TM) 3");

Edit: On second thought, this will not be trivial. It seems to rely on Vulkan-specific properties that we don't expose through the RD right now. So implementing @darksylinc's solution will require exposing a lot more information than we currently have access to.

@DarioSamo
Copy link
Contributor Author

DarioSamo commented May 7, 2024

Edit: On second thought, this will not be trivial. It seems to rely on Vulkan-specific properties that we don't expose through the RD right now. So implementing @darksylinc's solution will require exposing a lot more information than we currently have access to.

Vendor ID is already present (VENDOR_QUALCOMM = 0x5143). We can extend the current Device struct in RenderingContextDriver to provide the device ID. D3D12 has this information and can return it easily. I'd ping @stuartcarnie about Metal, but it might or might not need this as the set of hardware it runs on is more limited.

@darksylinc
Copy link
Contributor

This is a vulkan-specific workaround though. Shouldn't the flag be set at the Vk implementation level?

@DarioSamo
Copy link
Contributor Author

DarioSamo commented May 7, 2024

This is a vulkan-specific workaround though. Shouldn't the flag be set at the Vk implementation level?

It's a bit of a mixed bag as the workaround must be implemented at a level above the driver due to how the ARG works, so detecting the workaround could be one of two possibilities:

  • We add something to the RHI's interface to check if the workaround is required and only the Vulkan driver implements this check or
  • We just check if it's using the Vulkan RDD along with the device name/id detection.

@darksylinc
Copy link
Contributor

I think these workarounda should be centralized in a global section so they can be documented and analyzed if they're still needed.

e.g.

struct Workarounds
{
   // Explanation about the workaround.
   bool adreno_compute_clip = false;
};

Workarounds workarounds;

Thinking about proper design is pointless IMHO because by its very nature workarounds can be anything, anywhere and cross multiple isolation boundaries.

Thus a global bool is the best choice, all in one place.

@DarioSamo DarioSamo force-pushed the adreno_workaround branch from a830e16 to 94aef4b Compare May 9, 2024 18:28
@DarioSamo
Copy link
Contributor Author

DarioSamo commented May 9, 2024

I've pushed a new version of the workaround detection based on driver version instead. For this I moved the logic specifically to be a part of the Vulkan driver. The workarounds structure is now part of RenderingContextDriver's Device structure, which makes it pretty straightforward to only enable it where it's relevant and with as much information as possible.

The only additional question here is if we want to specifiy a driver version range or just enable this on all devices previous to this version. Right now, this PR is implementing the latter.

Tagging @clayjohn and @TCROC to test again if possible to confirm the detection works.

@TCROC
Copy link
Contributor

TCROC commented May 11, 2024

I've pushed a new version of the workaround detection based on driver version instead. For this I moved the logic specifically to be a part of the Vulkan driver. The workarounds structure is now part of RenderingContextDriver's Device structure, which makes it pretty straightforward to only enable it where it's relevant and with as much information as possible.

The only additional question here is if we want to specifiy a driver version range or just enable this on all devices previous to this version. Right now, this PR is implementing the latter.

Tagging @clayjohn and @TCROC to test again if possible to confirm the detection works.

I tested and it worked! :)

@DarioSamo DarioSamo force-pushed the adreno_workaround branch from bc72c0c to d5789e0 Compare May 13, 2024 13:20
@akien-mga akien-mga requested a review from clayjohn May 13, 2024 15:10
Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Let's go ahead and merge.

For maintainers, this shouldn't be cherrypicked

@akien-mga akien-mga merged commit db3c4a4 into godotengine:master May 14, 2024
16 checks passed
@akien-mga
Copy link
Member

Thanks!

@Alex2782
Copy link
Member

@clayjohn

p_device_properties.driverVersion < VK_MAKE_VERSION(512, 503, 0)

Somewhere I read that the VK_VERSION functions are intended only for the Vulkan API version. Device manufacturers often use their own bitshifting functions. This comparison is probably always true on every Adreno 6xx device (?)

On my Adreno 6xx (Android 13):

VARIANT = 4
MAJOR = 0
MINOR = 502
PATCH = 0

I also tried your itos solution to be sure: #92611 (comment)

//#ifdef ANDROID_ENABLED
	print_line("======== Workarounds ========");
	print_line("avoid_compute_after_draw: ", r_device.workarounds.avoid_compute_after_draw);
	print_line("avoid_render_graph_reorder: ", r_device.workarounds.avoid_render_graph_reorder);
	print_line("-----------------------------");
	print_line("name: ", r_device.name);
	print_line("vendor: ", r_device.vendor);
	print_line("deviceID: ", p_device_properties.deviceID);
	uint32_t variant = VK_API_VERSION_VARIANT(p_device_properties.driverVersion);
	uint32_t major = VK_API_VERSION_MAJOR(p_device_properties.driverVersion);
	uint32_t minor = VK_API_VERSION_MINOR(p_device_properties.driverVersion);
	uint32_t patch = VK_API_VERSION_PATCH(p_device_properties.driverVersion);
	print_line("driverVersion: ", vformat("%d.%d.%d.%d", variant, major, minor, patch));

	uint32_t version = p_device_properties.driverVersion;
	print_line("clayjohn driverVersion: ", vformat("%s.%s.%s.%s", 
		itos(VK_API_VERSION_VARIANT(version)), 
		itos(VK_API_VERSION_MAJOR(version)),
		itos(VK_API_VERSION_MINOR(version)), 
		itos(VK_API_VERSION_PATCH(version))));
//#endif // ANDROID_ENABLED
07-31 18:45:26.607  5605  5650 I godot   : Godot Engine v4.3.beta.custom_build.b87c5450b (2024-07-31 00:17:26 UTC) - https://godotengine.org
07-31 18:45:26.672  5605  5650 I godot   : ======== Workarounds ========
07-31 18:45:26.672  5605  5650 I godot   : avoid_compute_after_draw:  true
07-31 18:45:26.672  5605  5650 I godot   : avoid_render_graph_reorder:  false
07-31 18:45:26.673  5605  5650 I godot   : -----------------------------
07-31 18:45:26.673  5605  5650 I godot   : name:  Adreno (TM) 650
07-31 18:45:26.673  5605  5650 I godot   : vendor:  20803
07-31 18:45:26.673  5605  5650 I godot   : deviceID:  100990978
07-31 18:45:26.673  5605  5650 I godot   : driverVersion:  4.0.502.0
07-31 18:45:26.673  5605  5650 I godot   : clayjohn driverVersion:  4.0.502.0
07-31 18:45:26.674  5605  5650 I godot   : Vulkan 1.1.128 - Forward Mobile - Using Device #0: Qualcomm - Adreno (TM) 650

If it's wrong, maybe the solution is somewhere there: https://github.com/SaschaWillems/VulkanCapsViewer
app for gpuinfo: https://vulkan.gpuinfo.org/listdevices.php?platform=android

@clayjohn
Copy link
Member

@Alex2782 Here is the code that GPUinfo uses https://github.com/SaschaWillems/vulkan.gpuinfo.org/blob/1e6ca6e3c0763daabd6a101b860ab4354a07f5d3/functions.php#L294

It looks like NVidia and one other vendor differ from the standard Vulkan mapping in a known way. He just falls back to the Vulkan mapping for everyone else.

In this case, it looks like it might be using the old vulkan mapping (which excludes the variant version)

I ran a quick test with the old versions vs. the new versions

std::cout << VK_MAKE_VERSION(512, 503, 0) <<std::endl;
std::cout << VK_VERSION_MAJOR(2149543936) <<std::endl;
std::cout << VK_VERSION_MINOR(2149543936) <<std::endl;
std::cout << VK_VERSION_PATCH(2149543936) <<std::endl;
    
std::cout << VK_MAKE_API_VERSION(0, 512, 503, 0) <<std::endl;
std::cout << VK_API_VERSION_VARIANT(2149543936) <<std::endl;
std::cout << VK_API_VERSION_MAJOR(2149543936) <<std::endl;
std::cout << VK_API_VERSION_MINOR(2149543936) <<std::endl;
std::cout << VK_API_VERSION_PATCH(2149543936) <<std::endl;

This prints:

2149543936
512
503
0
2149543936
4
0
503
0

In hindsight, we are using the old VK_MAKE_VERSION instead of VK_MAKE_API_VERSION, so we should have been using the old functions anyway.

@darksylinc
Copy link
Contributor

Somewhere I read that the VK_VERSION functions are intended only for the Vulkan API version. Device manufacturers often use their own bitshifting functions. This comparison is probably always true on every Adreno 6xx device (?)

That is correct, VK_VERSION_* family of macros are meant to be used against Vulkan API versions, not driver versions.

However Qualcomm follows the same convention of 10 bits for major (>> 22), 10 for minor (>> 12), and 12 for patch (& 0xfff).

Later Vulkan added VK_API_VERSION_VARIANT which took away 3 bits from major. For this reason VK_MAKE_VERSION() is deprecated in favour of VK_MAKE_API_VERSION().

that's where the confusion is going on:

  • Qualcomm's Driver version is 512.502.0
  • 512 is the 10th bit of 'major' set.
  • But later major was split into 3 bits for variant and 7 bits for major.
  • 3rd bit of 'variant' is now set, which is 4. And all the other bits are 0.
  • So 512.502.0 becomes 4.0.502.0

Takeaways:

  • Current Godot's code using VK_MAKE_VERSION works, but it's confusion prone.
  • Godot should not use VK_MAKE_VERSION but rather its own function that handles each driver's shenanigans.
  • VK_API_VERSION_VARIANT / VK_API_VERSION_MAJOR / etc should not be used to decode driver versions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants