-
-
Notifications
You must be signed in to change notification settings - Fork 21.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skinned Skeleton3D calling add_child()
with specific classes cause crash on Android with directional shadow in forward+/mobile
#90459
Comments
Note that I'm not sure if it's mono and double related. That's just what I tested with. |
Here is a stack trace captured from debugging in Android Studio:
|
As we discussed on RocketChat. It seems like the actual crash is happening within the Vulkan drivers on the device. Unfortunately, you can't really debug drivers on Android devices. So we need to figure out the cause of the crash from outside the driver. We can do that in two ways:
I think between the two, we should get a pretty good idea of if we are doing something wrong. If we aren't doing something wrong, then it should at least help point us towards a workaround. |
Sounds good. I'm currently working on building the validation layers from source so we can also step through those and look for interesting info. But I can right now check what is being passed to vkCmdDispatch. I'll set that break point real quick |
Here is the stacktrace with vulkan validation enabled: T-CROC - Wed Apr 10 2024 18 58 25 GMT-0400 (Eastern Daylight Time).txt And here are the key points of interest (copied in from the chat in the godot dev server): Very first error is just indicating that I built validation layers in debug mode. So we can just ignore that. Then we get to the first error of interest:
And then not too long after that, this repeats over and over again until it crashes:
|
The Vulkan profile for the affected device is here: https://vulkan.gpuinfo.org/displayreport.php?id=16301 I can reproduce the validation layer error when using this profile. It appears that the crash is coming from the skeleton compute shader. I think there is a good chance that the crash is happening due to an issue with this specific skeleton. But we can't rule out that the Vulkan error is somehow related. So let's try to fix the validation layer error first and then re-evaluate and see if the crash goes away. For the validation error, it seems that it comes from not supporting linear samplers with
|
For further testing, can you try with this diff:
|
I just tested with those changes. I still get the same error leading up to a crash:
|
Ok so I found that in rendering_device_driver_vulkan.cpp sampler_create_info.magFilter = p_state.mag_filter == SAMPLER_FILTER_LINEAR ? VK_FILTER_NEAREST : VK_FILTER_NEAREST;
sampler_create_info.minFilter = p_state.min_filter == SAMPLER_FILTER_LINEAR ? VK_FILTER_NEAREST : VK_FILTER_NEAREST; Here is the stack trace:
|
I just tested with a t-pose model from mixamo and it also crashes: https://www.mixamo.com So it appears that skeletons in general are causing godot to crash on Android. Should we update the title of the issue to reflect our findings? |
Here is the MRP with the humanoid from mixamo that also causes the crash: |
And this is the logcat trace:
|
I just tested this with godot's player model here: https://github.com/godotengine/godot-demo-projects/blob/master/3d/platformer/player/player.glb And it also crashed Here's the new MRP using it |
After A LOT of debugging and tireless help from @clayjohn , I am reasonably confident to have bisected and determined this PR to be the root cause of the crash: #87888 Yay!!! :) Now I can sleep peacefully! Edit: Literally, its almost midnight and I'm very tired 😅 Excited to see what new discoveries await tomorrow! |
@demolke I can't test it because blender is broken to begin with. Can you provide a project output as glTF to isolate the problem? If you do encounter a crash there, I'm guessing from the error log that it probably has nothing to do with this issue, and you'll need to send it as another issue. |
add_child()
cause crash on Android with directional shadow in forward+/mobile
@TCROC I have changed the title based on the results of the previous tests, can you confirm that this is true? I would expect a crash on the Android in mobile renderer when calling If you can confirm that in a disable_deprecated build or revert #87888 build (it would be great if you could check each separately), we can be sure of that. FYI, #87888 does add_child internally on enter_tree, which appears to be what caused #87888 to crash, but the fundamental cause of this issue is definitely #84976. |
Yes I think this is a good title change. With the exception of the first MRP where skeletons + mesh in general are crashing. Everything else is add_child. I will do those tests tomorrow as well! :) |
@TokageItLab Here are the results of the requested tests: I tested the "disable_deprecated" and "revert #87888" separately. Both gave the same results. They are as follows: ✅ = works
✅ MRP Mesh Without Skeleton Child: https://github.com/godotengine/godot/files/14971317/example-working-mesh.zip So I'm beginning to think the MRPs that causes the crash may actually be parent skeleton adding child with skeleton AND parent skeleton adding child of ParticlesGPU3D |
add_child()
cause crash on Android with directional shadow in forward+/mobileadd_child()
with specific classes cause crash on Android with directional shadow in forward+/mobile
Thanks, I've removed it for now and will file an issue once I understand it better |
It may not be a bad idea to file a new issue with the blender example causing a crash. I filed this issue even with only a broad understanding and then we have slowly narrowed in on it. I think its useful to at least have things tracked so they don't get lost or forgotten. |
I figure I should give my two cents since it is suspected #84976 might be the cause of the issue. I gave some of the MRPs linked here a test on desktop at least in However, there's some flags you can play around with if you feel it should be more paranoid about it: // When true, the command graph will attempt to reorder the rendering commands submitted by the user based on the dependencies detected from
// the commands automatically. This should improve rendering performance in most scenarios at the cost of some extra CPU overhead.
//
// This behavior can be disabled if it's suspected that the graph is not detecting dependencies correctly and more control over the order of
// the commands is desired (e.g. debugging).
#define RENDER_GRAPH_REORDER 1
// Synchronization barriers are issued between the graph's levels only with the necessary amount of detail to achieve the correct result. If
// it's suspected that the graph is not doing this correctly, full barriers can be issued instead that will block all types of operations
// between the synchronization levels. This setting will have a very negative impact on performance when enabled, so it's only intended for
// debugging purposes.
#define RENDER_GRAPH_FULL_BARRIERS 0 When reordering is disabled, the ARG will basically fall back to the behavior of the previous version and submit commands in their original order. When full barriers are enabled, then it'll basically be completely paranoid about synchronization. However, given we have no reliance on stuff like indirect drawing or dispatches, I'm very suspicious of synchronization being a problem here that could cause a crash. At worst I could imagine reordering having some effect significant enough for a crash. There's a lot of degrees of separation here between what was originally reported and the issue that could be causing a crash so it's a bit tough to understand the possible cause to me yet. Instancing elements from scripts or what they're attached to shouldn't cause any visible difference to the renderer, as no hierarchy is visible from its side nor are the objects treated differently depending on their origin. So I think pretty much anything should be reproducible on a static version of the scene if it exists. My suggestion for now would be to see if you see any behavior change whatsoever messing with these options, but I figured it'd be worth noting that the pattern has usually been that the introduction of the ARG has exposed some long-standing bugs due to minimizing the amount of barriers and transitions necessary. Either Godot's own issues were exposed or more obscure driver issues that end up requiring workarounds (like how @akien-mga's setup required the introduction of buffer barriers as the driver seemingly did not follow global barriers very well). |
@DarioSamo I'm curious. Were you able to reproduce the crashes with the MRPs? And did you run into any issues with those MRPs or were they good? |
I just tested the "Skeleton mesh + directional light + shadows + bone attachment + add_child gpu particles 3d" on a pixel 4 with the following results: So clearly this is a mobile specific issue and something is getting through both our validation code and Vulkan's validation layers |
Thanks @clayjohn ! :) Should we update the label with "confirmed" instead of "needs testing"? Or is that not what those labels mean? |
Given Clay just got a crash I figure we can investigate it from here. And no, I did not run into any visual issues with them either. It's been brought to my attention from the call stack posted before the crash is actually at the driver level but not the GPU execution, so it's probably some situation the renderer causes on this particular driver. |
Interesting progress. I can't find anything unique about the dispatch that crashes. I have also tested out forcing the engine to rebind the pipeline and all uniform sets, but the crash persists. I noticed that both devices (the Samsung S10e and my device, the Pixel 4) have an Adreno 640 GPU. So I tested on a device with a Mali GPU and confirmed that the crash is not happening there. I have also tested on a device with an Adreno 530 GPU and it didn't crash there either. Looks like this is a driver bug after all that we are somehow triggering To refine my thoughts. Given that:
|
Small update on the ARG side of things. We have confirmed that this is indeed a driver bug in certain Adreno drivers (seems to be 6XX series). The exact bug is that if a scissor rect is set in a given command buffer, any compute commands dispatched after will crash. Technically compute should ignore the scissor rect, but on this driver version it doesn't. We are still investigating solutions. The reason this didn't crash previously is that in the mobile renderer the only compute work we do is skeletons and particles which are submitted at the beginning of the frame. So they always ran before the scissor rect was set. But now that we re-order work, it is possible for a graphics command to set the scissor rect before certain compute commands and cause this crash. |
Amazing work @clayjohn and everyone investigating!!! That sounds like an EXTREMELY hard edge case to detect! Very impressive! I look forward to seeing what solution or potential solutions we come up with! :) |
I've submitted #91514 for fixing this issue. At the moment however, the detection logic is not implemented. I pointed out the relevant line where it can be forced to use the workaround if you wish to check if it fixes the crash. It might be worth gathering the exact driver versions and device names that require this workaround. |
Judging from this report, it seems at least 620, 630, 640, and 650 are affected. I think that it is probably safe to just use the workaround for all Adreno 6XX devices |
Great work @DarioSamo and @clayjohn !! :) @DarioSamo is your PR ready to be tested yet? If so I can merge it into my fork and see how it goes. Or r you still touching some things up on it? |
Clay just added the device name detection for devices for Adreno 6XX devices. Give it a shot. |
I just tested and can confirm this PR fixes it! Great work everyone! Shoutout to @clayjohn @DarioSamo and @TokageItLab !! :) |
NOTE:
Significant research has gone into this issue. The initial report is a bit dated. See Summary of current findings below the Original Issue Report for more up to date details.
Original Issue Report
Tested versions
System information
Godot v4.3.dev.mono (a7b8602) - Pop!_OS 22.04 LTS - Wayland - Vulkan (Forward+) - dedicated Intel(R) Arc(tm) A750 Graphics (DG2) () - AMD Ryzen 7 5800X 8-Core Processor (16 Threads)
Issue description
When running on Android, if shadows are enabled on the directional light while my model is in the scene, it will crash. I tested with a simple CsgBox3d and it was fine, but my Blocky_Guy.blend model causes it to crash. Disabling shadows works fine tho.
Previous to the current master, it was actually working with shadows, but crashing when I eqipped additional items to the player's skeleton at runtime. So I am unsure if its specifically an issue with shadows, or an issue with how Godot handles my model on Android.
Steps to reproduce
Build godot with double and mono supportWe reproduced this without double and mono. Just standard godot causes it.This was specifically tested against Galaxy S10e model number: SM-G970U1
Edit: Also reproduced on Google Pixel
Minimal reproduction project (MRP)
example-crash.zip
Edit:
Here is an MRP using a standard t-pose from mixamo. So it does not appear to be specific to my model: #90459 (comment)
Edit 2:
Here is a current summary of the findings for easy reading
Summary of current findings
Renderers Tested:
forward+ = crash ❌
mobile = crash ❌
gl_compatibility = works ✅
Bug introducing PRs:
SkeletonModifier3D
as refactoring for nodes that may modifySkeleton3D
#87888MRP test summaries
MRP Details
Skeleton mesh + directional light + shadows
Pr that introduced the bug: #87888
Steps to reproduce:
Known workarounds:
add_child()
with specific classes cause crash on Android with directional shadow in forward+/mobile #90459 (comment)MRP with Blocky Ball model: https://github.com/godotengine/godot/files/14925776/example-crash.zip
MRP with Mixamo T Pose model: https://github.com/godotengine/godot/files/14948327/example-crash2.zip
MRP with Godot Character Model from examples repo: https://github.com/godotengine/godot/files/14951312/example-crash3.zip
IMPORTANT:
The following crash examples require #87888 to be reverted, fixed, or a commit checked out before it was in the repo. Otherwise, the following examples will crash on startup due to the fact they all require a skeleton and mesh present.
Skeleton mesh + directional light + shadows + changing skeleton + add_child in _process
Pr that introduced the bug: #84976
Steps to reproduce:
Known Workarounds
MRP with Blocky Ball model: https://github.com/godotengine/godot/files/14967690/example-crash-change-skeleton.zip
And here is the code for equipping:
Note that equipping in
_ready
works. Equipping in_process
after waiting for a second crashes. I tried equipping in_process
right away without a delay and that also worked.Skeleton mesh + directional light + shadows + bone attachment + add_child gpu particles 3d
Pr that introduced the bug: #84976
Steps to reproduce:
MRP:
example-crash-particles-3d.zip
Code for equipping:
Note: In this situation, equipping in _ready and _process both cause a crash
Known Workarounds
Skeleton mesh + directional light + shadows + add_child + bone attachment + child mesh with skeleton
Pr that introduced the bug: #84976
Steps to reproduce:
MRP:
example-crash-equip-mesh.zip
Code for equipping:
Note: In this situation, equipping in _ready works
Known Workarounds
Skeleton mesh + directional light + shadows + add_child + bone attachment + child mesh without skeleton
This situation actually works! ✅
Here is an MRP demonstrating it working:
example-working-mesh.zip
So if the model being added as a child of a bone attachment doesn't itself have a skeleton, it works fine. In our game, we allow users to equip pets as "hats" for fun. We want to play cool animations with them. Hence the reason we are equipping items with skeletons.
The text was updated successfully, but these errors were encountered: