SIGSEGV in shadow test with Vulkan backend on Fedora #3377

Open
jimblandy opened this issue Jan 13, 2023 · 8 comments
Labels
api: vulkan (Issues with Vulkan), external: upstream (Issues happening in lower level APIs or platforms)

Comments

@jimblandy
Member

The following command crashes with a SIGSEGV on my machine:

$ WGPU_BACKEND=vulkan cargo test -p wgpu --example shadow

Stack trace

(gdb) where
#0  0x00007ffff4b6bff3 in memcpy (__len=40, __src=0x7ffff76754c0, __dest=0x0) at /usr/include/bits/string_fortified.h:29
#1  vk_common_CmdBeginDebugUtilsLabelEXT (_commandBuffer=0x7ffff0e7cfd0, pLabelInfo=<optimized out>) at ../src/vulkan/runtime/vk_debug_utils.c:200
#2  0x00007fffe880de8b in DispatchCmdBeginDebugUtilsLabelEXT () at /usr/src/debug/vulkan-validation-layers-1.3.216.0-2.fc37.x86_64/layers/generated/layer_chassis_dispatch.cpp:7971
#3  vulkan_layer_chassis::CmdBeginDebugUtilsLabelEXT () at /usr/src/debug/vulkan-validation-layers-1.3.216.0-2.fc37.x86_64/layers/generated/chassis.cpp:10034
#4  0x0000555555e133fe in ash::extensions::ext::debug_utils::DebugUtils::cmd_begin_debug_utils_label (self=0x7ffff0003608, command_buffer=..., label=0x7ffff76755f0) at /home/jimb/.cargo/registry/src/github.com-1ecc6299db9ec823/ash-0.37.2+1.3.238/src/extensions/ext/debug_utils.rs:71
#5  0x0000555555d5adb1 in wgpu_hal::vulkan::command::{impl#2}::begin_debug_marker (self=0x7ffff0e1d318, group_label=...) at wgpu-hal/src/vulkan/command.rs:562
#6  0x0000555555b9c591 in wgpu_core::hub::Global<wgpu_core::hub::IdentityManagerFactory>::command_encoder_push_debug_group<wgpu_core::hub::IdentityManagerFactory, wgpu_hal::vulkan::Api> (self=0x7ffff087a420, encoder_id=..., label=...) at wgpu-core/src/command/mod.rs:392
#7  0x00005555558f8b62 in wgpu::backend::direct::{impl#7}::command_encoder_push_debug_group (self=0x7ffff087a420, encoder=0x7ffff7675988, encoder_data=0x7ffff0e3e700, label=...) at wgpu/src/backend/direct.rs:1976
#8  0x00005555558e0f52 in wgpu::context::{impl#5}::command_encoder_push_debug_group<wgpu::backend::direct::Context> (self=0x7ffff087a420, encoder=0x7ffff7675e18, encoder_data=..., label=...) at wgpu/src/context.rs:2705
#9  0x000055555589c5c2 in wgpu::CommandEncoder::push_debug_group (self=0x7ffff7675e00, label=...) at wgpu/src/lib.rs:2839
#10 0x0000555555659d39 in shadow::{impl#2}::render (self=0x7ffff7676740, view=0x7ffff7676640, device=0x7ffff7676db0, queue=0x7ffff7676e58, _spawner=0x7ffff7676550) at wgpu/examples/shadow/main.rs:753
#11 0x000055555566088e in shadow::framework::test::{closure#0}<shadow::Example> (ctx=...) at wgpu/examples/shadow/../framework.rs:551
#12 0x0000555555678ccf in shadow::framework::test_common::initialize_test::{closure#1}<shadow::framework::test::{closure_env#0}<shadow::Example>> () at wgpu/examples/shadow/../../tests/common/mod.rs:295
#13 0x000055555565d503 in core::panic::unwind_safe::{impl#23}::call_once<(), shadow::framework::test_common::initialize_test::{closure_env#1}<shadow::framework::test::{closure_env#0}<shadow::Example>>> (self=..., _args=()) at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panic/unwind_safe.rs:271
#14 0x0000555555678f5f in std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<shadow::framework::test_common::initialize_test::{closure_env#1}<shadow::framework::test::{closure_env#0}<shadow::Example>>>, ()> (data=0x7ffff76779c0) at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:483
#15 0x000055555567900b in __rust_try ()
#16 0x0000555555678e3c in std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<shadow::framework::test_common::initialize_test::{closure_env#1}<shadow::framework::test::{closure_env#0}<shadow::Example>>>> (f=...) at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:447
#17 0x000055555566a063 in std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<shadow::framework::test_common::initialize_test::{closure_env#1}<shadow::framework::test::{closure_env#0}<shadow::Example>>>, ()> (f=...) at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panic.rs:137
#18 0x0000555555678130 in shadow::framework::test_common::initialize_test<shadow::framework::test::{closure_env#0}<shadow::Example>> (parameters=..., test_function=...) at wgpu/examples/shadow/../../tests/common/mod.rs:295
#19 0x00005555556603b7 in shadow::framework::test<shadow::Example> (params=...) at wgpu/examples/shadow/../framework.rs:509
#20 0x000055555565aa5b in shadow::shadow () at wgpu/examples/shadow/main.rs:849
#21 0x00005555556756c7 in shadow::shadow::{closure#0} () at wgpu/examples/shadow/main.rs:848
#22 0x00005555556657d5 in core::ops::function::FnOnce::call_once<shadow::shadow::{closure_env#0}, ()> () at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/ops/function.rs:251
#23 0x000055555584cf7f in core::ops::function::FnOnce::call_once<fn() -> core::result::Result<(), alloc::string::String>, ()> () at library/core/src/ops/function.rs:251
#24 test::__rust_begin_short_backtrace<core::result::Result<(), alloc::string::String>, fn() -> core::result::Result<(), alloc::string::String>> () at library/test/src/lib.rs:599
#25 0x000055555581da5c in test::run_test::{closure#1} () at library/test/src/lib.rs:590
#26 core::ops::function::FnOnce::call_once<test::run_test::{closure_env#1}, ()> () at library/core/src/ops/function.rs:251
#27 0x000055555584bfb8 in alloc::boxed::{impl#45}::call_once<(), (dyn core::ops::function::FnOnce<(), Output=core::result::Result<(), alloc::string::String>> + core::marker::Send), alloc::alloc::Global> () at library/alloc/src/boxed.rs:1987
#28 core::panic::unwind_safe::{impl#23}::call_once<core::result::Result<(), alloc::string::String>, alloc::boxed::Box<(dyn core::ops::function::FnOnce<(), Output=core::result::Result<(), alloc::string::String>> + core::marker::Send), alloc::alloc::Global>> () at library/core/src/panic/unwind_safe.rs:271
#29 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<alloc::boxed::Box<(dyn core::ops::function::FnOnce<(), Output=core::result::Result<(), alloc::string::String>> + core::marker::Send), alloc::alloc::Global>>, core::result::Result<(), alloc::string::String>> () at library/std/src/panicking.rs:483
#30 std::panicking::try<core::result::Result<(), alloc::string::String>, core::panic::unwind_safe::AssertUnwindSafe<alloc::boxed::Box<(dyn core::ops::function::FnOnce<(), Output=core::result::Result<(), alloc::string::String>> + core::marker::Send), alloc::alloc::Global>>> () at library/std/src/panicking.rs:447
#31 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<alloc::boxed::Box<(dyn core::ops::function::FnOnce<(), Output=core::result::Result<(), alloc::string::String>> + core::marker::Send), alloc::alloc::Global>>, core::result::Result<(), alloc::string::String>> () at library/std/src/panic.rs:137
#32 test::run_test_in_process () at library/test/src/lib.rs:622
#33 test::run_test::run_test_inner::{closure#0} () at library/test/src/lib.rs:516
#34 0x0000555555817fe4 in test::run_test::run_test_inner::{closure#1} () at library/test/src/lib.rs:543
#35 std::sys_common::backtrace::__rust_begin_short_backtrace<test::run_test::run_test_inner::{closure_env#1}, ()> () at library/std/src/sys_common/backtrace.rs:121
#36 0x000055555581d865 in std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure#0}<test::run_test::run_test_inner::{closure_env#1}, ()> () at library/std/src/thread/mod.rs:551
#37 core::panic::unwind_safe::{impl#23}::call_once<(), std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::run_test_inner::{closure_env#1}, ()>> () at library/core/src/panic/unwind_safe.rs:271
#38 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::run_test_inner::{closure_env#1}, ()>>, ()> () at library/std/src/panicking.rs:483
#39 std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::run_test_inner::{closure_env#1}, ()>>> () at library/std/src/panicking.rs:447
#40 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::run_test_inner::{closure_env#1}, ()>>, ()> () at library/std/src/panic.rs:137
#41 std::thread::{impl#0}::spawn_unchecked_::{closure#1}<test::run_test::run_test_inner::{closure_env#1}, ()> () at library/std/src/thread/mod.rs:550
#42 core::ops::function::FnOnce::call_once<std::thread::{impl#0}::spawn_unchecked_::{closure_env#1}<test::run_test::run_test_inner::{closure_env#1}, ()>, ()> () at library/core/src/ops/function.rs:251
#43 0x00005555561bed73 in alloc::boxed::{impl#45}::call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1987
#44 alloc::boxed::{impl#45}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1987
#45 std::sys::unix::thread::{impl#2}::new::thread_start () at library/std/src/sys/unix/thread.rs:108
#46 0x00007ffff7c3614d in start_thread (arg=<optimized out>) at pthread_create.c:442
#47 0x00007ffff7cb7a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Package versions installed

I'm running Fedora 37 on x86_64 with the following Vulkan RPMs installed:

vulkan-loader-1.3.216.0-3.fc37.x86_64
vulkan-tools-1.3.216.0-2.fc37.x86_64
vulkan-validation-layers-1.3.216.0-2.fc37.x86_64
vulkan-headers-1.3.216.0-2.fc37.noarch
mesa-vulkan-drivers-22.3.3-1.fc37.x86_64
@jimblandy
Member Author

Output from wgpu-info: wgpu-info-wgpu#2277.txt

@jimblandy changed the title from "SIGSEGV in shadow example with Vulkan backend on Fedora" to "SIGSEGV in shadow test with Vulkan backend on Fedora" on Jan 13, 2023
@jimblandy
Member Author

Since the stack trace shows the validation layers doing a memcpy to a null pointer, this may be a bug in the validation layers. (thx to cwfitzgerald)

@cwfitzgerald added the "external: upstream" and "api: vulkan" labels on Jan 13, 2023
@jimblandy
Member Author

I suspect that Mesa commit 662e05c9 is the fix for this.

@jimblandy
Member Author

Quoting my earlier comment: "Since the stack trace shows the validation layers doing a memcpy to a null pointer, this may be a bug in the validation layers. (thx to cwfitzgerald)"

It seems like, prior to the above fix, Mesa's vk_common_CmdEndDebugUtilsLabelEXT function, which is used to implement vkCmdEndDebugUtilsLabelEXT, blindly assumed that the label stack was non-empty, unconditionally subtracting the size of a VkDebugUtilsLabelEXT from command_buffer->labels.size. But the Vulkan spec says:

An application may open a debug label region in one command buffer and close it in another, or otherwise split debug label regions across multiple command buffers or multiple queue submissions.

This means that there's no guarantee that the specific command buffer we're calling vkCmdEndDebugUtilsLabelEXT on will have any labels. They could have been begun in a different command buffer.

Trying to remove the nonexistent label causes the byte size of the command buffer's label buffer to underflow to (1 << 32) - sizeof(VkDebugUtilsLabelEXT).

The next call to vk_common_CmdBeginDebugUtilsLabelEXT calls util_dynarray_grow_bytes, whose overflow check notices that adding another element would wrap the size back to zero, so it returns NULL. That NULL is then used as the destination for the memcpy (the __dest=0x0 in frame #0 of the stack trace), causing the SIGSEGV.

The Mesa commit above changes all the command buffer and queue label functions to use a utility function that checks whether the label buffer is empty before popping.
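
The arithmetic behind this failure can be sketched in a few lines of Rust. This is a simplified model under stated assumptions, not Mesa's code: `LabelStack`, `pop_unchecked`, `pop_checked`, and `grow_bytes` are hypothetical names, the byte size is assumed to be a 32-bit unsigned counter (as the underflow value above suggests), and the 40-byte label size is taken from the `__len=40` in frame #0 of the stack trace.

```rust
/// Size in bytes of one label entry; 40 matches `__len=40` in the backtrace.
const LABEL_SIZE: u32 = 40;

/// Hypothetical stand-in for a command buffer's label array.
/// Only the byte size is modeled, not the storage itself.
#[derive(Debug, Default)]
struct LabelStack {
    size: u32,
}

impl LabelStack {
    /// Models the buggy "end label" path: subtracts one element's size
    /// without checking whether the stack is empty (wraps like C unsigned).
    fn pop_unchecked(&mut self) {
        self.size = self.size.wrapping_sub(LABEL_SIZE);
    }

    /// Models the fixed behavior: popping an empty stack is a no-op.
    fn pop_checked(&mut self) {
        if self.size >= LABEL_SIZE {
            self.size -= LABEL_SIZE;
        }
    }

    /// Models growing the array for a "begin label". Returns the byte offset
    /// of the new element, or `None` (standing in for NULL) when adding one
    /// more element would overflow the 32-bit size.
    fn grow_bytes(&mut self) -> Option<u32> {
        let offset = self.size;
        self.size = self.size.checked_add(LABEL_SIZE)?;
        Some(offset)
    }
}
```

In this model, calling `pop_unchecked` on an empty stack leaves `size` at 2^32 - 40, so the following `grow_bytes` returns `None`, which corresponds to the NULL destination seen in the crash. With `pop_checked`, the size stays at zero and the next `grow_bytes` succeeds.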

@jimblandy
Member Author

It's hard to tell when the fix will make it into a new Mesa release. From what I can tell, it won't be included in the 23.0 branch, and we'll have to wait until 23.1 comes out in April to see a fix.

@jimblandy
Member Author

Hmm. When I use VK_LAYER_KHRONOS_validation built from the current Vulkan-ValidationLayers source (0bc70eac6), I get the following message. Is this our bug?

[2023-04-21T00:05:28Z ERROR wgpu_hal::vulkan::instance] VALIDATION [VUID-vkCmdEndDebugUtilsLabelEXT-commandBuffer-01912 (0x56146426)]
    	Validation Error: [ VUID-vkCmdEndDebugUtilsLabelEXT-commandBuffer-01912 ] Object 0: handle = 0x55c54716f620, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x56146426 | vkCmdEndDebugUtilsLabelEXT() called without a corresponding vkCmdBeginDebugUtilsLabelEXT first The Vulkan spec states: There must be an outstanding vkCmdBeginDebugUtilsLabelEXT command prior to the vkCmdEndDebugUtilsLabelEXT on the queue that commandBuffer is submitted to (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-vkCmdEndDebugUtilsLabelEXT-commandBuffer-01912)
[2023-04-21T00:05:28Z ERROR wgpu_hal::vulkan::instance] 	objects: (type: COMMAND_BUFFER, hndl: 0x55c54716f620, name: ?)
Segmentation fault (core dumped)

@jimblandy
Member Author

Okay, maybe the second time's the charm: Mesa commit 1c64952e seems to fix another aspect of this. That fix is not present on the 23.0 branch, but it is present in main at 92a7cba4f26 (2023-5-5), and the bug seems to be fixed there: I can run shadow there and only get the bogus VUID complaints, without the SIGSEGV.

@jimblandy
Member Author

We can't require users to update their copies of Mesa, so we're going to need a workaround for this. It should suffice to drop debug markers on the floor on Vulkan for specific Mesa versions.
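
One possible shape for such a version gate, as a rough sketch only. Everything here is an assumption for illustration: `parse_mesa_version` and `should_drop_debug_markers` are hypothetical helpers rather than existing wgpu APIs, the driver info string is assumed to look like "Mesa 22.3.3", and the 23.1.0 cutoff is a placeholder, since nothing in this thread establishes which release first contains both fixes.

```rust
/// Hypothetical helper: extracts (major, minor, patch) from a driver info
/// string such as "Mesa 22.3.3". The string format is an assumption.
fn parse_mesa_version(driver_info: &str) -> Option<(u32, u32, u32)> {
    let rest = driver_info.trim().strip_prefix("Mesa ")?;
    // Take the leading token, e.g. "22.3.3" from "22.3.3 (git-abcdef)".
    let version = rest.split_whitespace().next()?;
    let mut parts = version.split('.').map(|p| {
        // Keep only leading digits so components like "0-devel" still parse.
        let digits: String = p.chars().take_while(|c| c.is_ascii_digit()).collect();
        digits.parse::<u32>().ok()
    });
    let major = parts.next()??;
    let minor = parts.next()??;
    // A missing or non-numeric patch component is treated as 0.
    let patch = parts.next().flatten().unwrap_or(0);
    Some((major, minor, patch))
}

/// Placeholder cutoff: the first Mesa version assumed to contain the fixes.
/// This value is an assumption for illustration, not confirmed in this thread.
const ASSUMED_FIXED_VERSION: (u32, u32, u32) = (23, 1, 0);

/// Returns true if debug markers should be silently dropped.
/// Non-Mesa or unparseable driver strings are left alone.
fn should_drop_debug_markers(driver_info: &str) -> bool {
    match parse_mesa_version(driver_info) {
        Some(version) => version < ASSUMED_FIXED_VERSION,
        None => false,
    }
}
```

With these assumptions, `should_drop_debug_markers("Mesa 22.3.3")` is true, while a non-Mesa driver string is never affected. A real workaround would also need a lower bound for when the bug was introduced and some handling for development builds, neither of which is modeled here.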
