content_shell crashes with Vulkan error on startup with nvidia driver #863

Closed
phuang opened this issue Feb 25, 2022 · 41 comments · Fixed by #890

Comments

@phuang
Contributor

phuang commented Feb 25, 2022

content_shell crashes with the latest Vulkan loader and NVIDIA driver. It seems to be related to 0a19663.

See https://bugs.chromium.org/p/chromium/issues/detail?id=1299378 for details.

@phuang changed the title from "content_shell crashes with Vulkan error on startup" to "content_shell crashes with Vulkan error on startup with nvidia driver" on Feb 25, 2022

@charles-lunarg
Collaborator

Is this related to @null77's comments on PR #856? They appear to have been reported at nearly the same time.

@MarkY-LunarG
Collaborator

I've made some changes, let me know if the latest master works for you.

@MarkY-LunarG
Collaborator

Also, how do I see this issue in the future? Can I just pull down ANGLE and force the loader commit for it? Then how do I run a good test? I'd appreciate knowing so I can run future changes that touch environment variables through this.

@null77
Contributor

null77 commented Mar 1, 2022

The latest roll succeeded so it looks like your changes resolved the problem, thanks!

https://autoroll.skia.org/r/vulkan-deps-angle-autoroll

Can you send an email via internal channels to ask about a process for testing? We had discussed it before and haven't yet come up with a good solution.

@MarkY-LunarG
Collaborator

Actually, my concern is this change: https://github.com/MarkY-LunarG/Vulkan-Loader/tree/add_new_env_vars

It's also modifying the environment variable code and adding a new additive environment variable. That's what I wanted to verify works for you all before I push it up.

@phuang
Contributor Author

phuang commented Mar 15, 2022

Sorry for the late reply. I just tested the latest Vulkan loader, and the problem still happens.

@phuang
Contributor Author

phuang commented Mar 15, 2022

I found that the crash is at this line:

return icd_term->dispatch.GetPhysicalDeviceSurfaceSupportKHR(

The crash is in the NVIDIA driver (we see the crash with the Intel driver as well).
It seems the arguments passed to the driver method are invalid.

#0  0x00007f19e66c6f9d in  () at /lib/x86_64-linux-gnu/libnvidia-glcore.so.470.74
#1  0x00007f19da980c2d in DispatchGetPhysicalDeviceSurfaceSupportKHR() ()
    at ../../third_party/vulkan-deps/vulkan-validation-layers/src/layers/vk_layer_utils.h:426
#2  0x00007f19da888d7f in GetPhysicalDeviceSurfaceSupportKHR() ()
    at ../../third_party/vulkan-deps/vulkan-validation-layers/src/layers/generated/chassis.cpp:5528
#3  0x00007f19eccb18c5 in vkGetPhysicalDeviceSurfaceSupportKHR () at ../../third_party/vulkan-deps/vulkan-loader/src/loader/wsi.c:231
#4  0x00007f1a0f3822bf in operator() () at ../../gpu/vulkan/vulkan_function_pointers.h:91
#5  vkGetPhysicalDeviceSurfaceSupportKHR () at ../../gpu/vulkan/vulkan_function_pointers.h:503
#6  Initialize() () at ../../gpu/vulkan/vulkan_surface.cc:115
#7  0x00007f19f61439ed in Initialize() () at ../../components/viz/service/display_embedder/skia_output_device_vulkan.cc:281
#8  0x00007f19f61438f5 in Create() () at ../../components/viz/service/display_embedder/skia_output_device_vulkan.cc:42
#9  0x00007f19f6127797 in InitializeForVulkan() () at ../../components/viz/service/display_embedder/skia_output_surface_impl_on_gpu.cc:1668
#10 0x00007f19f611cb22 in Initialize() () at ../../components/viz/service/display_embedder/skia_output_surface_impl_on_gpu.cc:1513
#11 0x00007f19f611c734 in Create() () at ../../components/viz/service/display_embedder/skia_output_surface_impl_on_gpu.cc:280
#12 0x00007f19f6112546 in InitializeOnGpuThread() () at ../../components/viz/service/display_embedder/skia_output_surface_impl.cc:910

@null77
Contributor

null77 commented Mar 17, 2022

@MarkY-LunarG can you look carefully at 0a19663? We're starting to get blocked on multiple fronts because we haven't been able to update the Khronos DEPS in Chromium for several months.

@MarkY-LunarG
Collaborator

So, I'm running with Nvidia and Intel for non-ANGLE runs and don't see any issues. This function call is also made by vkcube and works for me.

What in the above call stack is invalid: the physical device, the surface, or both?

It honestly looks to me like your call chain got messed up. There should be a call to terminator_GetPhysicalDeviceSurfaceSupportKHR before the driver, and that's missing.
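
To illustrate the expected ordering, here is a minimal sketch of a call chain in which a terminator always sits directly above the driver. It is a simplified model, not the loader's code: every type and function name below is hypothetical, and the "frames" are plain strings rather than real stack frames.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified model of a loader call chain.
 * None of these names are real Vulkan-Loader symbols. */
#define MAX_TRACE 8

typedef struct CallTrace {
    const char *frames[MAX_TRACE];
    int count;
} CallTrace;

static void trace_push(CallTrace *t, const char *frame) {
    assert(t->count < MAX_TRACE);
    t->frames[t->count++] = frame;
}

/* Driver entry point: the last frame in a healthy chain. */
static int driver_surface_support(CallTrace *t) {
    trace_push(t, "driver");
    return 1;
}

/* Terminator: the loader's last stop before handing off to the driver. */
static int terminator_surface_support(CallTrace *t) {
    trace_push(t, "terminator");
    return driver_surface_support(t);
}

/* Layer (for example a validation layer): forwards the call down the chain. */
static int layer_surface_support(CallTrace *t) {
    trace_push(t, "layer");
    return terminator_surface_support(t);
}

/* Trampoline: the entry point the application calls. */
static int trampoline_surface_support(CallTrace *t) {
    trace_push(t, "trampoline");
    return layer_surface_support(t);
}

/* Returns 1 when the frame right before "driver" is "terminator". */
static int terminator_precedes_driver(const CallTrace *t) {
    for (int i = 1; i < t->count; i++) {
        if (strcmp(t->frames[i], "driver") == 0) {
            return strcmp(t->frames[i - 1], "terminator") == 0;
        }
    }
    return 0;
}
```

Calling trampoline_surface_support() on an empty trace records trampoline, layer, terminator, driver, so terminator_precedes_driver() returns 1. A trace that goes from a layer straight to the driver, as in the backtrace above, makes it return 0.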

Also, your source lines don't match mine for the loader. What branch/tag are you building from?

@MarkY-LunarG
Collaborator

Also, have you tried disabling the Nvidia layers? I think you mentioned there was at least one running. If you run with VK_LOADER_DEBUG=driver,layer you should see output and know how to turn any implicit layers off.

@phuang
Contributor Author

phuang commented Mar 18, 2022

I am on commit 6b3cb37, but I had added many fprintf() calls for debugging, so the line numbers were not meaningful. I have uploaded a new stack with all fprintf() calls removed. Please also check the log captured with VK_LOADER_DEBUG=driver,layer.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/penghuang/sources/chromium/src/out/Release/chrome --type=gpu-process --no'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fadfcb4318d in ?? () from /lib/x86_64-linux-gnu/libnvidia-glcore.so.470.103.01
[Current thread is 1 (Thread 0x7fae0793a540 (LWP 93919))]
(gdb) bt
#0  0x00007fadfcb4318d in  () at /lib/x86_64-linux-gnu/libnvidia-glcore.so.470.103.01
#1  0x00007fadf0dfcc2d in DispatchGetPhysicalDeviceSurfaceSupportKHR() ()
    at ../../third_party/vulkan-deps/vulkan-validation-layers/src/layers/vk_layer_utils.h:426
#2  0x00007fadf0d04d7f in GetPhysicalDeviceSurfaceSupportKHR() ()
    at ../../third_party/vulkan-deps/vulkan-validation-layers/src/layers/generated/chassis.cpp:5528
#3  0x00007fae257bb2bf in operator() () at ../../gpu/vulkan/vulkan_function_pointers.h:91
#4  vkGetPhysicalDeviceSurfaceSupportKHR () at ../../gpu/vulkan/vulkan_function_pointers.h:503
#5  Initialize() () at ../../gpu/vulkan/vulkan_surface.cc:115
#6  0x00007fae0c57b9ed in Initialize() () at ../../components/viz/service/display_embedder/skia_output_device_vulkan.cc:281
#7  0x00007fae0c57b8f5 in Create() () at ../../components/viz/service/display_embedder/skia_output_device_vulkan.cc:42
#8  0x00007fae0c55f797 in InitializeForVulkan() () at ../../components/viz/service/display_embedder/skia_output_surface_impl_on_gpu.cc:1668
#9  0x00007fae0c554b22 in Initialize() () at ../../components/viz/service/display_embedder/skia_output_surface_impl_on_gpu.cc:1513
#10 0x00007fae0c554734 in Create() () at ../../components/viz/service/display_embedder/skia_output_surface_impl_on_gpu.cc:280
#11 0x00007fae0c54a546 in InitializeOnGpuThread() () at ../../components/viz/service/display_embedder/skia_output_surface_impl.cc:910

https://gist.github.com/phuang/ffaa0695fba55aedb736de6682398185

@MarkY-LunarG
Collaborator

I just built Angle. Is there a simple repro case I can try?

@phuang
Contributor Author

phuang commented Mar 18, 2022

Try ./out/Debug/angle_end2end_tests. I can reproduce the crash with it; see the log below.
BTW, ANGLE builds and uses the vulkan-loader in angle/third_party/vulkan-deps/vulkan-loader.

penghuang@penghuang-linux:~/sources/angle$ ./out/Debug/angle_end2end_tests 
1 GPUs:
  0 - NVIDIA device id: 0x1382, revision id: 0xA2, system device id: 0x0
       Driver Vendor: Nvidia
       Driver Version: 470.103.01

Active GPU: 0

Optimus: false
AMD Switchable: false
Mac Switchable: false
Needs EAGL on Mac: false


Skipping tests using configuration ES2_OpenGLES_NoFixture because it is not available.
Skipping tests using configuration ES2_OpenGLES because it is not available.
Skipping tests using configuration ES3_OpenGLES because it is not available.
Skipping tests using configuration ES3_OpenGLES_NoFixture because it is not available.
Skipping tests using configuration ES3_1_OpenGLES because it is not available.
Skipping tests using configuration ES2_OpenGLES_EmulateCopyTexImage2DFromRenderbuffers because it is not available.
Skipping tests using configuration ES3_OpenGLES_EmulateCopyTexImage2DFromRenderbuffers because it is not available.
Skipping tests using configuration ES1_OpenGLES because it is not available.
Skipping tests using configuration ES3_1_OpenGLES_NoFixture because it is not available.
Skipping tests using configuration ES2_OpenGLES_EmulatedVAOs because it is not available.
Skipping tests using configuration ES3_OpenGLES_EmulatedVAOs because it is not available.
[==========] Running 26155 tests from 418 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from EGLAndroidFrameBufferTargetTest
[ RUN      ] EGLAndroidFrameBufferTargetTest.MatchFramebufferTargetConfigs/ES2_Vulkan
No results file specified.

Signal 11 [Segmentation fault]:
Backtrace:
angle::PrintStackBacktrace() at crash_handler_posix.cpp:453
angle::Handler(int) at crash_handler_posix.cpp:616
__restore_rt at sigaction.c:?
vk_optimusGetDeviceProcAddr at ??:?
terminator_GetPhysicalDeviceSurfaceSupportKHR at wsi.c:261
/usr/bin/addr2line: '/home/penghuang/sources/angle/angledata/../libVkLayer_khronos_validation.so': No such file
vkGetPhysicalDeviceSurfaceSupportKHR at wsi.c:227
rx::RendererVk::selectPresentQueueForSurface(rx::DisplayVk*, VkSurfaceKHR_T*, unsigned int*) at RendererVk.cpp:2546
rx::WindowSurfaceVk::initializeImpl(rx::DisplayVk*) at SurfaceVk.cpp:825
rx::WindowSurfaceVk::initialize(egl::Display const*) at SurfaceVk.cpp:801
egl::Surface::initialize(egl::Display const*) at Surface.cpp:204
egl::Display::createWindowSurface(egl::Config const*, long, egl::AttributeMap const&, egl::Surface**) at Display.cpp:1240
egl::CreateWindowSurface(egl::Thread*, egl::Display*, egl::Config*, long, egl::AttributeMap const&) at egl_stubs.cpp:270
EGL_CreateWindowSurface at entry_points_egl_autogen.cpp:164
EGLWindow::initializeSurface(OSWindow*, angle::Library*, ConfigParameters const&) at EGLWindow.cpp:511
ANGLETestBase::ANGLETestSetUp() at ANGLETest.cpp:729
ANGLETestWithParam<angle::PlatformParameters>::SetUp() at ANGLETest.h:635
void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) at gtest.cc:2631
void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) at gtest.cc:2686
testing::Test::Run() at gtest.cc:2704
testing::TestInfo::Run() at gtest.cc:2888
testing::TestSuite::Run() at gtest.cc:3040
testing::internal::UnitTestImpl::RunAllTests() at gtest.cc:5898
bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) at gtest.cc:2631
bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) at gtest.cc:2686
testing::UnitTest::Run() at gtest.cc:5464
RUN_ALL_TESTS() at gtest.h:2492
angle::TestSuite::run() at TestSuite.cpp:1723
main at angle_end2end_tests_main.cpp:50
__libc_start_main at libc-start.c:332
_start at ??:?

@MarkY-LunarG
Collaborator

I get this when I run that test:

Test skipped: !IsEGLDisplayExtensionEnabled(mDisplay, "EGL_ANDROID_framebuffer_target").

How do I force it to run?

@MarkY-LunarG
Collaborator

Here's my full output for the one test:

((1e6643d55...))] $ ./out/Debug/angle_end2end_tests --gtest_filter=EGLAndroidFrameBufferTargetTest.MatchFramebufferTargetConfigs/ES2_Vulkan
2 GPUs:
  0 - Intel device id: 0x3E9B, revision id: 0x0, system device id: 0x0
  1 - NVIDIA device id: 0x1F91, revision id: 0xA1, system device id: 0x0

Active GPU: 1

Optimus: true
AMD Switchable: false
Mac Switchable: false
Needs EAGL on Mac: false


Skipping tests using configuration ES2_OpenGLES_NoFixture because it is not available.
Skipping tests using configuration ES2_OpenGLES because it is not available.
Skipping tests using configuration ES3_OpenGLES because it is not available.
Skipping tests using configuration ES3_OpenGLES_NoFixture because it is not available.
Skipping tests using configuration ES3_1_OpenGLES because it is not available.
Skipping tests using configuration ES2_OpenGLES_EmulateCopyTexImage2DFromRenderbuffers because it is not available.
Skipping tests using configuration ES3_OpenGLES_EmulateCopyTexImage2DFromRenderbuffers because it is not available.
Skipping tests using configuration ES1_OpenGLES because it is not available.
Skipping tests using configuration ES3_1_OpenGLES_NoFixture because it is not available.
Skipping tests using configuration ES2_OpenGLES_EmulatedVAOs because it is not available.
Skipping tests using configuration ES3_OpenGLES_EmulatedVAOs because it is not available.
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

Note: Google Test filter = EGLAndroidFrameBufferTargetTest.MatchFramebufferTargetConfigs/ES2_Vulkan
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from EGLAndroidFrameBufferTargetTest
[ RUN      ] EGLAndroidFrameBufferTargetTest.MatchFramebufferTargetConfigs/ES2_Vulkan
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

Test skipped: !IsEGLDisplayExtensionEnabled(mDisplay, "EGL_ANDROID_framebuffer_target").
[       OK ] EGLAndroidFrameBufferTargetTest.MatchFramebufferTargetConfigs/ES2_Vulkan (353 ms)
[----------] 1 test from EGLAndroidFrameBufferTargetTest (353 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (355 ms total)
[  PASSED  ] 1 test.

@phuang
Contributor Author

phuang commented Mar 18, 2022

Could you please go to third_party/vulkan-deps/vulkan-loader and check whether the vulkan-loader has commit 0a19663?

Also, are you using the open-source NVIDIA driver? I am using the driver from https://www.nvidia.com/download/index.aspx.
I am also using an X server rather than Wayland.

@phuang
Contributor Author

phuang commented Mar 18, 2022

I saw that you have an Intel GPU. Could you please try forcing the NVIDIA GPU (by removing the Intel Vulkan driver .so file)?

@MarkY-LunarG
Collaborator

MarkY-LunarG commented Mar 18, 2022

If I force Nvidia (VK_ICD_FILENAMES or VK_DRIVER_FILES) I still see the skipped test with no errors. I am using the proprietary driver for Nvidia which uses "libGLX_nvidia.so.0" and is at API version 1.3.194.
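
For readers unfamiliar with the two variables named above, the following is a minimal sketch of how a driver override could be chosen. It assumes the newer VK_DRIVER_FILES name wins over the older VK_ICD_FILENAMES when both are set, and that an empty value counts as unset. The function names and the /tmp path are hypothetical, and this is not the loader's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not a Vulkan-Loader function.
 * Assumption: the newer variable takes precedence, and an empty
 * string is treated the same as an unset variable. */
static const char *pick_driver_override(const char *driver_files,
                                        const char *icd_filenames) {
    if (driver_files != NULL && driver_files[0] != '\0') {
        return driver_files;
    }
    if (icd_filenames != NULL && icd_filenames[0] != '\0') {
        return icd_filenames;
    }
    return NULL; /* no override: fall back to the normal manifest search */
}

/* Reads both variables from the process environment. */
static const char *driver_override_from_env(void) {
    return pick_driver_override(getenv("VK_DRIVER_FILES"),
                                getenv("VK_ICD_FILENAMES"));
}

/* Small self-check showing the assumed precedence. */
static void check_driver_override(void) {
    const char *newer = "/usr/share/vulkan/icd.d/nvidia_icd.json";
    const char *older = "/tmp/other_icd.json"; /* hypothetical path */
    assert(pick_driver_override(newer, older) == newer);
    assert(pick_driver_override(NULL, older) == older);
    assert(pick_driver_override("", NULL) == NULL);
}
```

When an override is returned, only the listed manifest is considered, which matches the single "In following folders" entry in the driver search output below.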

I do see this warning message a few times when I turn on loader debugging:

WARNING | LAYER: loader_add_layer_properties: Can not find 'layer' object in manifest JSON file angledata/VkICD_mock_icd.json.  Skipping this file.

Here's the loader view of the vkCreateInstance and vkCreateDevice callstack from the attempts:

WARNING | LAYER: loader_add_layer_properties: Can not find 'layer' object in manifest JSON file angledata/VkICD_mock_icd.json.  Skipping this file.
DRIVER: Searching for driver manifest files
DRIVER:    In following folders:
DRIVER:       /usr/share/vulkan/icd.d/nvidia_icd.json
DRIVER:    Found the following files:
DRIVER:       /usr/share/vulkan/icd.d/nvidia_icd.json
DRIVER: Found ICD manifest file /usr/share/vulkan/icd.d/nvidia_icd.json, version "1.0.0"
LAYER | DEBUG: Loading layer library angledata/../libVkLayer_khronos_validation.so
LAYER | INFO: Insert instance layer VK_LAYER_KHRONOS_validation (angledata/../libVkLayer_khronos_validation.so)
LAYER: vkCreateInstance layer callstack setup to:
LAYER:    <Application>
LAYER:      ||
LAYER:    <Loader>
LAYER:      ||
LAYER:    VK_LAYER_KHRONOS_validation
LAYER:            Type: Explicit
LAYER:            Manifest: angledata/VkLayer_khronos_validation.json
LAYER:            Library:  angledata/../libVkLayer_khronos_validation.so
LAYER:      ||
LAYER:    <Drivers>

LAYER | DEBUG: Loading layer library angledata/../libVkLayer_khronos_validation.so
LAYER | INFO: Inserted device layer VK_LAYER_KHRONOS_validation (angledata/../libVkLayer_khronos_validation.so)
LAYER: vkCreateDevice layer callstack setup to:
LAYER:    <Application>
LAYER:      ||
LAYER:    <Loader>
LAYER:      ||
LAYER:    VK_LAYER_KHRONOS_validation
LAYER:            Type: Explicit
LAYER:            Manifest: angledata/VkLayer_khronos_validation.json
LAYER:            Library:  angledata/../libVkLayer_khronos_validation.so
LAYER:      ||
LAYER:    <Device>

Test skipped: !IsEGLDisplayExtensionEnabled(mDisplay, "EGL_ANDROID_framebuffer_target").

And for master, I am using commit ab207b0829a4ccd8c1bc2d47dc6c3c32afc4b7ca from Github's KhronosGroup/Vulkan-Loader where the original is v1.3.2081 from Chromium's mirror of KhronosGroup/Vulkan-Loader

@MarkY-LunarG
Collaborator

I also tried forcing IsEGLDisplayExtensionEnabled to just return true, and I see errors, but reasonable ones:

INFO: EGL ERROR: eglGetConfigAttrib: EGL_ANDROID_framebuffer_target is not enabled.
../../src/tests/egl_tests/EGLAndroidFrameBufferTargetTest.cpp:38: Failure
Expected equality of these values:
  static_cast<EGLBoolean>(1)
    Which is: 1
  static_cast<EGLBoolean>(l_eglGetConfigAttrib(display, config, attrib, &value))
    Which is: 0
INFO: EGL ERROR: eglGetConfigAttrib: EGL_ANDROID_framebuffer_target is not enabled.
../../src/tests/egl_tests/EGLAndroidFrameBufferTargetTest.cpp:38: Failure
Expected equality of these values:
  static_cast<EGLBoolean>(1)
    Which is: 1
  static_cast<EGLBoolean>(l_eglGetConfigAttrib(display, config, attrib, &value))
    Which is: 0

Is there a way to turn on the extension legitimately?

@phuang
Contributor Author

phuang commented Mar 18, 2022

The crash should happen before the !IsEGLDisplayExtensionEnabled(mDisplay, "EGL_ANDROID_framebuffer_target") check. When I synced vulkan-loader to the version before 0a19663, I got the same error as you, so you don't need to focus on it.

@phuang
Contributor Author

phuang commented Mar 18, 2022

I got the loader log below. There is a VK_LAYER_MESA_device_select layer in it, but it is not in yours. Maybe it is related. Do you know where it comes from? Could you please try installing it?

LAYER | DEBUG: Loading layer library angledata/../libVkLayer_khronos_validation.so
LAYER | INFO: Insert instance layer VK_LAYER_KHRONOS_validation (angledata/../libVkLayer_khronos_validation.so)
LAYER | DEBUG: Loading layer library libVkLayer_MESA_device_select.so
LAYER | INFO: Insert instance layer VK_LAYER_MESA_device_select (libVkLayer_MESA_device_select.so)
LAYER: vkCreateInstance layer callstack setup to:
LAYER:    <Application>
LAYER:      ||
LAYER:    <Loader>
LAYER:      ||
LAYER:    VK_LAYER_MESA_device_select
LAYER:            Type: Implicit
LAYER:                Disable Env Var:  NODEVICE_SELECT
LAYER:            Manifest: /usr/share/vulkan/implicit_layer.d/VkLayer_MESA_device_select.json
LAYER:            Library:  libVkLayer_MESA_device_select.so
LAYER:      ||
LAYER:    VK_LAYER_KHRONOS_validation
LAYER:            Type: Explicit
LAYER:            Manifest: angledata/VkLayer_khronos_validation.json
LAYER:            Library:  angledata/../libVkLayer_khronos_validation.so
LAYER:      ||
LAYER:    <Drivers>

LAYER | DEBUG: Loading layer library angledata/../libVkLayer_khronos_validation.so
LAYER | INFO: Inserted device layer VK_LAYER_KHRONOS_validation (angledata/../libVkLayer_khronos_validation.so)
LAYER | DEBUG: Loading layer library libVkLayer_MESA_device_select.so
LAYER | INFO: Failed to find vkGetDeviceProcAddr in layer libVkLayer_MESA_device_select.so
LAYER: vkCreateDevice layer callstack setup to:
LAYER:    <Application>
LAYER:      ||
LAYER:    <Loader>
LAYER:      ||
LAYER:    VK_LAYER_KHRONOS_validation
LAYER:            Type: Explicit
LAYER:            Manifest: angledata/VkLayer_khronos_validation.json
LAYER:            Library:  angledata/../libVkLayer_khronos_validation.so
LAYER:      ||
LAYER:    <Device>

@MarkY-LunarG
Collaborator

I've tried it both ways. I was just disabling things to try to make the scenario different to see if it triggered. Here are my command lines:

No mesa layer, Nvidia forced:

 NODEVICE_SELECT=1 VK_DRIVER_FILES=/usr/share/vulkan/icd.d/nvidia_icd.json VK_LOADER_DEBUG=layer,driver ./out/Debug/angle_end2end_tests --gtest_filter=EGLAndroidFrameBufferTargetTest.MatchFramebufferTargetConfigs/ES2_Vulkan

Mesa layer, no driver forced:

VK_LOADER_DEBUG=layer,driver ./out/Debug/angle_end2end_tests --gtest_filter=EGLAndroidFrameBufferTargetTest.MatchFramebufferTargetConfigs/ES2_Vulkan

I've tried multiple combinations as well.

@phuang
Contributor Author

phuang commented Mar 18, 2022

According to my tests, the problem is definitely caused by commit 0a19663. Could you check that commit and maybe add more diagnostic code on top of it, so that I can help you test it in my environment? Hopefully you can find the root cause.

@MarkY-LunarG
Collaborator

I've tried it both ways (enabled/disabled). You can disable it by defining the "Disable Env Var" for that implicit layer. So in my run I did it in a single line:

NODEVICE_SELECT=1  VK_LOADER_DEBUG=layer,driver ./out/Debug/angle_end2end_tests --gtest_filter=EGLAndroidFrameBufferTargetTest.MatchFramebufferTargetConfigs/ES2_Vulkan
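
As a rough illustration of the "Disable Env Var" mechanism, the sketch below treats an implicit layer as enabled unless its disable variable is set. It assumes that any non-empty value disables the layer; the struct and function names are hypothetical and do not come from the loader source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical description of an implicit layer. */
typedef struct ImplicitLayer {
    const char *name;
    const char *disable_env; /* the "Disable Env Var" shown in the loader log */
} ImplicitLayer;

/* Assumption: any non-empty value disables the layer. */
static int enabled_given_value(const char *disable_value) {
    return disable_value == NULL || disable_value[0] == '\0';
}

/* Looks up the layer's disable variable in the process environment. */
static int implicit_layer_enabled(const ImplicitLayer *layer) {
    assert(layer != NULL && layer->disable_env != NULL);
    return enabled_given_value(getenv(layer->disable_env));
}
```

With a layer described as {"VK_LAYER_MESA_device_select", "NODEVICE_SELECT"}, running with NODEVICE_SELECT=1 in the environment would make implicit_layer_enabled() return 0 under these assumptions.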

Has anyone else on your end reproduced this separately?

@phuang
Contributor Author

phuang commented Mar 21, 2022

I tried disabling the device_select layer and the validation layer; neither helps.
We got 202 crash reports from Chrome beta users before reverting vulkan-loader to the old version. Below is the crash distribution by Linux distribution. They all seem to be fairly recent versions. What are your Linux distribution and version? Maybe it is related to the desktop environment.

Rank	Distribution	Share	Crashes
1	Fedora Linux 35 (Workstation Edition)	41.58%	84
2	Ubuntu 21.04	26.73%	54
3	Debian GNU/Linux 11 (bullseye)	12.87%	26
4	Pop!_OS 21.10	12.38%	25
5	Ubuntu Jammy Jellyfish (development branch)	4.95%	10
6	Debian GNU/Linux rodete	0.99%	2
7	Ubuntu 21.10	0.50%	1
Total:	100.00%	202

@MarkY-LunarG
Collaborator

MarkY-LunarG commented Mar 21, 2022

Weird, I'm on FC 34... Are the failures grouped with a particular GPU vendor?

@phuang
Contributor Author

phuang commented Mar 24, 2022

FYI, I just reproduced the crash with vkcube as well.

Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/vkcube...
(No debugging symbols found in /usr/bin/vkcube)
(gdb) run
Starting program: /usr/bin/vkcube 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Selected GPU 0: NVIDIA GeForce GTX 745, type: 2

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff534218d in ?? () from /lib/x86_64-linux-gnu/libnvidia-glcore.so.470.103.01
(gdb) bt
#0  0x00007ffff534218d in ?? () from /lib/x86_64-linux-gnu/libnvidia-glcore.so.470.103.01
#1  0x00007ffff7e29570 in terminator_GetPhysicalDeviceSurfaceSupportKHR (physicalDevice=0x5555556aa3a0, queueFamilyIndex=0, 
    surface=0x5555558dad80, pSupported=0x55555591a790) at ../../third_party/vulkan-deps/vulkan-loader/src/loader/wsi.c:261
#2  0x00007ffff7e2936a in vkGetPhysicalDeviceSurfaceSupportKHR (physicalDevice=0x5555556aa4a0, queueFamilyIndex=0, 
    surface=0x5555558dad80, pSupported=0x55555591a790) at ../../third_party/vulkan-deps/vulkan-loader/src/loader/wsi.c:227
#3  0x00005555555589a5 in ?? ()
#4  0x00007ffff7be37fd in __libc_start_main (main=0x555555557770, argc=1, argv=0x7fffffffdfa8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdf98) at ../csu/libc-start.c:332
#5  0x000055555555a77a in ?? ()

@phuang
Contributor Author

phuang commented Mar 24, 2022

I added some fprintf calls to print the arguments passed to CreateXcbSurfaceKHR and GetPhysicalDeviceSurfaceSupportKHR (see the diff and log below). It looks like the loader created 4 surfaces with 4 different icd_term->dispatch tables, but it then called GetPhysicalDeviceSurfaceSupportKHR with the third icd_term->dispatch and the first VkSurface. Do you know why?

diff --git a/loader/wsi.c b/loader/wsi.c
index 38067519c..172e96d9c 100644
--- a/loader/wsi.c
+++ b/loader/wsi.c
@@ -258,6 +258,10 @@ VKAPI_ATTR VkResult VKAPI_CALL terminator_GetPhysicalDeviceSurfaceSupportKHR(VkP
 
     VkIcdSurface *icd_surface = (VkIcdSurface *)(uintptr_t)surface;
     if (NULL != icd_surface->real_icd_surfaces && (VkSurfaceKHR)NULL != icd_surface->real_icd_surfaces[phys_dev_term->icd_index]) {
+        fprintf(stderr, "EEE GetPhysicalDeviceSurfaceSupportKHR &icd_term->dispatch = %p\n", &icd_term->dispatch);
+        fprintf(stderr, "EEE GetPhysicalDeviceSurfaceSupportKHR phys_dev_term->phys_dev=%p \n", phys_dev_term->phys_dev);
+        fprintf(stderr, "EEE GetPhysicalDeviceSurfaceSupportKHR i=%d surface=%p\n", phys_dev_term->icd_index, icd_surface->real_icd_surfaces[phys_dev_term->icd_index]);
+        fprintf(stderr, "EEE GetPhysicalDeviceSurfaceSupportKHR queueFamilyIndex=%d \n", queueFamilyIndex);
         return icd_term->dispatch.GetPhysicalDeviceSurfaceSupportKHR(
             phys_dev_term->phys_dev, queueFamilyIndex, icd_surface->real_icd_surfaces[phys_dev_term->icd_index], pSupported);
     }
@@ -833,6 +837,8 @@ VKAPI_ATTR VkResult VKAPI_CALL terminator_CreateXcbSurfaceKHR(VkInstance instanc
                 if (VK_SUCCESS != vkRes) {
                     goto out;
                 }
+                fprintf(stderr, "EEE CreateXcbSurfaceKHR &icd_term->dispatch = %p\n", &icd_term->dispatch);
+                fprintf(stderr, "EEE CreateXcbSurfaceKHR i=%d surface=%p\n", i, pIcdSurface->real_icd_surfaces[i]);
             }
         }
     }
penghuang@penghuang-linux:~/sources/angle/out/Debug$ ninja -j 32 && LD_LIBRARY_PATH=. vkcube
[2/2] SOLINK ./libvulkan.so.1
Selected GPU 0: NVIDIA GeForce GTX 745, type: 2
EEE CreateXcbSurfaceKHR &icd_term->dispatch = 0x55fe63437980
EEE CreateXcbSurfaceKHR i=0 surface=0x55fe6343d1e0
EEE CreateXcbSurfaceKHR &icd_term->dispatch = 0x55fe63433050
EEE CreateXcbSurfaceKHR i=1 surface=0x55fe6343d2e0
EEE CreateXcbSurfaceKHR &icd_term->dispatch = 0x55fe632bcd40
EEE CreateXcbSurfaceKHR i=2 surface=0x55fe631cd538
EEE CreateXcbSurfaceKHR &icd_term->dispatch = 0x55fe632b7860
EEE CreateXcbSurfaceKHR i=3 surface=0x55fe63437530
EEE GetPhysicalDeviceSurfaceSupportKHR &icd_term->dispatch = 0x55fe632bcd40
EEE GetPhysicalDeviceSurfaceSupportKHR phys_dev_term->phys_dev=0x55fe63434c88 
EEE GetPhysicalDeviceSurfaceSupportKHR i=0 surface=0x55fe6343d1e0
EEE GetPhysicalDeviceSurfaceSupportKHR queueFamilyIndex=0 
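
The mismatch visible in the log above (the dispatch table of driver 2 paired with the surface stored at index 0) can be modeled with a small sketch. The field names icd_index and real_icd_surfaces follow the diff; the struct layouts, the integer stand-ins for handles, and both helper functions are hypothetical simplifications rather than the loader's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

#define ICD_COUNT 4

/* Stand-in for one driver's terminator state (dispatch table omitted). */
typedef struct IcdTerm {
    int driver_id;
} IcdTerm;

/* Stand-in for a physical device as seen by the terminator. */
typedef struct PhysDevTerm {
    const IcdTerm *icd_term; /* driver that owns this physical device */
    size_t icd_index;        /* index used to pick the per-driver surface */
} PhysDevTerm;

/* Stand-in for the loader-side surface: one driver surface per ICD. */
typedef struct IcdSurface {
    int real_icd_surfaces[ICD_COUNT];
} IcdSurface;

/* Returns 1 when icd_index selects the same driver as icd_term. */
static int index_is_consistent(const IcdTerm icds[ICD_COUNT],
                               const PhysDevTerm *pd) {
    assert(pd->icd_index < ICD_COUNT);
    return &icds[pd->icd_index] == pd->icd_term;
}

/* Picks the per-driver surface the same way the diff above does. */
static int surface_for_device(const IcdSurface *surface,
                              const PhysDevTerm *pd) {
    assert(pd->icd_index < ICD_COUNT);
    return surface->real_icd_surfaces[pd->icd_index];
}
```

In this model, a physical device whose icd_term points at driver 2 but whose icd_index is 0 fails index_is_consistent() and receives the surface created by driver 0, which is the combination the log shows being passed to the driver.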

@MarkY-LunarG
Collaborator

That all looks fine. I've got a more drastic idea. I've been working on generating more output for driver info. What if you modified your "vulkan-loader" checkout to use the 'gen_all_tramp_term' branch from my fork: https://github.com/MarkY-LunarG/Vulkan-Loader/tree/gen_all_tramp_term

cd angle/third_party/vulkan-deps/vulkan-loader/src
git remote add marky git@github.com:MarkY-LunarG/Vulkan-Loader.git
git fetch --all --prune
git checkout gen_all_tramp_term

Then rebuild and, before you run, enable loader debug output with export VK_LOADER_DEBUG=driver,layer. Then run the crashing test. Maybe we'll see something more useful?

When done, you can always git remote rm marky and it should remove my fork on that system.

Do you know of anyone whose system can reproduce this issue and that we could access locally here (Colorado), or perhaps log into remotely?

@phuang
Contributor Author

phuang commented Mar 24, 2022

FYI, I also tried changing icd_index to 2 (phys_dev_term->icd_index = 2;) in terminator_GetPhysicalDeviceSurfaceSupportKHR(), and then vkcube works fine. So the problem is that the loader uses the wrong icd_index. I will try your branch and upload the output shortly.

BTW, are you using chat.google.com or another IM? Maybe I can share access to my workstation with you.

@phuang
Contributor Author

phuang commented Mar 24, 2022

The log with the gen_all_tramp_term branch:
https://gist.github.com/phuang/0f199cc237b2830111f3782232997083

@phuang
Contributor Author

phuang commented Mar 24, 2022

BTW, I also tested a standalone Vulkan-Loader checkout of your branch, and it works fine. So the problem could be related to some build configuration.
See the log diff below.

$ diff -u log ~/sources/angle/out/Debug/log 
--- log	2022-03-24 11:21:16.588628355 -0400
+++ /home/penghuang/sources/angle/out/Debug/log	2022-03-24 11:19:18.377513169 -0400
@@ -2,7 +2,6 @@
 LAYER:               In following folders:
 LAYER:                  /home/penghuang/.config/vulkan/implicit_layer.d
 LAYER:                  /etc/xdg/vulkan/implicit_layer.d
-LAYER:                  /home/penghuang/sources/KhronosGroup/Vulkan-Loader/build/install/etc/vulkan/implicit_layer.d
 LAYER:                  /etc/vulkan/implicit_layer.d
 LAYER:                  /home/penghuang/.local/share/vulkan/implicit_layer.d
 LAYER:                  /usr/share/gnome/vulkan/implicit_layer.d
@@ -15,7 +14,6 @@
 DRIVER:              In following folders:
 DRIVER:                 /home/penghuang/.config/vulkan/icd.d
 DRIVER:                 /etc/xdg/vulkan/icd.d
-DRIVER:                 /home/penghuang/sources/KhronosGroup/Vulkan-Loader/build/install/etc/vulkan/icd.d
 DRIVER:                 /etc/vulkan/icd.d
 DRIVER:                 /home/penghuang/.local/share/vulkan/icd.d
 DRIVER:                 /usr/share/gnome/vulkan/icd.d
@@ -43,7 +41,6 @@
 DRIVER:              In following folders:
 DRIVER:                 /home/penghuang/.config/vulkan/icd.d
 DRIVER:                 /etc/xdg/vulkan/icd.d
-DRIVER:                 /home/penghuang/sources/KhronosGroup/Vulkan-Loader/build/install/etc/vulkan/icd.d
 DRIVER:                 /etc/vulkan/icd.d
 DRIVER:                 /home/penghuang/.local/share/vulkan/icd.d
 DRIVER:                 /usr/share/gnome/vulkan/icd.d
@@ -71,7 +68,6 @@
 LAYER:               In following folders:
 LAYER:                  /home/penghuang/.config/vulkan/implicit_layer.d
 LAYER:                  /etc/xdg/vulkan/implicit_layer.d
-LAYER:                  /home/penghuang/sources/KhronosGroup/Vulkan-Loader/build/install/etc/vulkan/implicit_layer.d
 LAYER:                  /etc/vulkan/implicit_layer.d
 LAYER:                  /home/penghuang/.local/share/vulkan/implicit_layer.d
 LAYER:                  /usr/share/gnome/vulkan/implicit_layer.d
@@ -84,7 +80,6 @@
 LAYER:               In following folders:
 LAYER:                  /home/penghuang/.config/vulkan/implicit_layer.d
 LAYER:                  /etc/xdg/vulkan/implicit_layer.d
-LAYER:                  /home/penghuang/sources/KhronosGroup/Vulkan-Loader/build/install/etc/vulkan/implicit_layer.d
 LAYER:                  /etc/vulkan/implicit_layer.d
 LAYER:                  /home/penghuang/.local/share/vulkan/implicit_layer.d
 LAYER:                  /usr/share/gnome/vulkan/implicit_layer.d
@@ -97,7 +92,6 @@
 DRIVER:              In following folders:
 DRIVER:                 /home/penghuang/.config/vulkan/icd.d
 DRIVER:                 /etc/xdg/vulkan/icd.d
-DRIVER:                 /home/penghuang/sources/KhronosGroup/Vulkan-Loader/build/install/etc/vulkan/icd.d
 DRIVER:                 /etc/vulkan/icd.d
 DRIVER:                 /home/penghuang/.local/share/vulkan/icd.d
 DRIVER:                 /usr/share/gnome/vulkan/icd.d
@@ -125,7 +119,6 @@
 LAYER:               In following folders:
 LAYER:                  /home/penghuang/.config/vulkan/implicit_layer.d
 LAYER:                  /etc/xdg/vulkan/implicit_layer.d
-LAYER:                  /home/penghuang/sources/KhronosGroup/Vulkan-Loader/build/install/etc/vulkan/implicit_layer.d
 LAYER:                  /etc/vulkan/implicit_layer.d
 LAYER:                  /home/penghuang/.local/share/vulkan/implicit_layer.d
 LAYER:                  /usr/share/gnome/vulkan/implicit_layer.d
@@ -138,7 +131,6 @@
 LAYER:               In following folders:
 LAYER:                  /home/penghuang/.config/vulkan/implicit_layer.d
 LAYER:                  /etc/xdg/vulkan/implicit_layer.d
-LAYER:                  /home/penghuang/sources/KhronosGroup/Vulkan-Loader/build/install/etc/vulkan/implicit_layer.d
 LAYER:                  /etc/vulkan/implicit_layer.d
 LAYER:                  /home/penghuang/.local/share/vulkan/implicit_layer.d
 LAYER:                  /usr/share/gnome/vulkan/implicit_layer.d
@@ -151,7 +143,6 @@
 LAYER:               In following folders:
 LAYER:                  /home/penghuang/.config/vulkan/explicit_layer.d
 LAYER:                  /etc/xdg/vulkan/explicit_layer.d
-LAYER:                  /home/penghuang/sources/KhronosGroup/Vulkan-Loader/build/install/etc/vulkan/explicit_layer.d
 LAYER:                  /etc/vulkan/explicit_layer.d
 LAYER:                  /home/penghuang/.local/share/vulkan/explicit_layer.d
 LAYER:                  /usr/share/gnome/vulkan/explicit_layer.d
@@ -164,7 +155,6 @@
 DRIVER:              In following folders:
 DRIVER:                 /home/penghuang/.config/vulkan/icd.d
 DRIVER:                 /etc/xdg/vulkan/icd.d
-DRIVER:                 /home/penghuang/sources/KhronosGroup/Vulkan-Loader/build/install/etc/vulkan/icd.d
 DRIVER:                 /etc/vulkan/icd.d
 DRIVER:                 /home/penghuang/.local/share/vulkan/icd.d
 DRIVER:                 /usr/share/gnome/vulkan/icd.d
@@ -203,44 +193,4 @@
 LAYER:                 ||
 LAYER:               <Drivers>
 
-INFO | DRIVER:    linux_read_sorted_physical_devices:  Original order:
-INFO | DRIVER:               [0] llvmpipe (LLVM 12.0.1, 256 bits)
-INFO | DRIVER:               [1] NVIDIA GeForce GTX 745
-INFO | DRIVER:    linux_read_sorted_physical_devices:  Order set to:
-INFO | DRIVER:               [0] NVIDIA GeForce GTX 745  
-INFO | DRIVER:               [1] llvmpipe (LLVM 12.0.1, 256 bits)  
-INFO | DRIVER:    linux_read_sorted_physical_devices:  Original order:
-INFO | DRIVER:               [0] llvmpipe (LLVM 12.0.1, 256 bits)
-INFO | DRIVER:               [1] NVIDIA GeForce GTX 745
-INFO | DRIVER:    linux_read_sorted_physical_devices:  Order set to:
-INFO | DRIVER:               [0] NVIDIA GeForce GTX 745  
-INFO | DRIVER:               [1] llvmpipe (LLVM 12.0.1, 256 bits)  
-INFO | DRIVER:    Copying old device 0 into new device 0
-INFO | DRIVER:    Copying old device 1 into new device 1
-INFO | DRIVER:    linux_read_sorted_physical_devices:  Original order:
-INFO | DRIVER:               [0] llvmpipe (LLVM 12.0.1, 256 bits)
-INFO | DRIVER:               [1] NVIDIA GeForce GTX 745
-INFO | DRIVER:    linux_read_sorted_physical_devices:  Order set to:
-INFO | DRIVER:               [0] NVIDIA GeForce GTX 745  
-INFO | DRIVER:               [1] llvmpipe (LLVM 12.0.1, 256 bits)  
-INFO | DRIVER:    Copying old device 0 into new device 0
-INFO | DRIVER:    Copying old device 1 into new device 1
-INFO | DRIVER:    linux_read_sorted_physical_devices:  Original order:
-INFO | DRIVER:               [0] llvmpipe (LLVM 12.0.1, 256 bits)
-INFO | DRIVER:               [1] NVIDIA GeForce GTX 745
-INFO | DRIVER:    linux_read_sorted_physical_devices:  Order set to:
-INFO | DRIVER:               [0] NVIDIA GeForce GTX 745  
-INFO | DRIVER:               [1] llvmpipe (LLVM 12.0.1, 256 bits)  
-INFO | DRIVER:    Copying old device 0 into new device 0
-INFO | DRIVER:    Copying old device 1 into new device 1
 Selected GPU 0: NVIDIA GeForce GTX 745, type: 2
-DEBUG | LAYER:    Loading layer library libVkLayer_MESA_device_select.so
-INFO | LAYER:     Failed to find vkGetDeviceProcAddr in layer libVkLayer_MESA_device_select.so
-DRIVER | LAYER:   vkCreateDevice layer callstack setup to:
-DRIVER | LAYER:      <Application>
-DRIVER | LAYER:        ||
-DRIVER | LAYER:      <Loader>
-DRIVER | LAYER:        ||
-DRIVER | LAYER:      <Device>
-DRIVER | LAYER:          Using "NVIDIA GeForce GTX 745" using driver "libGLX_nvidia.so.0"

@null77
Contributor

null77 commented Mar 24, 2022

@phuang I think the default loader build won't load SwiftShader the way Chromium uses it.

@phuang
Contributor Author

phuang commented Mar 24, 2022

I found that the change below fixes the problem for me. What is LOADER_ENABLE_LINUX_SORT?

penghuang@penghuang-linux:~/sources/angle/third_party/vulkan-deps/vulkan-loader/src$ git diff
diff --git a/BUILD.gn b/BUILD.gn
index eb528d458..4f7e57680 100644
--- a/BUILD.gn
+++ b/BUILD.gn
@@ -76,6 +76,7 @@ config("vulkan_loader_config") {
   defines = [
     "API_NAME=\"Vulkan\"",
     "USE_UNSAFE_FILE_SEARCH=1",
+    "LOADER_ENABLE_LINUX_SORT",
   ]
 
   if (is_win) {

@MarkY-LunarG
Collaborator

Good, that's a good clue.

It produces a consistent, sorted device order across all runs. It sorts discrete physical devices first by PCI bus ID, then integrated devices (also by PCI bus ID if there is more than one), then software implementations. Previously, the order would vary based on the order in which entries were read off the directory using readdir (which returns entries in no guaranteed order).
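As a rough illustration of that ordering rule (a minimal sketch, not the loader's actual C implementation), the sort can be modeled as a key of device type followed by PCI bus ID. The `PhysicalDevice` class, the `DeviceType` enum, and the `pci_bus` value below are hypothetical names and data invented for this example; only the two device names come from the log above.

```python
from dataclasses import dataclass
from enum import IntEnum


class DeviceType(IntEnum):
    # Lower value sorts first: discrete, then integrated, then software.
    DISCRETE = 0
    INTEGRATED = 1
    SOFTWARE = 2


@dataclass(frozen=True)
class PhysicalDevice:
    # Hypothetical record; the real loader tracks far more state per device.
    name: str
    device_type: DeviceType
    pci_bus: int = 0  # assumed value; not meaningful for software devices


def sort_devices(devices):
    """Return devices ordered by type, then by PCI bus ID within a type."""
    return sorted(devices, key=lambda d: (d.device_type, d.pci_bus))


# Enumeration order as it appears under "Original order" in the log.
enumerated = [
    PhysicalDevice("llvmpipe (LLVM 12.0.1, 256 bits)", DeviceType.SOFTWARE),
    PhysicalDevice("NVIDIA GeForce GTX 745", DeviceType.DISCRETE, pci_bus=1),
]

ordered = sort_devices(enumerated)
for index, device in enumerate(ordered):
    print(f"[{index}] {device.name}")
```

Because the key depends only on device properties and not on enumeration order, the result is the same no matter which order the drivers were discovered in, which is the property the sorted path is meant to provide.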

@phuang
Contributor Author

phuang commented Mar 24, 2022

Probably LOADER_ENABLE_LINUX_SORT should be removed and the loader should always sort devices?

@MarkY-LunarG
Collaborator

No, the change is correct. You should add it if you want to create a PR. But the failure when sorting is disabled is still an issue that I'm looking into.

@phuang
Contributor Author

phuang commented Mar 24, 2022

I see. Please review #889

@MarkY-LunarG
Collaborator

I've reproduced the issue on Linux with sorting disabled. I'll keep this issue open until I solve that problem.

@MarkY-LunarG
Collaborator

Just pushed up a fix as well as tests to catch this in the future. I verified that with sorting disabled and the fix not present the tests fail. They pass now with the fix (and sorting disabled) and with sorting enabled on Linux. All in #890.

Thanks for being patient @phuang!

@null77
Contributor

null77 commented Mar 24, 2022

Great work Mark and Peng for pinpointing & fixing the issue!

MarkY-LunarG added a commit that referenced this issue Mar 24, 2022
The physical device terminator was missing the ICD index in the
non-sorted path.  This caused crashes in Angle before it was realized
that the sorting code was unintentionally disabled in that build
path.

Also, add tests to catch this case in the future in the WSI code, but
this required converting all the TEST_F tests to TEST since Gtest
didn't like mixing the 2 on my system.

Finally, fix a few WSI error messages in the loader which were
missing spaces.

Fixes #863 for non-sorting paths
charles-lunarg pushed a commit that referenced this issue Mar 25, 2022