Crash because of unload_drivers_without_physical_devices() #1410

Closed
y-novikov opened this issue Dec 18, 2023 · 6 comments · Fixed by #1471
@y-novikov
Contributor

y-novikov commented Dec 18, 2023

Describe the bug
Chrome tests crash with 28f76ed

Environment (please complete the following information):

To Reproduce
Steps to reproduce the behavior:

  1. Check out Chromium
  2. Apply https://chromium-review.googlesource.com/c/chromium/src/+/5111241
  3. Build browser_tests with GN args:
    dcheck_always_on = true
    devtools_skip_typecheck = false
    enable_backup_ref_ptr_feature_flag = true
    enable_dangling_raw_ptr_checks = true
    enable_dangling_raw_ptr_feature_flag = true
    ffmpeg_branding = "Chrome"
    is_component_build = false
    is_debug = false
    proprietary_codecs = true
    symbol_level = 0
  4. Run ./browser_tests --gtest_filter=OptimizationGuideKeyedServiceBrowserTest.MainToggleUpdatesSettingsCorrectly
  5. See error https://ci.chromium.org/ui/p/chromium/builders/try/linux-rel/1644275/overview:
Received signal 11 SI_KERNEL000000000000
 Possibly a General Protection Fault, can be due to a non-canonical address dereference. See "Intel 64 and IA-32 Architectures Software Developer’s Manual", Volume 1, Section 3.3.7.1.
#0 0x557a6eb26e02 base::debug::CollectStackTrace()
#1 0x557a6eb0ca83 base::debug::StackTrace::StackTrace()
#2 0x557a6eb267b1 base::debug::(anonymous namespace)::StackDumpSignalHandler()
#3 0x7fe717442520 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x4251f)
#4 0x7fe7098e3306 <unknown>
#5 0x7fe71478d888 <unknown>
#6 0x7fe71478d66e <unknown>
#7 0x557a689a394d dawn::native::vulkan::VulkanInstance::~VulkanInstance()
#8 0x557a689a3a8e dawn::native::vulkan::VulkanInstance::~VulkanInstance()
#9 0x557a689a54ee dawn::native::vulkan::Backend::~Backend()
#10 0x557a6890a15e std::__Cr::array<>::~array()
#11 0x557a6890a014 dawn::native::InstanceBase::~InstanceBase()
#12 0x557a6890a20e dawn::native::InstanceBase::~InstanceBase()
#13 0x557a6890a3c4 dawn::native::InstanceBase::DeleteThis()
#14 0x557a71d5d026 dawn::native::Instance::~Instance()
#15 0x557a77310c3b on_device_model::OnDeviceModelService::PreSandboxInit()
#16 0x557a6d78c22c content::UtilityMain()
#17 0x557a6deb0d08 content::RunZygote()
#18 0x557a6deb1c83 content::RunOtherNamedProcessTypeMain()
#19 0x557a6deb3812 content::ContentMainRunnerImpl::Run()
#20 0x557a6deafeea content::RunContentProcess()
#21 0x557a6deb01d6 content::ContentMain()
#22 0x557a6fc23f38 content::LaunchTestsInternal()
#23 0x557a6fc24728 content::LaunchTests()
#24 0x557a6e8f9c75 LaunchChromeTests()
#25 0x557a6e8ee805 main
#26 0x7fe717429d90 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x29d8f)
#27 0x7fe717429e40 __libc_start_main
#28 0x557a644eb02a _start
  r8: 0000000000000000  r9: 0000000000000000 r10: 0000000000000000 r11: 0000000000000206
 r12: 000019f000048c00 r13: 000019f000048df0 r14: 0000000000000000 r15: 000019f000194000
  di: 000019f000048df0  si: 0000000000000000  bp: 000019f0000b0c00  bx: 0000000000000000
  dx: cdcdcdcdcdcdcdcd  ax: cdcdcdcdcdcdcdcd  cx: 0000000000000000  sp: 00007ffe12d83630
  ip: 00007fe7098e3306 efl: 0000000000010246 cgf: 002b000000000033 erf: 0000000000000000
 trp: 000000000000000d msk: 0000000000000000 cr2: 0000000000000000
[end of stack trace]

VK_LOADER_DEBUG output
log.txt

Additional context
I was able to get a better stack in gdb, though it's pretty complicated to get it to attach to the right process.
The missing parts are:

#0  0x00007ffff1ac4116 in  () at /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so
#1  0x00007ffff22cd36f in terminator_DestroyDebugUtilsMessengerEXT
    (instance=0x23c0001a0000, messenger=0x23c000010300, pAllocator=0x0)
    at ../../third_party/vulkan-deps/vulkan-loader/src/loader/debug_utils.c:266
#2  0x00007ffff22ccfbe in debug_utils_DestroyDebugUtilsMessengerEXT
    (instance=0x23c0001a0000, messenger=0x23c000010300, pAllocator=0x0)
    at ../../third_party/vulkan-deps/vulkan-loader/src/loader/debug_utils.c:168
#3  0x000055557308395b in dawn::native::vulkan::VulkanInstance::~VulkanInstance() (this=0x23c000060000)
    at ../../third_party/dawn/src/dawn/native/vulkan/BackendVk.cpp:288
#4  0x00005555730839f9 in dawn::native::vulkan::VulkanInstance::~VulkanInstance() (this=0x23c000060000)
    at ../../third_party/dawn/src/dawn/native/vulkan/BackendVk.cpp:280

So, I think that the actual crash is in Mesa's libvulkan_lvp.so, but the change in #1374 now triggers it.
Also, by commenting out code, I found that the trigger is the DestroyInstance() call in 28f76ed#diff-2777bd8d9c8897ff9c11c52c69dfab1851581f674a5122c6ff3d70a3a3f15e1bR6393.

@shrekshao maybe has a simpler repro with dawn_end2end_tests.

@charles-lunarg, could you PTAL?
This has high priority for us, since it blocks updating Vulkan dependencies in Chrome and Dawn, and we can't just skip the breaking commit, since it's close to the update to the 1.3.273 headers.

@y-novikov y-novikov added the bug Something isn't working label Dec 18, 2023
@charles-lunarg
Collaborator

I know the exact problem as we are seeing it internally as well.

@charles-lunarg
Collaborator

Went ahead and reverted it - should have done so last week when the issue was identified internally, as it was clear this is a major bug. The fix is not trivial, so I didn't see a quick fix being applied.

@dneto0

dneto0 commented Dec 18, 2023

Thanks! We saw this last week too but we were hesitant about filing; we wanted to find a more crisp reproducer.
Good luck with the fix. We're not waiting on it.

@charles-lunarg
Collaborator

charles-lunarg commented Dec 18, 2023

It's a nasty case of "invalidated invariants are invariably infuriating".
The loader created quite a number of arrays sized by the number of drivers. That number was constant from vkCreateInstance to vkDestroyInstance, which meant the indices into those arrays were stable. With unloading they are no longer stable, and thus the wrong elements, or elements past the end, were accessed.

Edit: forgot to mention but I'm removing the bug tag as the bad code is reverted.
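
The index instability described in this comment can be illustrated with a small sketch. This is not the loader's code: `driver_list`, `per_driver_state`, `snapshot_state`, `unload_drivers_without_devices`, and `stale_lookup` are hypothetical names invented for the example. It assumes a side table filled once per driver index at instance creation, and a later step that compacts the driver list without touching that side table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names come from the Vulkan loader. */
#define MAX_DRIVERS 4

struct driver_list {
    const char *names[MAX_DRIVERS];
    size_t count;
};

/* Side table captured at "instance creation": one slot per driver index. */
struct per_driver_state {
    int device_counts[MAX_DRIVERS];
    size_t count;
};

/* Record per-driver data using the driver indices valid at creation time. */
static void snapshot_state(const struct driver_list *drivers,
                           const int *device_counts,
                           struct per_driver_state *out)
{
    assert(drivers->count <= MAX_DRIVERS);
    out->count = drivers->count;
    for (size_t i = 0; i < drivers->count; i++) {
        out->device_counts[i] = device_counts[i];
    }
}

/* Drop drivers that reported no devices and compact the list in place.
 * The side table is deliberately left untouched, which is the bug. */
static size_t unload_drivers_without_devices(struct driver_list *drivers,
                                             const struct per_driver_state *state)
{
    size_t kept = 0;
    for (size_t i = 0; i < drivers->count; i++) {
        if (state->device_counts[i] > 0) {
            drivers->names[kept++] = drivers->names[i];
        }
    }
    size_t removed = drivers->count - kept;
    drivers->count = kept;
    return removed;
}

/* Look up side-table data through the driver's *current* index.
 * Correct only while the driver list has not been compacted. */
static int stale_lookup(const struct driver_list *drivers,
                        const struct per_driver_state *state,
                        const char *name)
{
    for (size_t i = 0; i < drivers->count; i++) {
        if (strcmp(drivers->names[i], name) == 0) {
            return state->device_counts[i];
        }
    }
    return -1;
}
```

With three drivers where only the first has no devices, lookups are correct before the unload step; afterwards every surviving driver reads its neighbour's slot. Keeping such tables keyed by a stable identifier, or rebuilding them whenever the list changes, avoids this class of error.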

@charles-lunarg charles-lunarg removed the bug Something isn't working label Dec 18, 2023
@tycho

tycho commented Dec 20, 2023

Is there still some bug lingering here, or possibly some other bad commit? I was running into app-level crashes, and as a debugging step I tried doing a simple "vulkaninfo" against the libvulkan provided by the ANGLE tree (which is currently pointing to commit 40633a6). It seems to be doing very strange things indeed:

(gdb) bt
#0  terminator_CreateWaylandSurfaceKHR (instance=0x5555557f9c00, pCreateInfo=0x55555586ce38, 
    pAllocator=0x5550002ad289, pSurface=0x5555557fa1a0)
    at /usr/src/debug/vulkan-icd-loader/Vulkan-Loader-1.3.274/loader/wsi.c:722
#1  0x00007ffff7d14f7e in vkGetPhysicalDeviceFeatures2 (physicalDevice=0x5555557fa1a0, 
    pFeatures=0x55555586ce38)
    at ../../third_party/vulkan-deps/vulkan-loader/src/loader/trampoline.c:2650
#2  0x000055555556c95d in AppGpu::AppGpu (this=<optimized out>, inst=..., id=<optimized out>, 
    phys_device=<optimized out>, this=<optimized out>, inst=..., id=<optimized out>, 
    phys_device=<optimized out>)
    at /usr/src/debug/vulkan-tools/Vulkan-Tools-1.3.269/vulkaninfo/./vulkaninfo.h:1715
#3  0x0000555555565566 in main (argc=<optimized out>, argv=<optimized out>)
    at /usr/src/debug/vulkan-tools/Vulkan-Tools-1.3.269/vulkaninfo/vulkaninfo.cpp:1177
(gdb) up
#1  0x00007ffff7d14f7e in vkGetPhysicalDeviceFeatures2 (physicalDevice=0x5555557fa1a0, 
    pFeatures=0x55555586ce38)
    at ../../third_party/vulkan-deps/vulkan-loader/src/loader/trampoline.c:2650
2650            disp->GetPhysicalDeviceFeatures2KHR(unwrapped_phys_dev, pFeatures);
(gdb) p disp->GetPhysicalDeviceFeatures2KHR
$5 = (PFN_vkGetPhysicalDeviceFeatures2KHR) 0x7ffff7ea9570 <terminator_CreateWaylandSurfaceKHR>

How on earth did the pointer for GetPhysicalDeviceFeatures2KHR end up pointing to terminator_CreateWaylandSurfaceKHR?

That's not the only pointer that's clearly wrong, either:

(gdb) p *disp
$7 = {
  GetPhysicalDeviceProcAddr = 0x0,
  CreateInstance = 0x0,
  DestroyInstance = 0x7ffff53b2c20 <device_select_DestroyInstance>,
  EnumeratePhysicalDevices = 0x7ffff53b3120 <device_select_EnumeratePhysicalDevices>,
  GetPhysicalDeviceFeatures = 0x7ffff7e8ecb0 <terminator_GetPhysicalDeviceFeatures>,
  GetPhysicalDeviceFormatProperties = 0x7ffff7e8ece0 <terminator_GetPhysicalDeviceFormatProperties>,
  GetPhysicalDeviceImageFormatProperties = 0x7ffff7e961e0 <terminator_GetPhysicalDeviceImageFormatProperties>,
  GetPhysicalDeviceProperties = 0x7ffff7e8ec20 <terminator_GetPhysicalDeviceProperties>,
  GetPhysicalDeviceQueueFamilyProperties = 0x7ffff7e8ec50 <terminator_GetPhysicalDeviceQueueFamilyProperties>,
  GetPhysicalDeviceMemoryProperties = 0x7ffff7e8ec80 <terminator_GetPhysicalDeviceMemoryProperties>,
  GetInstanceProcAddr = 0x7ffff53b2800 <get_instance_proc_addr>,
  CreateDevice = 0x0,
  EnumerateInstanceExtensionProperties = 0x0,
  EnumerateDeviceExtensionProperties = 0x7ffff7e948e0 <terminator_EnumerateDeviceExtensionProperties>,
  EnumerateInstanceLayerProperties = 0x0,
  EnumerateDeviceLayerProperties = 0x7ffff7e96240 <terminator_EnumerateDeviceLayerProperties>,
  GetPhysicalDeviceSparseImageFormatProperties = 0x7ffff7e8ed10 <terminator_GetPhysicalDeviceSparseImageFormatProperties>,
  EnumerateInstanceVersion = 0x0,
  EnumeratePhysicalDeviceGroups = 0x7ffff53b28e0 <device_select_EnumeratePhysicalDeviceGroups>,
  GetPhysicalDeviceFeatures2 = 0x7ffff7e96270 <terminator_GetPhysicalDeviceFeatures2>,
  GetPhysicalDeviceProperties2 = 0x7ffff7e96380 <terminator_GetPhysicalDeviceProperties2>,
  GetPhysicalDeviceFormatProperties2 = 0x7ffff7e964d0 <terminator_GetPhysicalDeviceFormatProperties2>,
  GetPhysicalDeviceImageFormatProperties2 = 0x7ffff7e965d0 <terminator_GetPhysicalDeviceImageFormatProperties2>,
  GetPhysicalDeviceQueueFamilyProperties2 = 0x7ffff7e966d0 <terminator_GetPhysicalDeviceQueueFamilyProperties2>,
  GetPhysicalDeviceMemoryProperties2 = 0x7ffff7e968b0 <terminator_GetPhysicalDeviceMemoryProperties2>,
  GetPhysicalDeviceSparseImageFormatProperties2 = 0x7ffff7e96980 <terminator_GetPhysicalDeviceSparseImageFormatProperties2>,
  GetPhysicalDeviceExternalBufferProperties = 0x7ffff7e96bc0 <terminator_GetPhysicalDeviceExternalBufferProperties>,
  GetPhysicalDeviceExternalFenceProperties = 0x7ffff7e96e00 <terminator_GetPhysicalDeviceExternalFenceProperties>,
  GetPhysicalDeviceExternalSemaphoreProperties = 0x7ffff7e96ce0 <terminator_GetPhysicalDeviceExternalSemaphoreProperties>,
  GetPhysicalDeviceToolProperties = 0x7ffff7e96f20 <terminator_GetPhysicalDeviceToolProperties>,
  DestroySurfaceKHR = 0x7ffff7ea5720 <terminator_DestroySurfaceKHR>,
  GetPhysicalDeviceSurfaceSupportKHR = 0x7ffff7ea9190 <terminator_GetPhysicalDeviceSurfaceSupportKHR>,
  GetPhysicalDeviceSurfaceCapabilitiesKHR = 0x7ffff7ea9260 <terminator_GetPhysicalDeviceSurfaceCapabilitiesKHR>,
  GetPhysicalDeviceSurfaceFormatsKHR = 0x7ffff7ea9340 <terminator_GetPhysicalDeviceSurfaceFormatsKHR>,
  GetPhysicalDeviceSurfacePresentModesKHR = 0x7ffff7ea9410 <terminator_GetPhysicalDeviceSurfacePresentModesKHR>,
  GetPhysicalDevicePresentRectanglesKHR = 0x7ffff7eaa4f0 <terminator_GetPhysicalDevicePresentRectanglesKHR>,
  GetPhysicalDeviceDisplayPropertiesKHR = 0x7ffff7ea9f80 <terminator_GetPhysicalDeviceDisplayPropertiesKHR>,
  GetPhysicalDeviceDisplayPlanePropertiesKHR = 0x7ffff7eaa000 <terminator_GetPhysicalDeviceDisplayPlanePropertiesKHR>,
  GetDisplayPlaneSupportedDisplaysKHR = 0x7ffff7eaa080 <terminator_GetDisplayPlaneSupportedDisplaysKHR>,
  GetDisplayModePropertiesKHR = 0x7ffff7eaa100 <terminator_GetDisplayModePropertiesKHR>,
  CreateDisplayModeKHR = 0x7ffff7eaa180 <terminator_CreateDisplayModeKHR>,
  GetDisplayPlaneCapabilitiesKHR = 0x7ffff7eaa200 <terminator_GetDisplayPlaneCapabilitiesKHR>,
  CreateDisplayPlaneSurfaceKHR = 0x7ffff7eaa290 <terminator_CreateDisplayPlaneSurfaceKHR>,
  CreateXcbSurfaceKHR = 0x7ffff7ea9ab0 <terminator_CreateXlibSurfaceKHR>,
  GetPhysicalDeviceXcbPresentationSupportKHR = 0x7ffff7ea9ce0 <terminator_GetPhysicalDeviceXlibPresentationSupportKHR>,
  GetPhysicalDeviceVideoCapabilitiesKHR = 0x7ffff7ea9810 <terminator_CreateXcbSurfaceKHR>,
  GetPhysicalDeviceVideoFormatPropertiesKHR = 0x7ffff7ea9a40 <terminator_GetPhysicalDeviceXcbPresentationSupportKHR>,
  GetPhysicalDeviceFeatures2KHR = 0x7ffff7ea9570 <terminator_CreateWaylandSurfaceKHR>,
  GetPhysicalDeviceProperties2KHR = 0x7ffff7ea97a0 <terminator_GetPhysicalDeviceWaylandPresentationSupportKHR>,
  GetPhysicalDeviceFormatProperties2KHR = 0x7ffff7e7ad20 <terminator_GetPhysicalDeviceVideoCapabilitiesKHR>,
  GetPhysicalDeviceImageFormatProperties2KHR = 0x7ffff7e7ad60 <terminator_GetPhysicalDeviceVideoFormatPropertiesKHR>,
  GetPhysicalDeviceQueueFamilyProperties2KHR = 0x7ffff7e96270 <terminator_GetPhysicalDeviceFeatures2>,
  GetPhysicalDeviceMemoryProperties2KHR = 0x7ffff7e96380 <terminator_GetPhysicalDeviceProperties2>,
  GetPhysicalDeviceSparseImageFormatProperties2KHR = 0x7ffff7e964d0 <terminator_GetPhysicalDeviceFormatProperties2>,
  EnumeratePhysicalDeviceGroupsKHR = 0x7ffff7e965d0 <terminator_GetPhysicalDeviceImageFormatProperties2>,
  GetPhysicalDeviceExternalBufferPropertiesKHR = 0x7ffff7e966d0 <terminator_GetPhysicalDeviceQueueFamilyProperties2>,
  GetPhysicalDeviceExternalSemaphorePropertiesKHR = 0x7ffff7e968b0 <terminator_GetPhysicalDeviceMemoryProperties2>,
  GetPhysicalDeviceExternalFencePropertiesKHR = 0x7ffff7e96980 <terminator_GetPhysicalDeviceSparseImageFormatProperties2>,
  EnumeratePhysicalDeviceQueueFamilyPerformanceQueryCountersKHR = 0x7ffff7e95470 <terminator_EnumeratePhysicalDeviceGroups>,
  GetPhysicalDeviceQueueFamilyPerformanceQueryPassesKHR = 0x7ffff7e96bc0 <terminator_GetPhysicalDeviceExternalBufferProperties>,
  GetPhysicalDeviceSurfaceCapabilities2KHR = 0x7ffff7e96ce0 <terminator_GetPhysicalDeviceExternalSemaphoreProperties>,
  GetPhysicalDeviceSurfaceFormats2KHR = 0x7ffff7e96e00 <terminator_GetPhysicalDeviceExternalFenceProperties>,
  GetPhysicalDeviceDisplayProperties2KHR = 0x7ffff7e7ada0 <terminator_EnumeratePhysicalDeviceQueueFamilyPerformanceQueryCountersKHR>,
  GetPhysicalDeviceDisplayPlaneProperties2KHR = 0x7ffff7e7ade0 <terminator_GetPhysicalDeviceQueueFamilyPerformanceQueryPassesKHR>,
  GetDisplayModeProperties2KHR = 0x7ffff7eaaaf0 <terminator_GetPhysicalDeviceSurfaceCapabilities2KHR>,
  GetDisplayPlaneCapabilities2KHR = 0x7ffff7eaad20 <terminator_GetPhysicalDeviceSurfaceFormats2KHR>,
  GetPhysicalDeviceFragmentShadingRatesKHR = 0x7ffff7eaa570 <terminator_GetPhysicalDeviceDisplayProperties2KHR>,
  GetPhysicalDeviceVideoEncodeQualityLevelPropertiesKHR = 0x7ffff7eaa710 <terminator_GetPhysicalDeviceDisplayPlaneProperties2KHR>,
  GetPhysicalDeviceCooperativeMatrixPropertiesKHR = 0x7ffff7eaa890 <terminator_GetDisplayModeProperties2KHR>,
  GetPhysicalDeviceCalibrateableTimeDomainsKHR = 0x7ffff7eaaa30 <terminator_GetDisplayPlaneCapabilities2KHR>,
  CreateDebugReportCallbackEXT = 0x7ffff7e7ae20 <terminator_GetPhysicalDeviceFragmentShadingRatesKHR>,
  DestroyDebugReportCallbackEXT = 0x7ffff7e7ae60 <terminator_GetPhysicalDeviceVideoEncodeQualityLevelPropertiesKHR>,
  DebugReportMessageEXT = 0x7ffff7e7aea0 <terminator_GetPhysicalDeviceCooperativeMatrixPropertiesKHR>,
  GetPhysicalDeviceExternalImageFormatPropertiesNV = 0x7ffff7e7aee0 <terminator_GetPhysicalDeviceCalibrateableTimeDomainsKHR>,
  ReleaseDisplayEXT = 0x7ffff7e723d0 <terminator_CreateDebugReportCallbackEXT>,
  GetPhysicalDeviceSurfaceCapabilities2EXT = 0x7ffff7e71ac0 <terminator_DestroyDebugReportCallbackEXT>,
  CreateDebugUtilsMessengerEXT = 0x7ffff7e72af0 <terminator_DebugReportMessageEXT>,
  DestroyDebugUtilsMessengerEXT = 0x7ffff7e71880 <terminator_GetPhysicalDeviceExternalImageFormatPropertiesNV>,
  SubmitDebugUtilsMessageEXT = 0x7ffff7e7a520 <terminator_ReleaseDisplayEXT>,
  GetPhysicalDeviceMultisamplePropertiesEXT = 0x7ffff7e7a560 <terminator_AcquireXlibDisplayEXT>,
  GetPhysicalDeviceCalibrateableTimeDomainsEXT = 0x7ffff7e7a5b0 <terminator_GetRandROutputDisplayEXT>,
  GetPhysicalDeviceToolPropertiesEXT = 0x7ffff7e7a3e0 <terminator_GetPhysicalDeviceSurfaceCapabilities2EXT>,
  GetPhysicalDeviceCooperativeMatrixPropertiesNV = 0x7ffff7e72140 <terminator_CreateDebugUtilsMessengerEXT>,
  GetPhysicalDeviceSupportedFramebufferMixedSamplesCombinationsNV = 0x7ffff7e71900 <terminator_DestroyDebugUtilsMessengerEXT>,
  CreateHeadlessSurfaceEXT = 0x7ffff7e78dc0 <terminator_SubmitDebugUtilsMessageEXT>,
  AcquireDrmDisplayEXT = 0x7ffff7e7af20 <terminator_GetPhysicalDeviceMultisamplePropertiesEXT>,
  GetDrmDisplayEXT = 0x7ffff7e7af60 <terminator_GetPhysicalDeviceCalibrateableTimeDomainsEXT>,
  GetPhysicalDeviceOpticalFlowImageFormatsNV = 0x7ffff7e77380 <terminator_GetPhysicalDeviceToolPropertiesEXT>
}

@tycho

tycho commented Dec 20, 2023

Is there still some bug lingering here, or possibly some other bad commit?

Actually, never mind. It seems something was loading /usr/lib/libvulkan.so (ignoring LD_LIBRARY_PATH), so the process ended up with two different Vulkan loader libraries in it, with different dispatch tables.
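
For readers wondering how two loader builds in one process could produce the shifted pointers in the dump above, here is a minimal sketch of the general mechanism. It is an illustration under assumptions, not a reconstruction of the actual tables: `dispatch_a`, `dispatch_b`, `fill_table_b`, and `call_features2_via_a` are hypothetical, and the only assumption is that one build's table has an extra entry ahead of a field that both builds share.

```c
#include <assert.h>
#include <string.h>

typedef int (*entry_fn)(void);

/* Stand-in entry points; the return values only identify which one ran. */
static int features2_khr(void) { return 1; }
static int create_wayland_surface(void) { return 2; }
static int create_xcb_surface(void) { return 3; }

/* Table layout compiled into hypothetical loader build A. */
struct dispatch_a {
    entry_fn create_xcb_surface;
    entry_fn get_features2_khr;
};

/* Table layout in hypothetical build B: one extra entry in the middle. */
struct dispatch_b {
    entry_fn create_xcb_surface;
    entry_fn create_wayland_surface;
    entry_fn get_features2_khr;
};

/* Build B fills the table according to its own layout. */
static void fill_table_b(struct dispatch_b *table)
{
    table->create_xcb_surface = create_xcb_surface;
    table->create_wayland_surface = create_wayland_surface;
    table->get_features2_khr = features2_khr;
}

/* Build A reads the same bytes through its own, shorter layout.
 * Its second slot lines up with build B's create_wayland_surface. */
static int call_features2_via_a(const struct dispatch_b *table)
{
    struct dispatch_a view;
    assert(sizeof view <= sizeof *table);
    memcpy(&view, table, sizeof view);
    return view.get_features2_khr();
}
```

Calling through build B's own layout reaches `features2_khr`, while the same table viewed through build A's layout reaches `create_wayland_surface`, which mirrors the kind of mismatch seen for GetPhysicalDeviceFeatures2KHR in the gdb output.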
