Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

False positive with semaphore lifetime tracking #5568

Closed
ShabbyX opened this issue Apr 4, 2023 · 9 comments
Closed

False positive with semaphore lifetime tracking #5568

ShabbyX opened this issue Apr 4, 2023 · 9 comments
Assignees
Labels
Bug Something isn't working Synchronization Synchronization Validation Object Issue WSI Window System Integration related issues

Comments

@ShabbyX
Copy link
Contributor

ShabbyX commented Apr 4, 2023

Since 27ba73e, ANGLE's been hitting a validation error saying that a semaphore being destroyed is in use by the command buffers.

I finally got around to looking into it, and the failure is coming from when ANGLE is destroying semaphores used for vkQueuePresentKHR. This is happening on surface destruction after a call to vkQueueWaitIdle. Additionally, if VK_EXT_swapchain_maintenance1 is supported, the present fences are waited on, if any.

On the platform I'm hitting this, VK_EXT_swapchain_maintenance1 is not supported. However, vkQueueWaitIdle is supposed to have been enough to destroy the semaphores (because there is no alternative). This might be a bug, or a race condition with the above change?

Context: anglebug.com/7729

Need to unsuppress VUID-vkDestroySemaphore-semaphore-01137 in RendererVk.cpp before running the test.

@spencer-lunarg spencer-lunarg added Bug Something isn't working Synchronization Synchronization Validation Object Issue WSI Window System Integration related issues labels Apr 4, 2023
@jeremyg-lunarg jeremyg-lunarg self-assigned this Apr 4, 2023
@jeremyg-lunarg
Copy link
Contributor

This is Issue #12 I suspect that the commit mentioned above changed timing so that this problem started happening in ANGLE.

Playing with the tests on a platform that doesn't provide VK_EXT_swapchain_maintenance1 I saw the following:

  • vkQueuePresentKHR(... semaphore_a)
  • vkAcquireNextImageKHR(..., fence_b)
  • a bunch of queue submissions
  • vkGetFenceStatus(fence_b)
  • vkDestroySemaphore(semaphore_a)
  • VUID-vkDestroySemaphore-semaphore-01137

I do NOT see ANGLE calling vkQueueWaitIdle() in the tests that hit this error. I agree that a call to that function should prevent this error. But I also think the call sequence I observed should also not trigger the error.

@jeremyg-lunarg
Copy link
Contributor

FYI @artem-lunarg

@ShabbyX
Copy link
Contributor Author

ShabbyX commented May 8, 2023

ANGLE used to call vkQueueWaitIdle(), but now it's just checking the fence of the last submission and waits on it if it's not ready. In the sequence above, the fence was already signaled so there's no wait.

But I also think the call sequence I observed should also not trigger the error.

Agreed, it's supposed to be the same as vkQueueWaitIdle().

@artem-lunarg
Copy link
Contributor

artem-lunarg commented Aug 14, 2023

I built the latest angle and checked that 27ba73e from Sep 30, 2022, indeed, was the PR that introduced the issue or increased reproducibility of existing issue.

Moving forward, #4769 from Nov 18, 2022 by @jeremyg-lunarg fixes the issue and dEQP-EGL.functional.buffer_age.no_preserve.no_resize.odd_clear_clear_even_render passes without errors. I tested using VVL code from Nov 18 2022 before and after that PR, but also tested on the latest code to ensure we did not introduce regression. I also run the entire angle_deqp_egl_tests test suite to check if this error does not appear in other tests (with unsuppressed VUID) - it does not.

@ShabbyX could you confirm that the tests do not trigger VUID-vkDestroySemaphore-semaphore-01137 with unsuppressed VUID? In that case I'd like to close this ticket.

Regarding different scenario mentioned by Jeremy above, I will try to reproduce it, and, if it is reproduceable, then I'll open a separate issue.

I tested on the platform that does not support VK_EXT_swapchain_maintenance1.

And sorry for the delayed response.

@artem-lunarg
Copy link
Contributor

artem-lunarg commented Aug 14, 2023

UPDATE: managed to reproduce VUID-vkDestroySemaphore-semaphore-01137 in these tests:
dEQP-EGL.functional.swap_buffers_with_damage.no_resize.render (good reproducibility, happens regularly)
dEQP-EGL.functional.swap_buffers_with_damage.no_resize.clear_clear (good reproducibility, happens regularly)
dEQP-EGL.functional.query_context.get_current_display.rgba8888_window (hard to reproduce, happened once)
dEQP-EGL.functional.buffer_age.no_preserve.no_resize.odd_clear_clear_even_render (hard to reproduce, happened once)

@artem-lunarg
Copy link
Contributor

artem-lunarg commented Aug 18, 2023

@ShabbyX the api dump shows that the tests often initiate resource deletion immediately after the last vkQueuePresentKHR, without vkQueueWaitIdle or fence status polling. The error is reported for the semaphore that was used in the last vkQueuePresentKHR. I think that's expected behavior in this case. The queue operation (inserted by vkQueuePresentKHR) that inspects the wait semaphore might not have started yet after vkQueuePresentKHR returns, so deleting semaphore is unsafe.

I have attached API dump for dEQP-EGL.functional.swap_buffers_with_damage.no_resize.render test:

apidump_semaphore_ea7170000000031.txt

I saw you participated in the discussions related to present semaphore deletion and as I understand without VK_EXT_swapchain_maintenance1 it's tricky to get it right and vkQueueWaitIdle is not a perfect solution. I think we can still support vkQueueWaitIdle to mark that the present semaphore is not in use anymore since many apps use this method, and drivers probably aware of this.

I'll check more tests if they generate 01137 in other scenarios and also will try to reproduce Jeremy's use case.

@artem-lunarg
Copy link
Contributor

artem-lunarg commented Aug 19, 2023

@jeremyg-lunarg I can reproduce your test case. It looks like it's a valid behavior. Image acquire does not happen on the queue and waiting for acquire fence does not guarantee that previous present operation (which happens on the queue) finished.

Strictly speaking, waiting on the queue also does not guarantee that present operation finished because present queue batch is not included in the synchronization scope of fence signaling operation (and queue wait is equivalent to wait on the fence). I believe we still assume that vkQueueWaitIdle is a practical clean up solution for no-VK_EXT_swapchain_maintenance1 apps. Adding vkQueueWaitIdle at the end of the suggested test, to ensure (kind of) that the present request is finished allows the test to pass.

Some references for documentation purposes :
That vkQueuePresentKHR was removed from signal operation synchronization scope:
KhronosGroup/Vulkan-Docs#1150 (comment)

One suggestion to return it back to submission order, so it can be picked up by QueueWaitIdle (did not happen though):
KhronosGroup/Vulkan-Docs#152 (comment)
Also entire issue 152 is a very informative discussion.

@artem-lunarg
Copy link
Contributor

artem-lunarg commented Aug 28, 2023

I run a bunch of test sessions on the latest angle and used different versions of vvl starting from the latest code, also tried the commit mentioned in the description of this ticket 27ba73e, and the commit just before that change 19314aa.

I can confirm that all these vvl versions report the same issue: the last vkQueuePresent is followed by vkDestroySemaphore (for the semaphore that present is waiting on) without intermediate vkQueueWaitIdle. This behavior was not introduced in 27ba73e, it is reproduceable before it, although the timings might indeed be different.

It's safe to destroy semaphore when angle inserts vkGetFenceStatus between vkQueuePresent and vkDestroySemaphore. But as @ShabbyX explained above, the fence status is not checked if the previous check after vkQueueSubmit reports READY.

Even if the fence check reports READY after vkQueueSubmit, this does not provide information about the completion of vkQueuePresent since it's a different queue operation, and it was issued after the fence status check. The present operation might still be pending on the queue. And when it's running, we must guarantee the wait semaphore is alive.

VVL expects that there is vkGetFenceStatus or vkQueueWaitIdle after present since it's the practical pre-VK_EXT_swapchain_maintenance1 way to signal intent to wait for the present during the application exit sequence.

The attached api dumps are from the latest vvl and also from the vvl commit before 27ba73e.

api_latest_vvl_(8366ffa)_error_and_no_fence_check.txt
api_latest_vvl_(8366ffa)_no_error_and_there_is_fence_check.txt
apidump_19314aa_(before_27ba73e)_no_error_and_there_is_fence_check.txt
apidump_19314aa_(before_27ba73e)_error_and_no_fence_check.txt

The fix suggestion is to always call vkQueueWaitIdle after the last vkQueuePresentKHR (independently from vkQueueSubmit fence check) and before the resource clean up sequence.

@artem-lunarg
Copy link
Contributor

artem-lunarg commented Aug 28, 2023

Closing this as not a VVL bug.

So far, all verified 01137 errors is an expected behavior when a semaphore is being destroyed while there is a pending queue operation (present) that references that semaphore. If there are other use cases that suggest a VVL bug, I would be happy to fix them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working Synchronization Synchronization Validation Object Issue WSI Window System Integration related issues
Projects
None yet
Development

No branches or pull requests

4 participants