Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Queue Submission tracking no longer retiring correctly when Timeline Semaphores are reused across queues #3590

Closed
farnoy opened this issue Dec 5, 2021 · 6 comments · Fixed by #3610
Assignees

Comments

@farnoy
Copy link
Contributor

farnoy commented Dec 5, 2021

Describe the Issue
I've updated my local SDK from 1.2.189.2 to 1.2.198.1 and started getting errors like "resource still in use" where I did not get them before. I did some investigation and I think it's related to the recent refactoring of Queue submission state tracking @jeremy-lunarg . EDIT: I pinged the wrong Jeremy, sorry! @jeremyg-lunarg

The current version seems to be retiring work on the specific queue that is going to signal this semaphore (not 100% sure)

if (signaler.queue) {

Whereas the previous version, before it was refactored, iterated all the queues to check what work can be retired as of this semaphore:

for (const auto &pair : queueMap) {

With this observation in mind, I modified my app to never reuse the same Timeline Semaphore for submissions across different queue families. It normally does a dependency analysis to pack submissions onto the smallest number of timeline semaphores (unique paths through a DAG).

This resolved my issue and I haven't seen it since. I don't have a simple C++ repro, but I believe this issue arises when:

  1. A Timeline Semaphore is used as a signal semaphore on queue A
  2. The same Timeline Semaphore is used as a signal on queue B of a different family
  3. vkWaitSemaphores is used on the host to wait for the first submission to complete

I'm not sure if the validation layers end up retiring the second submission on queue B correctly. The errors I saw implied they haven't retired the first one.

Interestingly, I could not reproduce this with multiple queues of the same family. My graph analysis also finds independent compute work and submits them to different queues within the same family, but that doesn't seem to cause an issue.

Valid Usage ID
All the ones for "resource still in use" could probably get this. I've personally seen errors from vkResetCommandPool and vkDestroyAccelerationStructure:

  • VUID-vkResetCommandPool-commandPool-00040

Environment:

  • OS: Windows 10 64b
  • GPU: RX 6900 XT with driver: 21.11.3
  • SDK or header version if building from repo: SDK 1.2.198.1
  • Options enabled (synchronization, best practices, etc.): Just the standard_validation layer

Additional context

@jeremyg-lunarg jeremyg-lunarg self-assigned this Dec 6, 2021
@farnoy
Copy link
Contributor Author

farnoy commented Dec 6, 2021

I was able to reproduce it with different queues of the same family, it was just a tricky timing to hit with my app. In the end, I think it can happen with any pair of queues, regardless of their families.

@jeremyg-lunarg
Copy link
Contributor

I was able to reproduce it with different queues of the same family, it was just a tricky timing to hit with my app. In the end, I think it can happen with any pair of queues, regardless of their families.

That's good! It would be really strange if it required separate queue families. I'll try to come up with a test case that reproduces the problem based on your description.

@farnoy
Copy link
Contributor Author

farnoy commented Dec 6, 2021

Yep, I expected that if my hypothesis is true, this should be reproducible within the same queue family, but it was harder to do so in my setup.

By the way, I think the until_payload parameters could be improved by calling vkGetSemaphoreCounterValue to retire submissions until the most recent available right now.

Imagine a timeline semaphore that starts at 1 and is incremented by 2 on every submission (that also waits for the previous one in this sequence). For a Semaphore like this, I think it's legal for the app to use vkWaitForSemaphores with even values to check that the submission signaling x+1 has completed.

So the entire timeline would look like this:

signals: 1  ->  3 ->  5 -> ...
host waits:  2 ->  4 -> ...

From the app's perspective, the value being >= 2 automatically implies that it's >= 3, because it only signals odd numbers. But the state tracker in these validations will only retire submissions that signaled <= 2 when the host wait finishes, which would actually miss the exact submission that signaled it. I don't know of anything in the spec that would make this app code invalid.

The call to vkGetSemaphoreCounterValue only takes 400ns here, but I don't know how practical it is on other implementations.

@jeremyg-lunarg
Copy link
Contributor

jeremyg-lunarg commented Dec 9, 2021

@farnoy, I'm finally getting around to looking at this in detail. I'm not sure if the scenario you're describing in the original report is valid usage. (I'm also not sure if validation is handling it correctly). Let me restate it and try to explain the problem:

  1. A Timeline Semaphore is used as a signal semaphore on queue A, with value = 1.
  2. The same Timeline Semaphore is used as a signal on queue B of a different family, with value = 2
  3. vkWaitSemaphores is used on the host to wait for the first submission to complete (wait for value >=1 ).

The problem is that nothing in the above ensures that the commands in queue A have actually executed. If A is extremely busy, and B is not, the work on A that will signal value =1 could still have not executed yet. But the vkWaitSemaphores() call would return because B set value = 2. Even worse, the implementation would have to have some way to ensure that the timeline value didn't go backwards when A finally executed and tried to set value = 1. This condition is

VUID-VkSubmitInfo-pSignalSemaphores-03242
For each element of pSignalSemaphores created with a VkSemaphoreType of VK_SEMAPHORE_TYPE_TIMELINE the 
corresponding element of VkTimelineSemaphoreSubmitInfo::pSignalSemaphoreValues must have a value greater 
than the current value of the semaphore when the semaphore signal operation is executed

The only way this could work reliably is if B used the semaphore as a wait semaphore with value = 1 BEFORE using it as a signal semaphore with value = 2.

Does that make sense to you? Like I said, I think validation isn't handling this situation correctly, but if it was you should be getting 03242. I'll set up a test case to trigger it, as well as the correct sequence with the wait semaphore within the 2 signals.

@farnoy
Copy link
Contributor Author

farnoy commented Dec 9, 2021

That's my bad, I missed this detail when I wrote down the steps, but yes, I was implicitly assuming there was a dependency between #1 and #2, otherwise it's not a valid usage as you've found. I'm confident my app was also inserting this dependency which is why it validated on the previous SDK & there's no corruption.

After adding the dependency between steps 1 and 2, were you able to reproduce the issue? I think you need a separate gadget, like a VkImage that you transition inside submission #1 and you then destroy it on the host after waiting in step 3. My hypothesis is that the validation layers never retire the submission from step 1, so all resources referenced within are still considered to be in use after step 3.

@jeremyg-lunarg
Copy link
Contributor

After adding the dependency between steps 1 and 2, were you able to reproduce the issue?

Yes I'm able to reproduce this now, with 2 queues and 2 command buffers that do vkCmdCopyCmdBuffer().

jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 16, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 16, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 17, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 17, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 17, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 17, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 17, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 17, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 17, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 17, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 17, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 17, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 20, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 20, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 20, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 20, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 20, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 20, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 20, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 20, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 21, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 21, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 21, 2021
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Dec 21, 2021
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 1, 2022
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 1, 2022
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 1, 2022
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 1, 2022
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 2, 2022
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 2, 2022
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 2, 2022
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.

This change also adds checking for:
VUID-VkAcquireNextImageInfoKHR-semaphore-01781
VUID-vkAcquireNextImageKHR-semaphore-01779
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 2, 2022
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 3, 2022
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 3, 2022
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 3, 2022
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 3, 2022
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 6, 2022
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes KhronosGroup#3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.
jeremyg-lunarg added a commit to jeremyg-lunarg/Vulkan-ValidationLayers that referenced this issue Jan 6, 2022
jeremyg-lunarg added a commit that referenced this issue Jan 6, 2022
Make all non-const member data private and restructure semaphore
tracking to allow it to eventually be made thread safe. The
biggest change is that each semaphore's state now keeps a
queue of pending operations, which removes the need to
frequently walk all pending submissions in every queue.

Fixes #3590, which was introduced when one the 'for every queue'
loops was removed in a previous commit.
jeremyg-lunarg added a commit that referenced this issue Jan 6, 2022
This test is a reproduction case for Issue #3590
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants