
Slow performance on mac #1135

Closed
freesig opened this issue Dec 12, 2018 · 12 comments

@freesig
Contributor

freesig commented Dec 12, 2018

  • Version of vulkano: 0.11.1
  • OS: Macos 10.14.1 Mojave
  • GPU (the selected PhysicalDevice): Intel HD 6000
  • GPU Driver: MoltenVK 1.0.27

To reproduce, call cleanup_finished() on the previous frame's GpuFuture each frame.
This happens in the triangle example.
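
Roughly, this is the shape of the loop I mean (simplified from memory from the triangle example, not copied verbatim; creation of the device, queue, swapchain and per-frame command buffer is omitted):

let mut previous_frame_end = Box::new(sync::now(device.clone())) as Box<GpuFuture>;

loop {
    // This is the call that ends up recursing very deeply on macOS.
    previous_frame_end.cleanup_finished();

    let (image_num, acquire_future) =
        swapchain::acquire_next_image(swapchain.clone(), None).unwrap();

    let future = previous_frame_end.join(acquire_future)
        .then_execute(queue.clone(), command_buffer).unwrap()
        .then_swapchain_present(queue.clone(), swapchain.clone(), image_num)
        .then_signal_fence_and_flush().unwrap();

    // The new future keeps the old one alive, so the chain grows each frame.
    previous_frame_end = Box::new(future) as Box<GpuFuture>;
}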

Issue

Getting very slow performance when rendering a sequence of images with MoltenVK.
It starts rendering very slowly but progressively gets quicker.
I'm not sure if this is a problem with vulkano or MoltenVK.
The same code runs fine on Linux.

However if I change this line:

match fence.wait(Some(Duration::from_secs(0))) {

to

match fence.wait(Some(Duration::from_millis(1))) {

Then it works perfectly, except that MoltenVK spits out a timeout error.
Another option is to change the section to this:

std::thread::sleep(Duration::from_millis(1));
match fence.ready() {
  Ok(true) => unsafe {
    prev.signal_finished()
  },
  Ok(false) => {
    prev.cleanup_finished();
    return;
  },
  Err(_) => {
    prev.cleanup_finished();
    return;
  },
}

Note: I tried going down to from_nanos(1) but then it runs slowly again.
Also, this "fix" doesn't spit out the MoltenVK timeout error.

Profiling
I profiled the original vulkano code with the bug and noticed that _$LT$vulkano..sync..future..fence_signal..FenceSignalFuture$LT$F$GT$$u20$as$u20$vulkano..sync..future..GpuFuture$GT$::cleanup_finished::h4c367c6caff4531e (I believe this is actually cleanup_finished_impl, but it shows as cleanup_finished because of the release optimizations) is calling itself recursively, but the stack is beyond my profiler's limit of 500 calls, so I don't know how deep it is going. It was also using about 97% CPU.

With the above fix it is still recursing but only to ~40 deep. I assume the sleeping is preventing it from going deeper. It also only uses 22% CPU.

MoltenVK
Looking at the implementation of both vkWaitForFences and vkGetFenceStatus, you end up locking the fence's mutex:

bool MVKFence::getIsSignaled() {
	lock_guard<mutex> lock(_lock);
	return _isSignaled;
}

But at the same time the signal needs the lock:

void MVKFence::signal() {
	lock_guard<mutex> lock(_lock);

	if (_isSignaled) { return; }	// Only signal once
	_isSignaled = true;

	// Notify all the fence sitters, and clear them from this instance.
    for (auto& fs : _fenceSitters) {
        fs->fenceSignaled(this);
    }
	_fenceSitters.clear();
}
Perhaps this creates a race to get the lock but I'm not sure.

What I think is happening

At the start of the program there is more setup going on before the fence is signaled, and while the fence isn't signaled vulkano is constantly calling vkWaitForFences but getting back a busy VK_TIMEOUT, then recursively trying again.
This causes very deep call stacks, and once a signaled fence is found it has to return all the way back up this giant call stack.

I'm not sure why this only happens on MoltenVK and not regular Vulkan. Perhaps MoltenVK is just slower, or the underlying Metal doesn't signal fences as efficiently. Or maybe vkWaitForFences with a 0 timeout actually takes a bit of time to return on regular Vulkan and this prevents the deep recursion.
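
As schematic pseudocode, this is roughly how I picture the recursion (just an illustration, not vulkano's actual code):

// Each frame's future holds the previous frame's future, so the chain grows
// by one node per frame. cleanup_finished() walks the chain recursively, and
// while the zero-timeout fence wait keeps returning VK_TIMEOUT nothing gets
// dropped, so the recursion gets deeper every frame.
struct FrameFuture {
    fence_signaled: bool,              // result of the zero-timeout wait
    previous: Option<Box<FrameFuture>>,
}

impl FrameFuture {
    fn cleanup_finished(&mut self) {
        // The profiler shows this as cleanup_finished calling itself.
        if let Some(prev) = self.previous.as_mut() {
            prev.cleanup_finished();
        }
        if self.fence_signaled {
            // Only once the fence is signaled can the rest of the chain go.
            self.previous = None;
        }
    }
}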

I should note that I'm getting this error regardless of the above "fix".

[vulkan] Validation error:  [ VUID-vkResetFences-pFences-01123 ] Object: 0x2c (Type = 7) | Fence 0x2c is in use. The Vulkan spec states: Each element of pFences must not be currently associated with any queue command that has not yet completed execution on that queue (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetFences-pFences-01123)
@mitchmindtree
Contributor

Summary

So to clarify/summarise, it seems the reason for this recursion is that futures associated with previous frames are not being cleaned up properly on macos. The reason for this seems to be related to the call to fence.wait and the given timeout, indicating that perhaps there is some sort of lock contention / synchronisation issue going on in the MoltenVK backend. It is not clear if MoltenVK is to blame or if vulkano is mis-using it. As a result of this issue, the linked list of futures keeps growing infinitely long, resulting in massive recursion when cleanup_finished is called, which I'm guessing would eventually overflow the stack if the program ran long enough.

Some thoughts:

  1. Does this issue occur in the triangle example? When @JoshuaBatty and I originally noticed this the super slow frame rate only became obvious when we started to use image attachments, but we didn't notice it in the triangle example (we didn't actually check). Curiously, while the rate at which the displayed image would update was extremely slow, the rate at which the main loop ran stayed about the same. Is this what you are seeing? Come to think of it, this makes sense w.r.t. the behaviour in the futures too. cleanup_finished would recurse so deeply because the initial frames that we had submitted were taking so long to process and be displayed, thus the nested-previous-frame-futures would never get dropped as the GPU computation associated with them was not finished. It is very strange that increasing that fence.wait timeout seems to completely remedy this.

  2. You mention when you increase the timeout that

    Then it works perfectly. Except that molten spits out a timeout error.

    I think this is actually a good thing - we want a timeout error if that fence isn't ready yet. I think we actually want to be getting that same error when the timeout is set as 0 if the fence isn't immediately ready, but it sounds like from this issue that this isn't the case? What is returned when the timeout is 0?

    Note I tried going down to from_nanos(1) but then it runs slowly again.

    This is really weird. I can understand 0 having an edge case but if the same thing happens with one nanosecond then it starts to sound like a timing/synchronisation bug, possibly related to the locking like you mention.

  3. You mention that you're getting 97% CPU usage - is the CPU usage this high from the beginning or does it slowly increase up to this? Does profiling show that most of this time is being spent in that recursive cleanup_finished call? Or is it because your main loop is running super fast? If you time the rate at which your loop is running, is it running at roughly the refresh rate of your display, i.e. ~60hz?

@mitchmindtree
Contributor

#931 and #990 look like they're potentially related.

@freesig
Contributor Author

freesig commented Dec 12, 2018

it seems the reason for this recursion is that futures associated with previous frames are not being cleaned up properly on macos.

I'm not certain about this; it could be the case, but I think they might be getting cleaned up, just not as fast as we are calling vkWaitForFences. Every time we call that function and get back a timeout we recurse another level. These recursions then take time to climb back up.

Does this issue occur in the triangle example?

Yeh it does happen.

but it sounds like from this issue that this isn't the case? What is returned when the timeout is 0?

To clarify: both calls return VK_TIMEOUT if the fence isn't signaled, but when the timeout is not set to 0, MoltenVK prints its own error message.

This is really weird.

To clarify: if you call vkWaitForFences with a 1 nanosecond timeout it works, but not if you sleep 1 nanosecond before calling vkGetFenceStatus. I think what's happening here is that when you call vkWaitForFences with 0, MoltenVK does pretty much the exact same thing as vkGetFenceStatus; they both return very quickly. But if you call vkWaitForFences with a timeout > 0 then it creates a fence sitter, which would probably use up much more than 1 nanosecond.

This is what makes me think we are just calling the function much quicker than MoltenVK is signalling its fences, and therefore digging down a deep recursion stack.

is the CPU usage this high from the beginning or does it slowly increase up to this? Does profiling show that most of this time is being spent in that recursive cleanup_finished call?

Seems to jump up there very quickly. Yeah, it's all in the cleanup, although I'm not sure my profiler is super accurate due to hitting its recursion limit.

@freesig
Contributor Author

freesig commented Dec 23, 2018

I'm a little stuck on how to come up with a solution for this problem. But some ideas:

  1. We make the assumption that with regular Vulkan, vkWaitForFences with a 0 timeout actually takes a bit of time to return, whereas with MoltenVK it returns almost instantly.
    I'm not sure how to test this assumption (a rough timing sketch is below).
    What I'm trying to see is whether vulkano is getting these runaway call stacks because there is almost no delay in calling vkWaitForFences.
  2. Another possible explanation is that MoltenVK is not signalling fences and they are building up.
    But I don't see why putting a 1 millisecond delay in would solve this.
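
For (1), something like this rough timing sketch is what I have in mind (fence here stands for any vulkano Fence we can get at, e.g. inside fence_signal.rs where the existing fence.wait call is; this isn't existing code):

use std::time::{Duration, Instant};

// Time many zero-timeout waits and compare the average on Linux vs MoltenVK.
let iterations = 10_000u32;
let mut total = Duration::new(0, 0);
for _ in 0..iterations {
    let start = Instant::now();
    let _ = fence.wait(Some(Duration::from_secs(0))); // maps to vkWaitForFences with timeout 0
    total += start.elapsed();
}
println!("average zero-timeout wait: {:?}", total / iterations);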

@freesig
Contributor Author

freesig commented Dec 23, 2018

This is the recursive loop:
fence_signal.rs:173
swapchain.rs:990
command_buffer/traits.rs:208
future/join.rs:73
future/mod.rs:265
fence_signal.rs:334
fence_signal.rs:173

@mitchmindtree
Contributor

@freesig I just noticed that there is a wait method on the FenceSignalFuture returned by .then_signal_fence_and_flush(). It seems by calling wait you can force the CPU to wait for the moment that the GPU finally signals the fence.

I wonder if calling .wait(None) on this each frame rather than casting the future to a Box<GpuFuture> and chaining it with the previous frame would fix the behaviour on macos in the triangle.rs example? This would of course mean that time could be wasted by having to wait for the GPU to present each frame, but it would be a great experiment to know if wait could be used to get around the issue of the CPU outpacing the GPU.

If this does fix the behaviour on macos, perhaps we could change previous_frame_end to store the FenceSignalFuture rather than a Box<GpuFuture> so that we can call .wait(None) on it right before we try and present the current frame. In most cases, we shouldn't have to wait long at all as presenting the previous frame should in theory be finished long before we need to present the current frame. I think it's also not unreasonable for low-latency programs to want to make sure that the previous frame has finished presenting before continuing on with their game loop on the CPU, otherwise it would be easy for presentation to fall behind user input.
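
Roughly what I have in mind for the experiment (sketch only; the usual acquire/command-buffer setup from the triangle example is omitted):

// Instead of boxing the future and chaining it into the next frame, block
// here until the GPU signals the fence for this frame.
let future = acquire_future
    .then_execute(queue.clone(), command_buffer).unwrap()
    .then_swapchain_present(queue.clone(), swapchain.clone(), image_num)
    .then_signal_fence_and_flush().unwrap();

future.wait(None).unwrap();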

@freesig
Contributor Author

freesig commented Jan 2, 2019

Yep, so calling .wait(None) does work. The frames still seem to be running at about 1ms, but the teapot example looks fine with this change and there's no slow down or speed up.

The problem with using FenceSignalFuture as the previous frame is that we can't call .cleanup_finished(), which according to the docs is important to call?

Also it seems like if we don't call .cleanup_finished() then .wait(None) is called in the drop anyway.

@mitchmindtree
Contributor

Ohh hmm, are you sure you can't call it? I think GpuFuture is implemented for FenceSignalFuture so you should still be able to call .cleanup_finished().

Anyway, that's awesome news that FenceSignalFuture::wait is working! At least we don't have to wait for some upstream fix or anything. I think all we should have to do is add something like this .wait(None) patch to each of the examples so that users testing out or copying the examples avoid the confusion. @rukai thoughts?

The frames still seem to be running at about 1ms

Yeah it would be great to have an example that reliably synced with the refresh rate of the display, but I guess we can probably work out how to deal with this in another issue.

@freesig
Contributor Author

freesig commented Jan 2, 2019

Actually, yeah, you can call .cleanup_finished() on the FenceSignalFuture. I can't get it to work though, because it's actually JoinFuture<FenceSignalFuture> but there's nothing to join it with on the first frame. They cast it to Box so they can store both JoinFuture<FenceSignalFuture> and JoinFuture<NowFuture>.
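
For illustration, one hypothetical shape that might sidestep the first-frame problem (just a sketch, untested; the names are made up):

// Keep the concrete FenceSignalFuture in an Option so that frame 0 simply has
// nothing to wait on or clean up. The full generic type is long, which is
// part of why the examples box it instead.
let mut previous_frame: Option<FenceSignalFuture<_>> = None;

loop {
    if let Some(prev) = previous_frame.as_mut() {
        prev.cleanup_finished();   // GpuFuture is implemented for FenceSignalFuture
        prev.wait(None).unwrap();  // or block until the previous frame's fence signals
    }
    // ... acquire the image, build and execute the command buffer, present ...
    previous_frame = Some(new_fence_signal_future);
}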

@freesig
Contributor Author

freesig commented Jan 2, 2019

This is fixed for Nannou, however the vulkano examples still won't work properly on macOS.
We could simply add:

 .then_signal_fence_and_flush();
future.wait(None).unwrap();

to the examples (I think some already have it), and then close this issue.

I do think it's possible that something is wrong with when MoltenVK signals the acquire_next_image() future, because the frames are still completing in 1ms instead of 15ms. It doesn't appear to cause problems though, as there's no longer a large build-up of recursive calls to cleanup_finished.

My understanding of MoltenVK is that it uses a Metal command buffer to submit all the current "presents", and once the GPU executes a present it will signal the swapchain image that it's free. I'm not sure why this is happening faster than the refresh rate. Either it's building up more "presents" in that command buffer than the max number of swapchain images, or some sort of mailbox situation is happening where it is overriding the swapchain image with the latest.

I'm happy to keep digging into this but also happy to close the issue by adding the .wait(None) fix to the examples.

@rukai
Member

rukai commented Jan 4, 2019

Sorry, I've been struggling to keep up with everything on vulkano recently. :(
I've only skim-read this thread.
Whether the examples should have .wait(None) is something that I have investigated before but never been able to come to a conclusion on. If you want to do a PR to add it to the examples, I would want a solid explanation as to why that is the correct thing to do.

@AustinJ235
Member

Please refer to #1247 for more information.
