
Slow performance on mac #1135

Closed
freesig opened this issue Dec 12, 2018 · 12 comments

@freesig
Contributor

freesig commented Dec 12, 2018

  • Version of vulkano: 0.11.1
  • OS: Macos 10.14.1 Mojave
  • GPU (the selected PhysicalDevice): Intel HD 6000
  • GPU Driver: MoltenVK 1.0.27

To reproduce, call cleanup_finished() on the previous frame's GpuFuture each frame.
This happens in the triangle example.
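
Roughly, this is the shape of the loop I mean (simplified from memory from the triangle example, not copied verbatim; creation of the device, queue, swapchain and per-frame command buffer is omitted):

let mut previous_frame_end = Box::new(sync::now(device.clone())) as Box<GpuFuture>;

loop {
    // This is the call that ends up recursing very deeply on macOS.
    previous_frame_end.cleanup_finished();

    let (image_num, acquire_future) =
        swapchain::acquire_next_image(swapchain.clone(), None).unwrap();

    let future = previous_frame_end.join(acquire_future)
        .then_execute(queue.clone(), command_buffer).unwrap()
        .then_swapchain_present(queue.clone(), swapchain.clone(), image_num)
        .then_signal_fence_and_flush().unwrap();

    // The new future keeps the old one alive, so the chain grows each frame.
    previous_frame_end = Box::new(future) as Box<GpuFuture>;
}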

Issue

Getting very slow performance when rendering a sequence of images with MoltenVK.
It starts rendering very slowly but progressively gets quicker.
I'm not sure if this is a problem with vulkano or MoltenVK.
The same code runs fine on Linux.

However if I change this line:

match fence.wait(Some(Duration::from_secs(0))) {

to

match fence.wait(Some(Duration::from_millis(1))) {

Then it works perfectly, except that MoltenVK spits out a timeout error.
Another option is to change the section to this:

std::thread::sleep(Duration::from_millis(1));
match fence.ready() {
  Ok(true) => unsafe {
    prev.signal_finished()
  },
  Ok(false) => {
    prev.cleanup_finished();
    return;
  },
  Err(_) => {
    prev.cleanup_finished();
    return;
  },
}

Note: I tried going down to from_nanos(1) but then it runs slowly again.
Also, this "fix" doesn't spit out the MoltenVK timeout error.

Profiling
I profiled the original vulkano code with the bug and noticed that _$LT$vulkano..sync..future..fence_signal..FenceSignalFuture$LT$F$GT$$u20$as$u20$vulkano..sync..future..GpuFuture$GT$::cleanup_finished::h4c367c6caff4531e (I believe this is actually cleanup_finished_impl, but it shows as cleanup_finished because of the release optimizations) is calling itself recursively, but the stack is beyond my profiler's limit of 500 calls, so I don't know how deep it is going. It was also using about 97% CPU.

With the above fix it is still recursing but only to ~40 deep. I assume the sleeping is preventing it from going deeper. It also only uses 22% CPU.

MoltenVK
Looking at the implementation of both vkWaitForFences and vkGetFenceStatus, you end up locking the fence's mutex:

bool MVKFence::getIsSignaled() {
	lock_guard<mutex> lock(_lock);
	return _isSignaled;
}

But at the same time the signal needs the lock:

void MVKFence::signal() {
	lock_guard<mutex> lock(_lock);

	if (_isSignaled) { return; }	// Only signal once
	_isSignaled = true;

	// Notify all the fence sitters, and clear them from this instance.
    for (auto& fs : _fenceSitters) {
        fs->fenceSignaled(this);
    }
	_fenceSitters.clear();
}
Perhaps this creates a race to get the lock but I'm not sure.

What I think is happening

At the start of the program there is more setup going on before the fence is signaled, and while the fence isn't signaled vulkano is constantly calling vkWaitForFences but getting back a busy VK_TIMEOUT, then recursively trying again.
This causes very deep call stacks, and once a signaled fence is found it has to return all the way back up this giant call stack.

I'm not sure why this only happens on MoltenVK and not regular Vulkan. Perhaps MoltenVK is just slower, or the underlying Metal doesn't signal fences as efficiently. Or maybe vkWaitForFences with a 0 timeout actually takes a bit of time to return on regular Vulkan and this prevents the deep recursion.
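
As schematic pseudocode, this is roughly how I picture the recursion (just an illustration, not vulkano's actual code):

// Each frame's future holds the previous frame's future, so the chain grows
// by one node per frame. cleanup_finished() walks the chain recursively, and
// while the zero-timeout fence wait keeps returning VK_TIMEOUT nothing gets
// dropped, so the recursion gets deeper every frame.
struct FrameFuture {
    fence_signaled: bool,              // result of the zero-timeout wait
    previous: Option<Box<FrameFuture>>,
}

impl FrameFuture {
    fn cleanup_finished(&mut self) {
        // The profiler shows this as cleanup_finished calling itself.
        if let Some(prev) = self.previous.as_mut() {
            prev.cleanup_finished();
        }
        if self.fence_signaled {
            // Only once the fence is signaled can the rest of the chain go.
            self.previous = None;
        }
    }
}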

I should note that I'm getting this error regardless of the above "fix".

[vulkan] Validation error:  [ VUID-vkResetFences-pFences-01123 ] Object: 0x2c (Type = 7) | Fence 0x2c is in use. The Vulkan spec states: Each element of pFences must not be currently associated with any queue command that has not yet completed execution on that queue (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetFences-pFences-01123)
@mitchmindtree
Contributor

Summary

So to clarify/summarise, it seems the reason for this recursion is that futures associated with previous frames are not being cleaned up properly on macos. The reason for this seems to be related to the call to fence.wait and the given timeout, indicating that perhaps there is some sort of lock contention / synchronisation issue going on in the MoltenVK backend. It is not clear if MoltenVK is to blame or if vulkano is mis-using it. As a result of this issue, the linked list of futures keeps growing infinitely long, resulting in massive recursion when cleanup_finished is called, which I'm guessing would eventually overflow the stack if the program ran long enough.

Some thoughts:

  1. Does this issue occur in the triangle example? When @JoshuaBatty and I originally noticed this the super slow frame rate only became obvious when we started to use image attachments, but we didn't notice it in the triangle example (we didn't actually check). Curiously, while the rate at which the displayed image would update was extremely slow, the rate at which the main loop ran stayed about the same. Is this what you are seeing? Come to think of it, this makes sense w.r.t. the behaviour in the futures too. cleanup_finished would recurse so deeply because the initial frames that we had submitted were taking so long to process and be displayed, thus the nested-previous-frame-futures would never get dropped as the GPU computation associated with them was not finished. It is very strange that increasing that fence.wait timeout seems to completely remedy this.

  2. You mention when you increase the timeout that

    Then it works perfectly. Except that molten spits out a timeout error.

    I think this is actually a good thing - we want a timeout error if that fence isn't ready yet. I think we actually want to be getting that same error when the timeout is set as 0 if the fence isn't immediately ready, but it sounds like from this issue that this isn't the case? What is returned when the timeout is 0?

    Note I tried going down to from_nanos(1) but then it runs slowly again.

    This is really weird. I can understand 0 having an edge case but if the same thing happens with one nanosecond then it starts to sound like a timing/synchronisation bug, possibly related to the locking like you mention.

  3. You mention that you're getting 97% CPU usage - is the CPU usage this high from the beginning or does it slowly increase up to this? Does profiling show that most of this time is being spent in that recursive cleanup_finished call? Or is it because your main loop is running super fast? If you time the rate at which your loop is running, is it running at roughly the refresh rate of your display, i.e. ~60hz?

@mitchmindtree
Contributor

#931 and #990 look like they're potentially related.

@freesig
Contributor Author

freesig commented Dec 12, 2018

it seems the reason for this recursion is that futures associated with previous frames are not being cleaned up properly on macos.

I'm not certain about this; it could be the case, but I think they might be getting cleaned up, just not as fast as we are calling vkWaitForFences. Every time we call that function and get back a timeout we recurse another level. These recursions then take time to climb back up.

Does this issue occur in the triangle example?

Yeh it does happen.

but it sounds like from this issue that this isn't the case? What is returned when the timeout is 0?

To clarify: both calls return VK_TIMEOUT if the fence isn't signaled, but when the timeout is not set to 0, MoltenVK prints its own error message.

This is really weird.

To clarify: if you call vkWaitForFences with a 1 nanosecond timeout it works, but not if you sleep 1 nanosecond before calling vkGetFenceStatus. I think what's happening here is that when you call vkWaitForFences with 0, MoltenVK does pretty much the exact same thing as vkGetFenceStatus; they both return very quickly. But if you call vkWaitForFences with a timeout > 0 then it creates a fence sitter, which would probably use up much more than 1 nanosecond.

This is what makes me think we are just calling the function much quicker than MoltenVK is signalling its fences, and therefore digging down a deep recursion stack.

is the CPU usage this high from the beginning or does it slowly increase up to this? Does profiling show that most of this time is being spent in that recursive cleanup_finished call?

Seems to jump up there very quickly. Yeah, it's all in the cleanup, although I'm not sure my profiler is super accurate due to hitting its recursion limit.

@freesig
Contributor Author

freesig commented Dec 23, 2018

I'm a little stuck on how to come up with a solution for this problem. But some ideas:

  1. We make the assumption that with regular Vulkan, vkWaitForFences with a 0 timeout actually takes a bit of time to return, whereas with MoltenVK it returns almost instantly.
    I'm not sure how to test this assumption (a rough timing sketch is below).
    What I'm trying to see is whether vulkano is getting these runaway call stacks because there is almost no delay in calling vkWaitForFences.
  2. Another possible explanation is that MoltenVK is not signalling fences and they are building up.
    But I don't see why putting a 1 millisecond delay in would solve this.
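
For (1), something like this rough timing sketch is what I have in mind (fence here stands for any vulkano Fence we can get at, e.g. inside fence_signal.rs where the existing fence.wait call is; this isn't existing code):

use std::time::{Duration, Instant};

// Time many zero-timeout waits and compare the average on Linux vs MoltenVK.
let iterations = 10_000u32;
let mut total = Duration::new(0, 0);
for _ in 0..iterations {
    let start = Instant::now();
    let _ = fence.wait(Some(Duration::from_secs(0))); // maps to vkWaitForFences with timeout 0
    total += start.elapsed();
}
println!("average zero-timeout wait: {:?}", total / iterations);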

@freesig
Contributor Author

freesig commented Dec 23, 2018

This is the recursive loop:
fence_signal.rs:173
swapchain.rs:990
command_buffer/traits.rs:208
future/join.rs:73
future/mod.rs:265
fence_signal.rs:334
fence_signal.rs:173

@mitchmindtree
Contributor

@freesig I just noticed that there is a wait method on the FenceSignalFuture returned by .then_signal_fence_and_flush(). It seems by calling wait you can force the CPU to wait for the moment that the GPU finally signals the fence.

I wonder if calling .wait(None) on this each frame rather than casting the future to a Box<GpuFuture> and chaining it with the previous frame would fix the behaviour on macos in the triangle.rs example? This would of course mean that time could be wasted by having to wait for the GPU to present each frame, but it would be a great experiment to know if wait could be used to get around the issue of the CPU outpacing the GPU.

If this does fix the behaviour on macos, perhaps we could change previous_frame_end to store the FenceSignalFuture rather than a Box<GpuFuture> so that we can call .wait(None) on it right before we try and present the current frame. In most cases, we shouldn't have to wait long at all as presenting the previous frame should in theory be finished long before we need to present the current frame. I think it's also not unreasonable for low-latency programs to want to make sure that the previous frame has finished presenting before continuing on with their game loop on the CPU, otherwise it would be easy for presentation to fall behind user input.
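
Roughly what I have in mind for the experiment (sketch only; the usual acquire/command-buffer setup from the triangle example is omitted):

// Instead of boxing the future and chaining it into the next frame, block
// here until the GPU signals the fence for this frame.
let future = acquire_future
    .then_execute(queue.clone(), command_buffer).unwrap()
    .then_swapchain_present(queue.clone(), swapchain.clone(), image_num)
    .then_signal_fence_and_flush().unwrap();

future.wait(None).unwrap();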

@freesig
Contributor Author

freesig commented Jan 2, 2019

Yep, so calling .wait(None) does work. The frames still seem to be running at about 1ms, but the teapot example looks fine with this change and there's no slow down or speed up.

The problem with using FenceSignalFuture as the previous frame is that we can't call .cleanup_finished(), which according to the docs is important to call?

Also it seems like if we don't call .cleanup_finished() then .wait(None) is called in the drop anyway.

@mitchmindtree
Contributor

Ohh hmm, are you sure you can't call it? I think GpuFuture is implemented for FenceSignalFuture so you should still be able to call .cleanup_finished().

Anyway, that's awesome news that FenceSignalFuture::wait is working! At least we don't have to wait for some upstream fix or anything. I think all we should have to do is add something like this .wait(None) patch to each of the examples so that users testing out or copying the examples avoid the confusion. @rukai thoughts?

The frames still seem to be running at about 1ms

Yeah it would be great to have an example that reliably synced with the refresh rate of the display, but I guess we can probably work out how to deal with this in another issue.

@freesig
Contributor Author

freesig commented Jan 2, 2019

Actually, yeah, you can call .cleanup_finished() on the FenceSignalFuture. I can't get it to work though, because it's actually JoinFuture<FenceSignalFuture> but there's nothing to join it with on the first frame. They cast it to Box so they can store both JoinFuture<FenceSignalFuture> and JoinFuture<NowFuture>.
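
For illustration, one hypothetical shape that might sidestep the first-frame problem (just a sketch, untested; the names are made up):

// Keep the concrete FenceSignalFuture in an Option so that frame 0 simply has
// nothing to wait on or clean up. The full generic type is long, which is
// part of why the examples box it instead.
let mut previous_frame: Option<FenceSignalFuture<_>> = None;

loop {
    if let Some(prev) = previous_frame.as_mut() {
        prev.cleanup_finished();   // GpuFuture is implemented for FenceSignalFuture
        prev.wait(None).unwrap();  // or block until the previous frame's fence signals
    }
    // ... acquire the image, build and execute the command buffer, present ...
    previous_frame = Some(new_fence_signal_future);
}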

@freesig
Contributor Author

freesig commented Jan 2, 2019

This is fixed for Nannou, however the vulkano examples still won't work properly on macOS.
We could simply add:

 .then_signal_fence_and_flush();
future.wait(None).unwrap();

to the examples (I think some already have it), and then close this issue.

I do think it's possible that something is wrong with when MoltenVK signals the acquire_next_image() future, because the frames are still completing in 1ms instead of 15ms. It doesn't appear to cause problems though, as there's no longer a large build-up of recursive calls to cleanup_finished.

My understanding of MoltenVK is that it uses a Metal command buffer to submit all the current "presents", and once the GPU executes a present it will signal the swapchain image that it's free. I'm not sure why this is happening faster than the refresh rate. Either it's building up more "presents" in that command buffer than the max number of swapchain images, or some sort of mailbox situation is happening where it is overriding the swapchain image with the latest.

I'm happy to keep digging into this but also happy to close the issue by adding the .wait(None) fix to the examples.

@rukai
Member

rukai commented Jan 4, 2019

Sorry, I've been struggling to keep up with everything on vulkano recently. :(
I've only skim-read this thread.
Whether the examples should have .wait(None) is something that I have investigated before but never been able to come to a conclusion on. If you want to do a PR to add it to the examples, I would want a solid explanation as to why that is the correct thing to do.

@AustinJ235
Member

Please refer to #1247 for more information.
