Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix increased audio latency and increased pitch after fast-forwarding ends #12038

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

nyanpasu64
Copy link
Contributor

@nyanpasu64 nyanpasu64 commented Jul 9, 2023

When fast-forwarding emulation, the CPU thread fills the audio buffers with sound faster (Mixer::MixerFifo::PushSamples()) than sound gets played back (Mixer::MixerFifo::Mix()). When the buffer fills up PushSamples() drops incoming samples until enough samples get read to make room for pushing a new buffer.

When fast-forwarding finishes, the excessive buffer contents were never cleared. This caused audio to remain playing at an elevated pitch for 10 seconds (assuming a 48 kHz system audio rate), slowly dropping to normal over ~10 more seconds.

In this PR, I changed Mixer::MixerFifo::Mix() to detect when fast-forwarding ends, and reset the read pointer to (low_watermark = target latency) (audio frames = samples not * channels) before the write pointer, and reset the current and smoothed latency values (used to calculate resampling ratio) to match. This causes a pop in the audio, but it's fine because during fast-forward the audio is already garbled. (The modified code should never be run when audio stretching is enabled and consider_framelimit=false.)

I also had to change CoreTimingManager::Throttle() to properly throttle emulation immediately when you released the fast-forward key, so it wouldn't generate more audio after the mixer jumped ahead to catch up. Previously m_throttle_deadline was not being updated during fast-forward, causing the emulator to try to render as many frames as possible in zero seconds. And when you released fast-forward, the emulator would continue to fast-forward for an emulated time equal to the physical time you held the key, waiting for m_throttle_deadline to catch up to the physical time. This was only limited when the "System can not to keep up with timings!" fallback would keep the target timebase 100ms before the current time, so the emulator would never fast-forward more than 100ms emulated time after you released Tab.

After fast-forwarding finishes, audio resumes playing at the perfect latency (offset is at ±single digits as opposed to 200 previously). When combined with #12035, with vsync enabled, audio may drop in pitch for 10 seconds or so after fast-forward ends, because emulation is being slowed down (generating less audio) because it's generating frames too soon relative to vsync.

Tested on Linux X11.

Concerns

  • This does not fix audio going flat when the game starts, or going sharp if the game pauses briefly and afterwards CoreTiming fast-forwards to compensate.
    • It may be a good idea to play silence until the buffer fills up to our latency target (then reset the resampling ratio to default), or reset buffer size after discontinuities are resolved and CoreTiming stabilizes. I'd have to perform further testing before I come up with any concrete ideas on how to better improve the situation (and I personally think it's a bad idea for CoreTiming to fast-forward games after a slowdown, but that's another topic altogether).
    • IMO that requires heuristics to recognize and process reliably, and may create a second pop if it happens in gameplay rather than startup. It's worth special-casing fast-forward (and merging this PR) because it's more common than game startup and can be solved reliably without heuristics, regardless of whether we make further changes or not because this is good enough. (Also the other commits in the PR are good changes even if we later revert the fast-forward fix.)
  • Should I keep ASSERT(numLeft == static_cast<float>(((indexW - indexR) & INDEX_MASK) / 2))?
  • Should I squash commits?
  • I only resynchronize m_dma_mixer (the main game audio synthesized in real time) when fast-forward ends. Ideally you'd resynchronize all of them to gameplay.

Ideally all mixers would share a locked resampling offset to prevent them from going out of sync with each other, which could be a problem in DTK-based games with sound effects or user inputs synchronized with audio. But in practice, since the mixers are generated and consumed in sync, I'd hope it's unlikely for them to desynchronize in normal gameplay (and this PR prevents them from desyncing after fast-forwarding). And since the mixers inherently run at different rates and have different durations (in real time), I don't think there's a good way to keep them synced once they buffers overflow or underflow.

  • I did not remove my debug logging and "remove debug logging" commits from this PR. They are useful during development to see whether audio is being pitch-shifted or not (without having to second-guess subtle changes in your pitch perception), but pollute the Git history. Should I do that before it's merged?
  • Given a sufficiently brief fast-forward, it's theoretically possible for Mixer::PushSamples to miss Core::GetIsThrottlerTempDisabled() = true even though CoreTiming observed it. This is unlikely to occur in practice, since AudioCommon::SendAIBuffer() is supposed to execute at 4khz (0.25ms) while HotkeyScheduler::Run() updates hotkey state every 5ms.
    • Do we need a fast-forward generation counter to check if we've fast-forwarded at all since the last read? Or is this failure mode too unlikely/unimportant to bother complicating the code?
  • The existing code updates s_is_throttler_temp_disabled on HotkeyScheduler but reads it on the CPU and Video threads. This is a data race. My code adds a new read of this flag on the CPU thread.
    • Perhaps it should be a relaxed atomic? At which point we can read s_is_throttler_temp_disabled directly in Mixer::MixerFifo::Mix(), avoiding the m_resyncReader song and dance altogether? This also has the positive side effect of making all active mixers resynchronize themselves automatically, whereas right now I only tell m_dma_mixer to EndFastForward().
    • Though this will reset the buffer reader's latency earlier on, before Mixer::PushSamples observes the change. I'll have to check whether this resets the buffer before PushSamples() finishes running too quickly (EDIT: it's ok). (Though even my current code doesn't wait for CoreTimingManager to see the change, and it still works well in practice.)
    • Screw it, move the check to Mixer::Mix so we only read the value once, so all mixers either reset together simultaneously or don't reset.
  • (outdated) I rolled my own atomics instead of using Common::Flag because I'm unfamiliar with using Flag, and I think my code is more efficient than the current implementation. If necessary I could replace it with the existing flag code (it'll probably not be noticeably slower, considering we don't resample audio very often).
    • Really in this case, we don't need memory orderings at all since we're just using the flags to trigger events at the right time, not for memory safety or logical correctness.

Speeding up atomic flags

I'm actually not sure if you can change Flag while keeping it 100% compatible with the current behavior (unlimited number of threads, sequential consistency, etc.). I have some ideas on how it can be more optimized around common use cases (two threads only, so I'm not sure it can safely replace all use cases). It would be interesting to benchmark to see if any change in FPS is detectable, but I don't know how to benchmark Dolphin.

  • In common cases (where you set an atomic flag after finishing a task, and read it to check if the task is complete, such as mutex/semaphore-like patterns), you only need acquire and release orderings, not sequential consistency (which is slower). I cannot make a blanket statement that replacing all current uses of Flag with acquire/release orderings is correct.
  • In TestAndClear/Set(), you can probably perform an acquire read, and only compare_exchange_strong (and return whether the value was flipped) if the value is different from the target.
  • If you only need an acquire barrier if the value was changed, you can replace the initial acquire read with a relaxed read.
  • If one thread runs TestAndClear(), and all other threads only set the value to true, you can replace compare_exchange_strong with an acquire fence followed by a write (most likely a relaxed write, you can also use release, though I think other threads generally don't care what you were doing before you tested/set the flag).
    • Note that this can cause multiple threads calling TestAndClear() and perhaps Clear() to both erroneously indicate they cleared the flag.

@JMC47
Copy link
Contributor

JMC47 commented Jul 12, 2023

I'm not a coder, but all the debug commits that are reverted and their reverts should be squashed out. The commits that actually do different things and help with making it easier to review should stay.

@nyanpasu64 nyanpasu64 force-pushed the fix-ffw-audio-latency branch from 143bee7 to fe41898 Compare July 12, 2023 06:19
@nyanpasu64
Copy link
Contributor Author

I've removed the debugging commits, and squashed away the previous versions of my code which used a different atomic structure (keeping it would pollute git logs with WIP code I later replaced). For posterity I've pushed a tag to the full history in my fork (https://github.com/nyanpasu64/dolphin/tree/fix-ffw-audio-latency-4).

@JMC47
Copy link
Contributor

JMC47 commented Jul 12, 2023

This looks a lot easier to review at a glance. I think Filoppi said they'd review it next week, I'll defer any comments until after that.

Copy link
Contributor

@Filoppi Filoppi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, I left some minor comments.

This should be tested on Windows and maybe a video showcasing the old vs new behaviour should be recorded, so we can all judge whether the new behaviour is better.
I worked on audio stretching a lot in the past, and this PR doesn't really affect that, but I do indeed remember there was some audio problems when releasing the fast forward button if stretching wasn't enabled.
I do think dropping samples is a good solution over here, it's likely better than keeping a slightly increased audio speed for ~10 seconds, though the best solution might be to quickly play back all the extra samples within 0.2 or so, which IIRC is what I implemented in my WIP PR #8983 (I'm planning to get back at it in the next couple months).

I'm also not sure about the s_is_throttler_temp_disabled atomic change, but I doubt it will damage anything.

@@ -88,6 +87,14 @@ unsigned int Mixer::MixerFifo::Mix(short* samples, unsigned int numSamples,
(static_cast<u64>(m_input_sample_rate_divisor) * 1000);
low_watermark = std::min(low_watermark, MAX_SAMPLES / 2);

// Throw away extra audio when the writer exits fast forward.
if (should_resync && numLeft > low_watermark)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this branch should be skipped if auto stretching is eanbled? We never really want to throw away samples in that case, we prioritize playing all of them, even if they play at a different "speed".

EDIT: nevermind, this is already skipped is audio stretching is enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename consider_framelimit to something else? Though I don't have a better name, since I don't understand the fixed-pitch time-stretching path (and TBH I've forgotten the details of the buffer-resampling path as well).

@@ -371,6 +371,11 @@ void CoreTimingManager::Throttle(const s64 target_cycle)
const TimePoint min_deadline = time - max_fallback;
const TimePoint max_deadline = time + max_fallback;

if (speed <= 0.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of explained in the PR description already, but could you further explain what it is for?
Also, should we skip this branch if audio stretching is enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously m_throttle_deadline was not being updated during fast-forward, causing the emulator to try to render as many frames as possible in zero seconds. And when you released fast-forward, the emulator would continue to fast-forward for an emulated time equal to the physical time you held the key, waiting for m_throttle_deadline to catch up to the physical time. This was only limited when the "System can not to keep up with timings!" fallback would keep the target timebase 100ms before the current time, so the emulator would never fast-forward more than 100ms emulated time after you released Tab.

In my opinion, the previous behavior of fast-forwarding the emulated game for up to 100ms of emulated time after releasing the fast-forward key, was a bug. And you should stop fast-forwarding (begin throttling emulation using the current host time) as soon as you release the fast-forward key, with or without audio stretching enabled. This change accomplishes it.


Though another route to take would be to neither write nor read m_throttle_deadline while fast-forwarding, and when transitioning from fast-forwarding to regular speed, assign m_throttle_deadline = time (either before or after m_throttle_deadline += cycles...) to begin throttling from that time (my current PR's code sleeps until the time of the final fast-forward plus the cycles of the current non-fast-forward emulation block). These approaches will have subtly different timing characteristics when ending fast-forward.

  • It seems CoreTimingManager::Advance() calls throttle before running a computation at a given emulated/real time? I'm still not sure what the Advance() doc comment and PowerPC downcount mean.

Reassigning m_throttle_deadline after ending fast-forward, requires adding a variable to cache whether fast-forward was enabled or not on the previous Throttle call. I'm not sure if the result has more or less "correct" timing behavior, but realistically there's no difference if we advance timing thousands of times per second.

Adding a "was fast forwarding" variable adds code complexity, but the same variable can be used for "making CoreTimingManager inform the mixer when CoreTimingManager itself stops fast-forwarding" (another proposal).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downcount is a variable the JIT uses to keep track of how many emulated clock cycles are left until Advance needs to be called again. For instance, let's say that Advance is called and it decides that 2000 cycles are left until the next scheduled event. It will then set downcount to 2000 and return control to the JIT. As the JIT executes code, it will decrease the downcount variable little by little, and between each JIT block, if it notices the downcount is 0 or negative, it will call Advance again.

{
numLeft = m_numLeftI = low_watermark;
indexR = (indexW - 2 * low_watermark) & INDEX_MASK;
ASSERT(numLeft == static_cast<float>(((indexW - indexR) & INDEX_MASK) / 2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this verifying? A small comment would help in case this assert actually triggered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally the code computes numLeft = ((indexW - indexR) & INDEX_MASK) / 2. When ending fast-forward and throwing away all but low_watermark frames of audio, we assign numLeft = low_watermark then rewrite indexR to low_watermark frames (or 2 * that samples) before the end of available audio. This assertion is just to check that I got my math right, and the indexR I write has the correct number of samples left; I copied the formula from the initial point we calculate float numLeft. (I'd hope that 2 * low_watermark is always less than INDEX_MASK, if not the current code would already break?)

Should I keep this assert or not? I'm not sure how to comment it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep it, and add something like "not enough samples to continue"

m_streaming_mixer.Mix(samples, num_samples, true, emulation_speed, timing_variance);
m_wiimote_speaker_mixer.Mix(samples, num_samples, true, emulation_speed, timing_variance);
m_skylander_portal_mixer.Mix(samples, num_samples, true, emulation_speed, timing_variance);
const bool is_ffw = Core::GetIsThrottlerTempDisabled();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, variables shouldn't use acronyms. I might be wrong but I think it would be better. ffw is not clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, directly reading Core::GetIsThrottlerTempDisabled() here is likely fine, but maybe it could be passed in in a more thread friendly way? No big deal anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually determined when to reset audio on the main thread (though not in CoreTiming which actually handles fast-forwarding), in an earlier version of this code (nyanpasu64@61e7b78).

The existing code updates s_is_throttler_temp_disabled on HotkeyScheduler but reads it (with GetIsThrottlerTempDisabled()) on the CPU and Video threads.

I replaced s_is_throttler_temp_disabled with an atomic to make it safe to access on multiple threads (which is already happening, with or without my PR merged). But perhaps it's still more reliable for the main thread to set a flag (read by the mixer thread) when the main thread starts and stops fast-forwarding.

  • You'll most reliably detect when fast-forwarding starts and stops by making CoreTimingManager inform the mixer when CoreTimingManager itself stops fast-forwarding. But this is a nasty entanglement of concerns (CoreTimingManager shouldn't care about audio mixer resyncing?). Should I implement this change or not?
  • How does other code read s_is_throttler_temp_disabled? Right now, VideoConfig.cpp#IsVSyncActive() calls GetIsThrottlerTempDisabled() rather than checking whether CoreTimingManager has started or stopped fast-forwarding. Should we disable vsync for a frame if the user starts and stops fast-forwarding within one frame, causing a sub-frame shift in emulation phase relative to vsync? This is somewhat theoretical and unrelated to my code changes. I still hope that something like Add closed-loop latency control for Vulkan vsync #12035 can get merged.

@nyanpasu64
Copy link
Contributor Author

This should be tested on Windows

I'll do that when I have a chance.

maybe a video showcasing the old vs new behaviour should be recorded, so we can all judge whether the new behaviour is better.

I could record two videos, though editing them into one file will be tricky to figure out and involve a lot of frantic ChatGPT "how to join videos in ffmpeg", and might be harder to compare? (not sure)

I do think dropping samples is a good solution over here, it's likely better than keeping a slightly increased audio speed for ~10 seconds, though the best solution might be to quickly play back all the extra samples within 0.2 or so, which IIRC is what I implemented in my WIP PR #8983 (I'm planning to get back at it in the next couple months).

Playing back samples faster will result in an increase in audio pitch. I believe that garbled/dropped audio during fast forward, followed by increased pitch after releasing fast-forward, will be more jarring to the user than this PR (which produces garbled/dropped audio during and after fast-forward). I do wonder if we should enable time-squeezing (time-stretching but compressing) during fast forwarding.

#8983:

Improved audio quality by changing from linear to cubic interpolation, I was able to tell the difference in multiple tests.

something something mdfourier wii (I still haven't gotten around to it :(

When missing samples, in some occasions Dolphin now plays the last ~0.3ms of samples backwards, which are generally better than crackling and silence you would get during a larger stutter.

clever.

Mixer latency is now adaptive, meaning it has a range, from 0 to 40ms it is accepted, when it goes above 40ms it will try to go back to 20ms. This increases the quality overall as before the audio speed was constantly fluctuating around the target latency, without ever reaching it. Setting the max latency to 20-25ms is very doable if your emulation is fast and stable. You can customize the setting from the ini. It might be particularly useful in musical games.

Is this necessary? I don't know. With the current audio mixer on Linux, outside of game startup or Dolphin lagging I didn't observe any noticeable pitch change, since numLeft jitters by around 20 (audio frames) from read to read, which should result in an inaudible ±4 Hz of pitch shift (CONTROL_FACTOR = 0.2f), reduced further by CONTROL_AVG smoothing. I didn't test if it's worse on Windows. Do you want to avoid pitch changing when Dolphin lags as well?

I didn't review your code. But do you handle homebrew which outputs DSP audio at 48khz (the official SDK ran at 32khz)?


I'm also not sure about the s_is_throttler_temp_disabled atomic change, but I doubt it will damage anything.

#12038 (comment)

The existing code updates s_is_throttler_temp_disabled on HotkeyScheduler but reads it (with GetIsThrottlerTempDisabled()) on the CPU and Video threads.

I replaced s_is_throttler_temp_disabled with an atomic to make it safe to access on multiple threads (which is already happening, with or without my PR merged).

@Filoppi
Copy link
Contributor

Filoppi commented Jul 25, 2023

You don't need to make a single movie file recorded from dolphin. Just record your desktop or something.

I'm not sure what you mean by time squeezing (stretching?), but I don't think it's necessary if the user didn't enable audio stretching.

I think the adaptive mixer latency was helpful yes, though we can further discuss that in the other PR.

Previously `m_throttle_deadline` was not being updated during
fast-forward, causing the emulator to try to render as many frames as
possible in zero seconds. And when you released fast-forward, the
emulator would continue to fast-forward for an emulated time equal to
the physical time you held the key, waiting for `m_throttle_deadline` to
catch up to the physical time. This was only limited when the "System
can not to keep up with timings!" fallback would keep the target
timebase 100ms before the current time, so the emulator would never
fast-forward more than 100ms emulated time after you released Tab.

Instead update `m_throttle_deadline` to track wall-clock time, so
emulation resumes throttling as usual once fast-forward ends.
Fixes excessive audio being buffered and audio going sharp after
fast-forwarding.
@nyanpasu64 nyanpasu64 force-pushed the fix-ffw-audio-latency branch from fe41898 to dde6a61 Compare July 27, 2023 23:14
@nyanpasu64
Copy link
Contributor Author

I'd keep it, and add something like "not enough samples to continue"

I've commented the audio skipping code and assert (though differently from your suggestion). Is that sufficient?

This does not fix audio going flat when the game starts, or going sharp if the game pauses briefly and afterwards CoreTiming fast-forwards to compensate.

Oddly on Windows exclusive WASAPI and Cubeb, audio goes sharp when the game (specifically Super Monkey Ball 2) starts, whereas on Linux (IIRC) PulseAudio it went flat in my earlier testing. I don't know if that's because I had tested on a slightly older base commit in master.

I often get lag during an in-game loading screen that reads data from the ISO, because I put my games on a secondary hard drive which apparently spins down during inactivity. Afterwards the audio often goes sharp once Dolphin fast-forwards to catch up.

reset buffer size after discontinuities are resolved and CoreTiming stabilizes.

If we did so, after Dolphin lags and then speeds up to catch up, it will skip audio once. I'm not sure if this is more or less noticeable than a period of increased audio latency, accompanied by slightly sharp sound (depending on the length of the lag spike, either briefly or up to ~8 seconds then slowly returning to normal). I'd say skipping audio is arguably more distracting to the user, unless you only do it upon silence. Honestly I think the best way to prevent audio from piling up is to reduce how long CoreTiming fast-forwards games after a lag spike, but that's a topic for another day.

I verified the PR works on Windows as well. I've made some video recordings with resampling ratio debug prints enabled:

The pitch shift is slightly perceptible. When playing Mario Sunshine, the increased audio latency after fast-forwarding was more obvious:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants