af_scaletempo2: fix audio-video de-sync caused by speed changes #12052
I've been using this patch since I posted it and found no issues. Here are some additional test results I got: I compared a/v sync of the three tempo filters by switching back and forth between any pairs of
The results were consistent at every speed I tried, whether 1x, 1.01x, 2x, or 0.5x. Lastly, after running If anyone has more ideas what needs to be tested I would be glad to try. |
works as described
Though if you look at the audio packet arriving in player/audio.c:ao_process |
I can reproduce the overlap/gaps and change of overlaps. I see jitter with scaletempo(1) and rubberband too, but it's much smaller and more consistent. Here are my test results:
It's fine at 1x, where the scaletempo2 is in pass-through mode and keeps buffers consistently filled, so there's no jitter. I suspect there's something wrong with my assumption with the scaletempo2 input buffer. If I remove the |
Fixes mpv-player#12028

There was an additional issue that audio was always delayed by half the configured search-interval. This was caused by the `out` buffer length not being included in the delay calculation.

Notes:
- Every WSOLA iteration advances the input buffer by _some amount_, and produces data in the output buffer always of size `ola_hop_size`.
- `mp_scaletempo2_fill_buffer` is always called with `ola_hop_size`.
- Thus, the rendered frames are always cleared immediately after processing, and `num_complete_frames` is 0 in the delay calculation.
- The factors contributing to delay are:
  - the pending samples in the input buffer according to the search block position, and
  - the pending rendered samples in the output buffer (always empty in practice).

The frame_delay code looked like that of the rubberband filter, which might not work for scaletempo2. Sometimes a different amount of input audio was consumed by scaletempo2 than expected. It may have been caused by speed changes being a more dynamic process in scaletempo2. This can be seen by where `playback_rate` is used in `run_one_wsola_iteration`: `playback_rate` is only referenced after the iteration, when updating the time and removing old data from buffers.

In scaletempo2, the playback speed is applied by changing the amount the search block is moved. That apparently averages out correctly at constant playback speed, but when the speed changes, the error in this assumption probably spikes. This error accumulated across all speed changes because of the persistent `frame_delay` value.

With the removal of the persistent `frame_delay`, there should be no way for the audio to drift off. Because the delay is derived from filter buffer positions, and the buffers are filled only as much as needed, the delay always stays within buffer bounds.
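To make the two delay factors from the notes concrete, here is a minimal sketch of a stateless delay calculation. It is not the code from this commit: the struct, the function name, and the choice to scale rendered output frames by `playback_rate` (to express them in input time) are assumptions for illustration only.

```c
/* Hypothetical, simplified view of the filter state. The field names follow
 * identifiers mentioned in the commit message, but this is not the mpv struct. */
struct tempo_state {
    int input_buffer_frames;  /* frames currently held in the input buffer */
    int search_block_index;   /* start of the search block in the input buffer */
    int num_complete_frames;  /* rendered frames waiting in the output buffer */
};

/* Delay in input frames, recomputed from buffer positions on every call.
 * Nothing is carried over between calls, so an error cannot accumulate. */
static double sketch_frame_delay(const struct tempo_state *s, double playback_rate)
{
    /* Input that was received but lies at or after the search block. */
    double pending_input = s->input_buffer_frames - s->search_block_index;
    /* Rendered output not yet handed out (assumed to be 0 in practice),
     * converted to input time by the playback rate (an assumption). */
    double pending_output = s->num_complete_frames * playback_rate;
    return pending_input + pending_output;
}
```

Since both terms are bounded by the buffer sizes, the result stays within buffer bounds regardless of how often the speed changes.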
a549e29 to ccedeb0
I found the problem and updated the commit. I added two more commits fixing the sources of artifacts when entering or leaving 1x speed. The second commit was needed for consistency of the delay calculation at 1x speed. With this change I see very consistent gaps no more than 20µs in magnitude between packets in With the additional fixes, when running with The cause for the jitter was using The end of the skipped audio is always in the search block, but the precise end position varies. Using the start of the search block appears to give correct results. I tested this by increasing either A/V sync is still very consistent with scaletempo(1) as far as I could tell. |
Good work. |
The reset to 1 should be clean now, if you have
I'm not sure about this. As far as I could tell, speed changes are very clean in scaletempo2. There was the 1x case, but that's clean now as well. In any case, maybe it's caused by
I'm not sure how much there is to reset without introducing new artifacts. Same point about There are two things I think would be worth looking into:
I run resampling near speed 1 and scaletempo2 for other speeds as is default. |
Would it be viable to calculate the speed according to the amount of data processed in the input buffer? My reasoning: since speed updates are more dynamic in scaletempo2, it might work better to reflect this in the emitted packets. Currently we try to match output packets with the value of the speed property, which doesn't correspond to the actual processing of the filter, as we noticed. I made a proof-of-concept on my branch fix-scaletempo2-desync-nogaps. It has a couple issues, but otherwise I don't get any gaps any more.
Reworking the Sidenote: in my second-to-last commit 77208a3 I fixed another problem in the initial value of |
Please check my experimental branch fix-scaletempo2-desync-time-rewrite
Cannot guarantee that I didn't make any memory safety mistakes. I'll check it tomorrow. Also need to update some comments. |
I will try and test your code soon. |
I double-checked the code and cleaned up my changes. Since output seems cleaner than before and I couldn't find any issues with all kinds of playback situations, I pushed my update to the PR branch. I spent more time looking for gaps in the produced audio. I don't see any gaps introduced larger than 20 microseconds. These are expected because The only gaps I did see were from changing other filters in the Now testing the original output of the filter, I noticed that there have always been a lot of gaps especially for fractional speeds. With my timing rewrite the situation is a lot better. |
I have now done a little testing with your new code. Also, as my code did a scaletempo2_reset when the speed was changed, I noticed that something has probably changed so that reset does not fully do a reset. If I do that, the pts frame delay calculation is off by about 25 ms, so I disabled this during my tests. So while the pts calculation is better, something has changed in the generated audio output. As I want the audio out pts values to be the same length as the audio in from the filter, for now the output as it was before your changes in the internals looks better to me.
Good find with the reset, I found one thing I overlooked when I fixed the Update: I think there may be something to do about The increased amount of data lost because of this could be explained by the change in a/v sync, but I'm not sure about this. If that's correct, I suspect the fix would hurt responsiveness of filter changes at low playback speeds. It's hard for me to understand the problem because I don't know how to reproduce it. |
From what I can see, it looks like the final handling loops until mp_scaletempo2_frames_available says that no frames are available. So I think it will empty whatever the internals want to give.
Actually it is backspace you press to reset speed to 1.0, not DEL |
I changed the final event handling to allow more iterations until the last frames have been emitted. It's like I thought: the gap was because of audio read from filter input not being written out after EOF. It's not possible to completely fill the gap; it's the same for the other filters. There are technical problems for that: When the filter is removed when going from 2x to 1x speed, the filter isn't notified of the new speed. As the filter doesn't know why the input ended, it can't produce the remaining audio of the exact length, because it shouldn't assume 1x speed for the actual end of file, for example. As far as I could tell, the gaps are now consistent with scaletempo. rubberband has much different timing behavior and doesn't seem to care about the gaps. Note: When testing with scaletempo or rubberband, it's not possible to test the filter removal gap behavior for 1x speed, because they will always stay active when explicitly in the
This is probably because the

Memory access issue in final packet handling

There was also a serious memory access issue in the handling of the final output packet: a missing dereference wasn't noticed, which caused wrong data to be overwritten with zeros. I added a separate commit for the fix, but the code is rewritten in the next commit anyway. It probably never got noticed because the end of a file is often very quiet, the written area was luckily in the audio area instead of the pointers of the 2d area, and it's hard to notice a problem in the last 10ms of output.
Good. It looks like you found a way to flush the remaining data in filter upon "final". The code before with a final flag for fill of input buffer did not work as I expected. |
Thinking about this a bit, I realized that after returning to 1x speed, there is now a constant offset, because the How about realigning the timing with the current This would produce a gap in output packets, up to I'll make some tests later. I think the easiest way would be to move the search block to the |
As I do not use the filter for speed = 1 I do not test that. which works more like my own code for it. |
When a seek is done, should the filter do a reset? |
What is not correct exactly? Do you mean the calculation, or is it that I specify a lower amount of samples to truncate it while passing the non-truncated buffer? As far as I could tell, the number of frames are calculated correctly in my code, but in most cases there are too few remaining input samples in the buffer to produce that many frames. The central problem here is that the target block position is very fluid while the filter is active, anywhere up to I quickly ruled out producing more silence as a solution, because it resulted in more jarring clicks in the audio when the filter is removed. I concluded that a pts gap is more favorable than producing silence. I suppose we could alternatively calculate a speed value for frames after
As far as I can tell, seeks always run the |
I've added the fix for exact sync when returning to 1x speed, for anyone who keeps the filter active during normal playback. It also drops unneeded frames at the start of the input buffer, which saves some cycles when seeking the buffer. |
0e125b3 to b3b4d75
Overlooked the |
Your code with truncating is correct. I did a copy-paste into my code and got one thing wrong. I have added rounding to it though, as that results in fewer gaps.
With that it works fine for me. |
I tested the code and the pts-gaps were fixed. Sadly I can notice a slight difference with the clicks. To confirm it, I changed the default values for I wonder if the removal of pts-gaps (in particular for filter removal) is that important. The only problem I can think of is video-sync. If set to As I understand it: pts-gaps don't cause clicks per se, because mpv plays the audio seamlessly. The reason clicks happen in that moment is that, when skipping over audio, the frames before and after the gap don't mesh together. Adding silence makes the problem worse, because now it can produce a click at the start and the end of a gap, and play muted audio for a moment. As I currently stand, I think it's better to avoid producing silence. IMO, feeding the filter with silence after EOF like it currently does is a bit dubious too, for the same reasons. But at least the amount is limited to one |
Maybe the clicks are depending on what audio out one is using. |
I'm using pipewire. I get the clicks with native pipewire output, pulseaudio output, or alsa output. They all go through pipewire though. |
`output_time` is used to set the center of the search block. Init of both `search_block_index` and `output_time` with 0 caused inconsistent search block movement for the first iterations. Initialize with `search_block_center_offset` for consistency with initial `search_block_index`.
`read_input_buffer` needs to respect the `target_block_index`, otherwise the audio resumes at the wrong position.
The first WSOLA iteration overlapped audio with whatever was in the `wsola_output` buffer. This was either silence (if not run before), or old frames (if switching to 1x and back to a different speed). Track the state of the output buffer and memcpy the whole window for the first iteration instead.
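The idea of the last commit message can be sketched as follows. This is a simplified illustration, not the filter's actual overlap-add code: the struct, the `has_data` flag name, and the plain cross-fade with a `fade_in` weight array are assumptions.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the filter's overlap-add output buffer. */
struct ola_output {
    float *frames;   /* `window` frames of already rendered audio */
    int window;      /* window length in frames */
    bool has_data;   /* false until the first iteration has written a window */
};

/* Write one window of new audio into the output buffer.
 * fade_in[i] is the weight of the new block at frame i (0 = old, 1 = new). */
static void write_window(struct ola_output *out, const float *block,
                         const float *fade_in)
{
    if (!out->has_data) {
        /* First iteration: there is nothing valid to overlap with, so copy
         * the whole window instead of blending with silence or stale frames. */
        memcpy(out->frames, block, sizeof(float) * out->window);
        out->has_data = true;
        return;
    }
    /* Later iterations: cross-fade the new block into the existing frames. */
    for (int i = 0; i < out->window; i++)
        out->frames[i] = out->frames[i] * (1.0f - fade_in[i])
                       + block[i] * fade_in[i];
}
```

Clearing `has_data` whenever the filter goes back to pass-through would make the next non-1x iteration start from a clean copy again.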
I have done some more tests, starting from mpv master. |
Ok, thank you, I think we might be talking about the same thing with distortions/clicks.
I agree 100%. I've thought about this as well, but it would be too complicated here. We could create a separate issue and I might look into it another time though. I need to take a break from this for a bit :) Currently cleaning stuff up and testing everything in my daily usage. Will probably update tomorrow. |
The internal time update function involved multiple problems:
- Time was updated after the WSOLA iteration. This means the speed was updated one iteration later than it could be.
- The update functions caused spikes of too many or too few samples advanced, leading to audio glitches on speed changes.
- The inconsistent updates made it very difficult to produce gapless audio packets.
- The `output_time` update function involved complicated feedback: `search_block_index` influenced how many frames from `input_buffer` are retained, which influenced how much `output_time` is changed, which influenced `search_block_index`.

With these changes:
- Time is updated before WSOLA iterations. Speed changes are effective instantly.
- There are no spikes in playback speed during speed changes.
- No significant gaps are introduced in output packets.
- The time update function becomes (function calls omitted for brevity)
      output_time += ola_hop_size * playback_rate

Functions received a `playback_rate` parameter to check how many samples are needed before iteration. Internal state is only updated when the iteration is actually run, so the speed is allowed to change until enough data is received.
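A minimal sketch of the "check first, commit only when the iteration runs" pattern described in this commit message. The struct, the function names, and the assumption that an iteration needs input up to half a search block past `output_time` are illustrative, not taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical, reduced filter state for illustration. */
struct time_state {
    double output_time;     /* position in the input, in frames */
    int ola_hop_size;       /* frames produced per iteration */
    int search_block_size;  /* frames spanned by the search block (assumed) */
    int input_frames;       /* frames available in the input buffer */
};

/* End of the input range the next iteration would read, for a given rate.
 * Does not modify the state. */
static double needed_input_end(const struct time_state *s, double playback_rate)
{
    double next_time = s->output_time + s->ola_hop_size * playback_rate;
    return next_time + s->search_block_size / 2.0;
}

/* Run one (stubbed) iteration if enough input is buffered. */
static bool try_iteration(struct time_state *s, double playback_rate)
{
    if (needed_input_end(s, playback_rate) > s->input_frames)
        return false;  /* state untouched, the rate may still change */
    /* Time is advanced before the iteration would run. */
    s->output_time += s->ola_hop_size * playback_rate;
    return true;
}
```

Because a failed check leaves `output_time` alone, a speed change that arrives while the filter is waiting for input takes effect on the very next iteration.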
Target block can be anywhere in the previous search-block, varying by `search-interval` while the filter is active. This resulted in constant audio offset when returning to 1x playback speed.
- Move the search block to the target block to sync up exactly.
- Drop old frames to minimize input_buffer usage.
This changes the emitted pts values from the start of the search block to the center of the search block. Change initial `output_time` accordingly. Initial `search_block_index` is irrelevant, because it's overwritten before the first iteration.

Using the `output_time` removes the rounding of `search_block_index`, which also fixes the <20 microsecond gaps in timestamps between output packets.

Rationale: The variance in audio position was in the range `0..search-interval`. With this change, the range is `(-search-interval / 2)..(search-interval / 2)`, which ensures a lower maximum offset.
c97d402 to b0425b0
I did my best to clean everything up; here's a (hopefully final) summary of all the changes in this PR:
Concerning 520e6b6, here's a way to reproduce a segfault that has been fixed:
This creates an extremely short audio file with 8 channels. When playing this back with It didn't crash with longer audio files, because the |
Can confirm that the desync and the audio artifact when switching from 1x to non-1x speed are fixed by this. I always thought that audio artifact was because of my conditional profile to switch from display-tempo to display-resample when the speed is close to 1, but apparently it was a problem with scaletempo2 and I'm glad you fixed it 👍 I'm not at all familiar with the relevant code though. Maybe I'll have a look when I find some time, but not sure I can provide any valuable feedback there. |
I did some local testing and everything appears to be working as expected. However, judging from a quick glance, the details of this filter's internals are well over my head. Perhaps @DorianRudolph can give a quick comment, seeing as they initially authored this filter? Other than that, it looks as though @ferreum is the de-facto maintainer of this filter, so I'm inclined to just merge it without further review. |
I think at this point ferreum and christoph have a better understanding of this than me. But it sounds and looks good to me, though I don't notice a difference :) |
I'll have a look at the last two commits tomorrow, the rest looks fine.
Commit 3 could have been part of commit 1.
In `mp_scaletempo2_init()`, `search_block_center_offset` gets set to 0 at the top and then to something else further down without any reads in between, but that's not your fault.
@@ -702,7 +702,15 @@ int mp_scaletempo2_fill_buffer(struct mp_scaletempo2 *p,
    // Optimize the most common |playback_rate| ~= 1 case to use a single copy
    // instead of copying frame by frame.
    if (p->ola_window_size <= faster_step && slower_step >= p->ola_window_size) {
I must be missing something here, but if speeds very close to 1 get treated as if speed is exactly 1, doesn't that mean that the output will eventually get out of sync because there is never any overlap happening?
I also wasn't sure about this check, so I didn't touch it. As I understand it, the whole condition is a complicated way to check if `playback_rate` is within `1 +/- (1/ola_window_size)`. I have no idea why it's done with all the `ceilf()` and division/multiplication. It originates from the chrome code.
As to de-synchronization, I just realized that could actually be a problem. Assuming 44100Hz sample rate, the filter becomes active around 1.001134x. A quick calculation (`units '100ms*882' time`) leads me to merely 1:28 minutes playback for 100ms difference (assuming worst-case speed like 1.00112x). In usual `video-sync` modes this gets corrected in some way, but I think we should adjust this to improve sync.
I believe there should still be an epsilon to guarantee perfect audio playback near `1x` speed. If I use `multiply` to change speed back and forth with a skewed number and its reciprocal, some error keeps accumulating, but I really have to try to get it above `1e-14`. I'd suggest a safe `1e-10` (>31 years for 100ms desync).
I also just realized that the filter's `p->speed` is still a `float`, so that needs to be changed to `double` for this to work.
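The two figures in this comment (1:28 minutes and >31 years) follow from a simple division. As a sketch, assuming the filter plays at exactly 1x while the requested speed is off by `rate_error`, the drift grows by `rate_error` seconds per second of playback. The function name is made up for this example.

```c
/* Seconds of playback until audio and video are `desync_s` seconds apart,
 * if the filter passes audio through at 1x while the requested speed is
 * 1 + rate_error. Hypothetical helper, not part of mpv. */
static double seconds_until_desync(double rate_error, double desync_s)
{
    double abs_error = rate_error < 0 ? -rate_error : rate_error;
    return desync_s / abs_error;
}
```

With `rate_error = 1.0 / 882` and `desync_s = 0.1` this gives 88.2 seconds, i.e. about 1:28 minutes. With `rate_error = 1e-10` it gives 1e9 seconds, which is a little under 32 years.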
Assuming 44100Hz sample rate, the filter becomes active around 1.001134x
That would be with default parameters right? I use `window-size=40` so that would be even worse.
We could leave the epsilon as is, but add a check to occasionally do an overlap iteration to get back in sync.
That would be with default parameters right?
Exactly. Adjusting the parameters would change the threshold. Actually, increasing the window-size increases the resolution (the epsilon is `1/ola_window_size`, not `ola_window_size`), so a larger value results in earlier activation of the filter.
We could leave the epsilon as is, but add a check to occasionally do an overlap iteration to get back in sync.
That's effectively what would happen when we completely remove this case. The filter keeps overlapping half-window-sized blocks with the same data, resulting (theoretically) in no audio change. That's until the target_block is not in the search window, which is when the WSOLA actually does its work--essentially moving the target block to the best-matching position within the search window.
I think this speaks in favor of reducing the epsilon to a very small value. If you want to try it, I've pushed a commit to a separate branch for testing (last commit). I've not found a way to even measure the difference though. I presume the hardware introduces more timing difference than the effects we're discussing.
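To illustrate how the pass-through threshold depends on `window-size`, here is a small sketch. It assumes the window length in samples is simply `sample_rate * window_ms / 1000` (any rounding the filter applies is ignored) and that the 882 figure above corresponds to a 20 ms window at 44100Hz. The function name is hypothetical.

```c
/* Distance from 1x below which the shortcut is taken, per the discussion:
 * epsilon = 1 / ola_window_size. Hypothetical helper for illustration. */
static double passthrough_epsilon(int sample_rate, double window_ms)
{
    double ola_window_size = sample_rate * window_ms / 1000.0;
    return 1.0 / ola_window_size;
}
```

For 44100Hz this gives roughly 0.001134 with a 20 ms window and roughly 0.000567 with `window-size=40`, so the larger window makes the filter leave pass-through closer to 1x.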
That's effectively what would happen when we completely remove this case.
Yes exactly, the whole point of that `if` is that it's an optimization right? But the way it is right now it's not only faster but also changes the behavior.
Instead of checking for speed being close to 1, we could check for if WSOLA would actually do something (target block is within search window, as you said). Maybe even with an epsilon here so that it still takes the shortcut if it's only a little bit outside of that or something. That would technically still be a change in behavior but as long as it doesn't drift off too much (because it's bounded) that shouldn't be a problem.
If we make the epsilon tiny it'll never get to take that shortcut with the usual video-sync modes because the drift compensation puts the speed too far away from 1. (probably, didn't test)
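The suggested check could look roughly like this. It is only a sketch of the idea in this comment, using plain integers instead of the filter state; the function names, the half-open region convention, and the `slack` parameter are assumptions.

```c
#include <stdbool.h>

/* True if a block of `window` frames starting at `target` lies entirely in
 * the region [search_start, search_start + search_len). Hypothetical helper. */
static bool block_within_region(int target, int window,
                                int search_start, int search_len)
{
    return target >= search_start &&
           target + window <= search_start + search_len;
}

/* Same check, but tolerate the target being up to `slack` frames outside
 * the region on either side, so the shortcut is still taken for small
 * deviations. The drift stays bounded by `slack`. */
static bool can_take_shortcut(int target, int window,
                              int search_start, int search_len, int slack)
{
    return block_within_region(target, window,
                               search_start - slack, search_len + 2 * slack);
}
```

Unlike the speed-based epsilon, this condition depends on the accumulated position, so the offset cannot grow without limit.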
https://chromium.googlesource.com/chromium/src/+/refs/heads/main/media/filters/audio_renderer_algorithm.cc#252
Looks like chromium is still doing the same thing. Too bad, otherwise we could have taken their solution.
I can vaguely think of a way your suggestion might work as a general optimization, but I'm not sure how much would need to be changed. Essentially I would move/duplicate the `target_is_within_search_region(p)` check here and only go through the `run_one_wsola_iteration` if the target block is outside the search region.
I can make some tests to work out the details and see if this is viable, but no guarantees.
The (small) epsilon would still be useful regardless of the optimization, because at the same time the filter reverts any audio offset it introduces.
I'm not sure if the optimization is worth whatever changes it requires though. I didn't profile it, but it seems very efficient as-is. It looks like micro-optimization where I'm standing.
Additionally I think these tiny speed differences are not relevant in practice, because with `video-sync=display-*`, I usually see much larger magnitudes of `audio-speed-correction`. And with `video-sync=audio` there should be no offset, as I understand it; playback speed is only slower or faster by that minuscule amount.
Let's leave it as is for now, since it's not a problem with any kind of `video-sync` that actually matters. It will only drift apart for sync modes that are already expected to drift apart, so even then nobody would actually notice.
It might still be a problem for tiny values of `video-sync-max-audio-change`, but I don't think anyone touches that anyway.
Either way I'd like to get this merged, and a fix for that drift would then be a separate PR.
(if you care enough to actually do something about it, totally understandable if you don't)
While trying to find out why that reset is there I've noticed that `target_block_index` never gets initialized. The reset function sets it to 0, but that doesn't happen when the filter starts getting used.
audio/filter/af_scaletempo2.c
} else if (format_change) {
    // go on with proper reinit on the next iteration
    p->initialized = false;
    p->sent_final = false;
}
Why leave an empty if? It adds some context to the comment, but that context could be added to the comment itself. for format change go on with...
Good point; I only saw the redundant lines and removed them.
Btw, should I continue rewriting affected commits that need these improvements, or should I push fixes as new commits? I'm not sure if rewriting would make reviewing difficult at this point.
Change the already existing commits please.
The uninitialized variable thing would be a separate commit.
// truncate final packet to expected length
if (out_samples >= max_samples) {
    out_samples = max_samples;
    mp_scaletempo2_reset(&p->data);
Why is that reset necessary?
Once the `mp_scaletempo2_set_final` method is called, the internals may produce more audio than expected, depending on the target block position and which blocks are chosen if WSOLA runs. The reset clears all this internal state to guarantee that `mp_scaletempo2_frames_available` returns false again.
After that, processing can continue as normal again. E.g. if there's a format change, then the first `while` loop of `process()` will reinitialize the internals with the new format (first iteration sets `p->initialized` to false and the next loop calls `init_scaletempo2`).
Thanks, that makes more sense now. A comment explaining that would be good.
Good find. Additionally Things seem to work out because the struct seems to be initialized with zeroes. Is it a problem to rely on that? Simplest fix would probably be to split the init function in two, and NULL all the pointers only the first time. Edit: |
After the final input packet, the filter padded with silence to allow one more iteration. That was not enough to process the final frames. Continue padding the end of `input_buffer` with silence until the final frames have been processed. Implementation: Instead of padding when adding final samples, pad before running WSOLA iteration. Count number of added silent frames and remaining input frames for time keeping.
Avoid generating too much audio after EOF. Note: This often has no effect, because less audio is produced than required. Usually this comes to effect with the userspeed filter at high speed (4x) and going back to 1x speed to remove the filter.
b0425b0 to d1eca4e
Updated the last 2 comments and added another:
Sorry it took so long until I reviewed this.
LGTM
Update: Apologies for how big this PR has gotten. It turns out there were a lot of unnoticed issues in scaletempo2, which are fixed here. There are detailed commit messages, but see my comment for a summary.
Original text:
This is what I came up with to fix #12028. It calculates the scaletempo2 delay based on the current buffer positions.
My testing consisted of the process described in #12028. With normal settings, the audio seems fine both with the new code, and without delay calculation (by removing the pts adjustment altogether as described in the issue). I find it impossible to notice the differences during speed changes in any case.
window-size delay
There was an additional issue that audio was always delayed by half the configured search-interval. Include `ola_hop_size` in the delay to compensate for that.
I noticed that by increasing the search-interval and window-size to impractically large values:
(Note: With these settings, repeated speed changes result in very inconsistent output with both the old and new calculation, but it helps to test the delay during constant replay.)
In the original code, I found that the audio was played back about half the search-interval too late (i.e. 0.5s with the above settings) even without changing playback speed. The fix was to adjust by `ola_hop_size` in the calculation.
I'm not sure where the `ola_hop_size` delay comes from. It's either the `out` buffer, or from the overlap in the scaletempo2 internals.
Additional notes
Here are some things I observed. If any of these are wrong, it may mean there is an issue in the new delay calculation:
- […] `ola_hop_size`.
- `mp_scaletempo2_fill_buffer` is always called with `ola_hop_size`.
- […] `num_complete_frames` is 0 in the delay calculation.
- `|target_block| is the "natural" continuation of the output`. The delay comes from the length of audio that the filter is holding back.
- […] `ola_hop_size`
The `frame_delay` code looked like that of the rubberband filter, which might not work for scaletempo2. Sometimes a different amount of input audio was consumed by scaletempo2 than expected. It may have been caused by speed changes being a more dynamic process in scaletempo2. This can be seen by where `playback_rate` is used in `run_one_wsola_iteration`: `playback_rate` is only referenced after the iteration, when updating the time and removing old data from buffers.
In scaletempo2, the playback speed is applied by changing the amount the search window is moved. That apparently averages out correctly at constant playback speed, but when the speed changes, the error in this assumption probably spikes. This error accumulated across all speed changes because of the persistent `frame_delay` value.
With the removal of the persistent `frame_delay`, there should be no way for the audio to drift off. Because the delay is derived from filter buffer positions, and the buffers are filled only as much as needed, the delay always stays within buffer bounds.