Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

af_scaletempo2: fix audio-video de-sync caused by speed changes #12052

Merged
merged 12 commits into from
Sep 20, 2023

Conversation

ferreum
Copy link
Contributor

@ferreum ferreum commented Jul 30, 2023

Update: Apologies for how big this PR has gotten. It turns out there were a lot of unnoticed issues in scaletempo2, which are fixed here. There are detailed commit messages, but see my comment for a summary.

Original text:

This is what I came up with to fix #12028. It calculates the scaletempo2 delay based on the current buffer positions.

My testing consisted of the process described in #12028. With normal settings, the audio seems fine both with the new code, and without delay calculation (by removing the pts adjustment altogether as described in the issue). I find it impossible to notice the differences during speed changes in any case.

window-size delay

There was an additional issue that audio was always delayed by half the configured search-interval. Include ola_hop_size in the delay to compensate for that.

I noticed that by increasing the search-interval and window-size to impractically large values:

--af=scaletempo2=search-interval=100:window-size=1000

(Note: With these settings, repeated speed changes result in very inconsistent output with both the old and new calculation, but it helps to test the delay during constant replay.)

In the original code, I found that the audio was played back about half the search-interval too late (i.e 0.5s with the above settings) even without changing playback speed. The fix was to adjust by ola_hop_size in the calculation.

I'm not sure where the ola_hop_size delay comes from. It's either the out buffer, or from the overlap in the scaletempo2 internals.

Additional notes

Here are some things I observed. If any of these are wrong, it may mean there is an issue in the new delay calculation:

  • Every WSOLA iteration advances the input buffer by some amount, and produces data in the output buffer always of size ola_hop_size.
  • mp_scaletempo2_fill_buffer is always called with ola_hop_size
  • Thus, the rendered frames are always cleared immediately after processing, and num_complete_frames is 0 in the delay calculation.
  • The input buffer expression makes sense as the header comment states, |target_block| is the "natural" continuation of the output. The delay comes from the length of audio that the filter is holding back.
  • The factors contributing to delay are:
    • the pending samples in the input buffer,
    • the pending rendered samples in the output buffer, and
    • an amount of ola_hop_size

The frame_delay code looked like that of the rubberband filter, which might not work for scaletempo2. Sometimes a different amount of input audio was consumed by scaletempo2 than expected. It may have been caused by speed changes being a more dynamic process in scaletempo2. This can be seen by where playback_rate is used in run_one_wsola_iteration: playback_rate is only referenced after the iteration, when updating the time and removing old data from buffers.

In scaletempo2, the playback speed is applied by changing the amount the search window is moved. That apparently averages out correctly at constant playback speed, but when the speed changes, the error in this assumption probably spikes. This error accumulated across all speed changes because of the persistent frame_delay value.

With the removal of the persistent frame_delay, there should be no way for the audio to drift off. By deriving the delay from filter buffer positions, and the buffers are filled only as much as needed, the delay always stays within buffer bounds.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 4, 2023

I've been using this patch since I posted it and found no issues.

Here are some additional test results I got:

I compared a/v sync of the three tempo filters by switching back and forth between any pairs of scaletempo <-> scaletempo2 <-> rubberband. I did this with the console, by running af add @tempo:scaletempo2, af add @tempo:scaletempo, and af add @tempo:rubberband, each command replacing the current filter.

  • rubberband seems to have a little more audio delay than scaletempo and scaletempo2: The audio skips forward when switching away from rubberband, and vice versa.
  • audio delay difference between scaletempo and scaletempo2 is almost indistinguishable
  • Without this patch, I noticed the slight unaccounted delay in scaletempo2 when switching between scaletempo2 and scaletempo, which has been fixed in this patch.

The results were consistent at every speed I tried, whether 1x, 1.01x, 2x, or 0.5x.

Lastly, after running af add @tempo:scaletempo2, I ran af toggle @tempo repeatedly at 1x speed to compare its passthrough implementation against unfiltered output. That didn't show any difference in delay either.

If anyone has more ideas what needs to be tested I would be glad to try.

sfan5
sfan5 previously approved these changes Aug 6, 2023
Copy link
Member

@sfan5 sfan5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works as described

@DanOscarsson
Copy link
Contributor

Though if you look at the audio packet arriving in player/audio.c:ao_process
they more often now come with a overlap or gap in the pts values.
To be really good the audio packets should arrive with the pts values in
a continues flow. I do not know if they are big enough to disturb a/v-sync in some cases like display-resample.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 7, 2023

I can reproduce the overlap/gaps and change of overlaps. I see jitter with scaletempo(1) and rubberband too, but it's much smaller and more consistent.

Here are my test results:

scaletempo 1x: consistent alternating +-0.000050
scaletempo 2x: consistent alternating +-0.000050, sometimes +-0.000051
rubberband 1x: consistent +-0.000000
rubberband 2x: consistent +-0.000045
rubberband 4x: consistent +-0.000000
scaletempo2 1x: consistent +-0.000000
scaletempo2 2x: +-0.001000..+-0.006000

It's fine at 1x, where the scaletempo2 is in pass-through mode and keeps buffers consistently filled, so there's no jitter.

I suspect there's something wrong with my assumption with the scaletempo2 input buffer. If I remove the p->data.input_buffer_frames - p->data.target_block_index term, there is very little jitter at every speed, below +-0.000060, as far as I have seen so far. Can't say anything about the a/v sync yet.

Fixes mpv-player#12028

There was an additional issue that audio was always delayed by half the
configured search-interval. This was caused by the `out` buffer length
not being included in the delay calculation.

Notes:
- Every WSOLA iteration advances the input buffer by _some amount_, and
  produces data in the output buffer always of size `ola_hop_size`.
- `mp_scaletempo2_fill_buffer` is always called with `ola_hop_size`
- Thus, the rendered frames are always cleared immediately after
  processing, and `num_complete_frames` is 0 in the delay calculation.
- The factors contributing to delay are:
  - the pending samples in the input buffer according to the search
    block position, and
  - the pending rendered samples in the output buffer (always empty in
    practice).

The frame_delay code looked like that of the rubberband filter, which
might not work for scaletempo2. Sometimes a different amount of input
audio was consumed by scaletempo2 than expected. It may have been caused
by speed changes being a more dynamic process in scaletempo2. This can
be seen by where `playback_rate` is used in `run_one_wsola_iteration`:
`playback_rate` is only referenced after the iteration, when updating
the time and removing old data from buffers.

In scaletempo2, the playback speed is applied by changing the amount the
search block is moved. That apparently averages out correctly at
constant playback speed, but when the speed changes, the error in this
assumption probably spikes. This error accumulated across all speed
changes because of the persistent `frame_delay` value.

With the removal of the persistent `frame_delay`, there should be no way
for the audio to drift off. By deriving the delay from filter buffer
positions, and the buffers are filled only as much as needed, the delay
always stays within buffer bounds.
@ferreum
Copy link
Contributor Author

ferreum commented Aug 8, 2023

I found the problem and updated the commit. I added two more commits fixing the sources of artifacts when entering or leaving 1x speed. The second commit was needed for consistency of the delay calculation at 1x speed.

With this change I see very consistent gaps no more than 20µs in magnitude between packets in player/audio.c:ao_process, which is well within the range of the other filters.

With the additional fixes, when running with --af=scaletempo2, so far I didn't notice any artifacts anymore when switching to or from 1x speed. With the default config there are still clicks from the filter being inserted or removed automatically.

The cause for the jitter was using target_block_index. Using the search block position works much better. That's because target_block_index is the point immediately before part of audio that is skipped on the next iteration, which varies a lot in length.

The end of the skipped audio is always in the search block, but the precise end position varies. Using the start of the search block appears to give correct results. I tested this by increasing either window-size or search-interval to 1000 and leaving the other at its default value. (Changing speed is wonky with these settings, so test at 1x or 1.01x)

A/V sync is still very consistent with scaletempo(1) as far as I could tell.

@DanOscarsson
Copy link
Contributor

Good work.
Yes it works better, but from what I see there are jumps when speed is changed or filter is removed (like reset speed to 1).
I think one problem is that scaletempo2 (and scaletempo) can produce audio for audio it has not received yet, or not enough audio for the amount that is has received. So if scaletempo2 is removed audio will be missing or overlap.
Also it looks like changing speed results in some audio lost/added. Best would be to reset scaletempo2 internals each time speed is changed.
I have been looking at this for some time and have tested code for it but it takes time to find all special cases (like changing speed, seeking and removing filter).

@ferreum
Copy link
Contributor Author

ferreum commented Aug 9, 2023

@DanOscarsson

from what I see there are jumps when speed is changed or filter is removed (like reset speed to 1).

The reset to 1 should be clean now, if you have scaletempo2 explicitly in the af list; that's the key point. Otherwise, there is a jump when the filter is added or removed by mpv when switching from/to 1x. I thought that's part of updating the filter graph, but I made some tests and can't tell what causes it. I can reproduce jumps by switching back and forth between scaletempo and rubberband, so it's not only scaletempo2.

Also it looks like changing speed results in some audio lost/added.

I'm not sure about this. As far as I could tell, speed changes are very clean in scaletempo2. There was the 1x case, but that's clean now as well. In any case, maybe it's caused by update_output_time working in strange ways.

Best would be to reset scaletempo2 internals each time speed is changed.

I'm not sure how much there is to reset without introducing new artifacts. Same point about update_output_time applies.

There are two things I think would be worth looking into:

  • The update_output_time function. I haven't fully understood what's happening with p->output_time there. It's multiplied with playback_rate and used for the new window position, which may be a cause for overlaps.
  • The time is updated after the WSOLA iteration. That means new speed is applied one iteration later than it could be. It also means the delay calculation uses the wrong p->speed for that first iteration; I'm not sure how significant that is. I thought about trying to move the update_output_time and remove_old_input_frames calls to the start of the iteration, but that was too deep into the code for me so far.
    • This would require passing the speed as parameter to the checks for how much data needs to be read. It then would have to handle changing speed during that (for the case process is called but not enough data being available), so I guess the update_output_time and remove_old_input_frames should only happen after all checks have passed, when running the actual WSOLA iteration. I'm not sure if that's currently feasible.

@DanOscarsson
Copy link
Contributor

I run resampling near speed 1 and scaletempo2 for other speeds as is default.
In my test code I add silence if needed when removing scaletempo2 or changing speed (and reset scaletempo2 when changing speed). I do not notice the silence inserted.
In my test code I calculate start pts of output audio by using start pts of first incoming audio packet and then just continuing pts values without any delay calculation.
This way audio pts have no gaps.
I did now a quick test comparing my start pts of the out packet with your frame delay based calculation. Interesting the difference between your and my setting of start pts is nearly exactly 25 ms with just less then 100 microseconds variation.
So unless I did some mistake in my quick test code both our ways to calculate pts are consistent. The 25 ms difference will probably not be noticed by any normal person.
As I reset scaletempo2 on speed change I do not know if the difference would change after a speed change without a reset.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 10, 2023

Would it be viable to calculate the speed according to the amount of data processed in the input buffer?

My reasoning: since speed updates are more dynamic in scaletempo2, it might work better to reflect this in the emitted packets. Currently we try to match output packets with the value of the speed property, which doesn't correspond to the actual processing of the filter, as we noticed.

I made a proof-of-concept on my branch fix-scaletempo2-desync-nogaps. It has a couple issues, but otherwise I don't get any gaps any more.
The issues are:

  • Time in the filter can rewind on speed changes, which means the search block moves backwards and the amount of "consumed" data is negative. I don't think the filter should ever do that and would need a fix anyways. In my branch that results in negatve playback speed, which probably causes problems down the line. I think there are multiple ways to fix this.
  • I now worry that the spikes in speed for the emitted packet may be a problem. For a change from 1x -> 2x I've observed an emitted packet with 9x speed in between.

Reworking the update_output_time function as I suggested earlier may fix both the spikes and the time going backwards. Alternatively, we could fix time going backwards by adding a check to the function.

Sidenote: in my second-to-last commit 77208a3 I fixed another problem in the initial value of output_time. Initializing it to the center of the search window should be better, because that's what it's used to calculate. This fixes a window positioning problem for the first few WSOLA iterations, where the search block position became negative. I didn't check if that was even a memory access problem.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 10, 2023

Please check my experimental branch fix-scaletempo2-desync-time-rewrite
I rewrote the timing code to update time before the iteration, as described above. It seems to work a lot better than before:

  • No more gaps above 20 microseconds, even during speed changes.
  • No more one-iteration delay for speed changes.
  • Much simpler implementation of time keeping.

Cannot guarantee that I didn't make any memory safety mistakes. I'll check it tomorrow. Also need to update some comments.

@DanOscarsson
Copy link
Contributor

I will try and test your code soon.
I have not tried to understand the internals code and change it (except removing the muting of fast/slow speed like chromium did).
Instead I changed just in the af_scaletempo2.c code to give continues audio including the difficult places of seek, speed change and removal of filter. In all those cases I output remaining audio in buffers to correct pts and then reset/initialise scaletempo2 internals.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 11, 2023

I double-checked the code and cleaned up my changes. Since output seems cleaner than before and I couldn't find any issues with all kinds of playback situations, I pushed my update to the PR branch.

I spent more time looking for gaps in the produced audio. I don't see any gaps introduced larger than 20 microseconds. These are expected because output_time tracks time as double and is truncated when used as a buffer index. (see my commit message and updated comment in the header file for details)

The only gaps I did see were from changing other filters in the af list during playback. I'm pretty sure my changes haven't introduced this, as I could reproduce these with the master commit I started at.

Now testing the original output of the filter, I noticed that there have always been a lot of gaps especially for fractional speeds. With my timing rewrite the situation is a lot better.

@DanOscarsson
Copy link
Contributor

I have now done a little testing with your new code.
I have compared how it was before your changes apart from the new frame delay calculation for getting pts.
I am still my code for filling out audio data when the filter is removed.
Unless I got something wrong, my observations are:
Before your pts was about 25 ms off to audio input. Now it is about 0 ms.
Looks good, but when scaletempo2 is removed (like going to speed 1 using default config) the missing data is different.
I did three tests with different speeds and the then just back to speed 1.
Now when filter is removed I get between 36 and 44 ms audio data missing compared to input audio. Before the same tests missed 10 ms once and 0 ms for the other two.
So something have changed in how audio is generated by the filter.
I do not know it the difference can be heard or what have changed.
Now there have to be much more silence filled in to cover gap in audio when filter is removed. I do not know if that can be noticed when listening.

Also, as my code did a scaletempo2_reset when speed was changed I noticed that probably something have changed so reset does not fully do a reset. If I do that the pts frame delay calculation is off about 25 ms. So I disabled this during my tests.

So while pts calculation is better something have changed in the audio out generated. As I want the audio out pts values to be same length as audio in, from filter, currently is looks better for me as it was before your changes in the internals.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 12, 2023

Good find with the reset, I found one thing I overlooked when I fixed the output_time in init. It should be consistent with reset too now.

Update: I think there may be something to do about eof/final handling. The buffer is filled for one more iteration when the final event is received, but more than one iteration may be needed to reach the end.

The increased amount of data lost because of this could be explained by the change in a/v sync, but I'm not sure about this. If that's correct, I suspect the fix would hurt responsiveness of filter changes at low playback speeds. It's hard for me to understand the problem because I don't know how to reproduce it.

@DanOscarsson
Copy link
Contributor

From what I can see it looks like in final handling it loops until mp_scaletempo2_frames_available says the no frames are available. So I think it will empty what the internals wants to give.
Before your changes I see that in the start an audio packet is read and given to internals, then a number of reads from internals are done before a new audio packet is input. After your changes the difference is that less reads are done after first audio in packet before a new audio packet is input. So it looks like your code needs more audio data internally before wanting to output data.
In my tests I start with --no-config and --speed=1.1 and then press DEL are a short while to change speed to 1.0 which will remove the scaletempo2 filter.
You ought to see the gap in pts when the filter is removed.

@DanOscarsson
Copy link
Contributor

Actually it is backspace you press to reset speed to 1.0, not DEL

@ferreum
Copy link
Contributor Author

ferreum commented Aug 13, 2023

I changed the final event handling to allow more iterations until the last frames have been emitted. It's like I thought: the gap was because of audio read from filter input not being written out after EOF.

It's not possible to completely fill the gap; it's the same for the other filters. There are technical problems for that: When the filter is removed when going from 2x to 1x speed, the filter isn't notified of the new speed. As the filter doesn't know why the input ended, it can't produce the remaining audio of the exact length, because it shouldn't assume 1x speed for the actual end of file, for example.

As far as I could tell, the gaps are now consistent with scaletempo. rubberband has much different timing behavior and doesn't seem to care about the gaps.

Note: When testing with scaletempo or rubberband, it's not possible to test the filter removal gap behavior for 1x speed, because they will always stay active when explicitly in the af graph. But it's possible to compare the gaps by adding/removing one of the tempo filters from af. Needs to be at 1x speed so scaletempo2 isn't added automatically.

So it looks like your code needs more audio data internally before wanting to output data.

This is probably because the output_time is now initialized consistently with the search_block_index. In the old code I could see the window shifting into negative values, leading to more iterations with less input data (i.e. producing more audio than expected).

Memory access issue in final packet handling

There was also a serious memory access issue in the handling of the final output packet: A missing dereference wasn't noticed, which caused wrong data to be overwritten with zeros. I added a separate commit for the fix, but the code is rewritten in the next commit anyways.

It probably never got noticed because the end of files is often very quiet, the written area was luckily in the audio area instead of the pointers of the 2d area, and it's hard to notice a problem in the last 10ms of output.

@DanOscarsson
Copy link
Contributor

Good. It looks like you found a way to flush the remaining data in filter upon "final". The code before with a final flag for fill of input buffer did not work as I expected.
I have only done a very quick test and it looks better.
Though I think a good filter should be able to, when told that this is final audio data, flush out all inside filter and produce audio corresponding to input audio.
It does not need to know what speed any audio coming after the final input audio should have.
I think a resampler filter can do that.
Anyway I think it is good enough now.
I have to test some more before saying I cannot find any more problems.
With my quick test I saw one thing that did not match my code so I have to check if it is your new code or mine that is doing it wrong.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 15, 2023

Thinking about this a bit, I realized that after returning to 1x speed, there is now a constant offset, because the target_block_index could be anywhere in the search block.

How about realigning the timing with the current target_block_index during 1x speed, which would produce perfectly aligned audio again?

This would produce a gap in output packets, up to search-interval in length (varying depending on where the index happens to be). To my knowledge wouldn't be audible, because the produced samples are played back seamlessly. As to video-sync I think such a small jump in video playback should be bearable to sync playback up again (if video-sync=audio, but I think display-resample should be fine too) .

I'll make some tests later. I think the easiest way would be to move the search block to the target_block_index during 1x speed, which would also align it properly for switching to different speed again.

@DanOscarsson
Copy link
Contributor

As I do not use the filter for speed = 1 I do not test that.
I have done some more testing and it looks good.
I have code to handle it during filtering the filter wants to output more audio
then is has received. In my tests so far this has not happened so I can probably
remove that code (which is nice as it has to cache data for next round).
Your calculation of pts matches my value for my different way to calculate it, very closely. It looks like +/- 10 microseconds difference. Good.
Currently the only special things are what should be done of EOF (filter removal).
There can be to much data and to little.
Your code to truncate data when to much is not correct.
You could, for example, instead add after mp_aframe_mul_speed(out, p->speed);
The code:
if (pts != MP_NOPTS_VALUE && p->sent_final) {
if (mp_aframe_end_pts(out) > pts) {
mp_aframe_clip_timestamps(out, mp_aframe_get_pts(out), pts);
mp_scaletempo2_reset(&p->data);
}
}

which works more like my own code for it.
Though I would have liked at version of the clip timestamps code that rounded samples which would have clipped to closest timestamp value.
In my code I also handle missing data by adding silence.
Have to remove all checks and clean that out before I can present it here.

@DanOscarsson
Copy link
Contributor

When a seek is done, should the filter do a reset?
Otherwise there might be some left data inside the filter that will be output as audio for the new position that was seeked to.
But as nobody have complained that they hear audio from wrong place when they seek, it is probably so small/seldom that nobody notices it.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 15, 2023

Your code to truncate data when to much is not correct.

What is not correct exactly? Do you mean the calculation, or is it that I specify a lower amount of samples to truncate it while passing the non-truncated buffer?

As far as I could tell, the number of frames are calculated correctly in my code, but in most cases there are too few remaining input samples in the buffer to produce that many frames. The central problem here is that the target block position is very fluid while the filter is active, anywhere up to search-interval ahead of the reported playback position to keep the audio smooth while it's cutting out or repeating frames to produce the requested speed. (It's imperative to how the filter works and incidentally what produced the gaps/overlaps my very initial PR, which exposed the real audio position in output packets.)

I quickly ruled out producing more silence as a solution, because it resulted in more jarring clicks in the audio when the filter is removed. I concluded that a pts gap is more favorable than producing silence.

I suppose we could alternatively calculate a speed value for frames after eof to bridge over to the final value.

When a seek is done, should the filter do a reset?

As far as I can tell, seeks always run the reset command of the filter, so it shouldn't be needed.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 15, 2023

I've added the fix for exact sync when returning to 1x speed, for anyone who keeps the filter active during normal playback. It also drops unneeded frames at the start of the input buffer, which saves some cycles when seeking the buffer.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 16, 2023

Overlooked the output_time in the previous commit. It's also reset now.

@DanOscarsson
Copy link
Contributor

DanOscarsson commented Aug 16, 2023

Your code with truncating is correct. I did a copy past into my code and got one thing wrong. Though I have added rounding to it as that will get less gaps.
Also, prefer adding silence when missing (and that does your code do on final handling) when filter is to be removed. At least in the video I test against I cannot hear any new clicks so at least it is very seldom.
My differences with yours including my add silence code is as follows:

--- af_scaletempo2.c	2023-08-16 12:25:33.319606294 +0200
+++ audio/filter/af_scaletempo2.c	2023-08-16 15:52:46.638760108 +0200
@@ -1,3 +1,5 @@
+#include <math.h>
+
 #include "audio/aframe.h"
 #include "audio/filter/af_scaletempo2_internals.h"
 #include "audio/format.h"
@@ -115,7 +117,7 @@
             if (p->sent_final) {
                 double remain_pts = pts - mp_aframe_get_pts(out);
                 double rate = mp_aframe_get_effective_rate(out) / p->speed;
-                int max_samples = MPMAX(0, (int) (remain_pts * rate));
+                int max_samples = MPMAX(0, (int)round(remain_pts * rate));
                 // truncate final packet to expected length
                 if (out_samples >= max_samples) {
                     out_samples = max_samples;
@@ -126,6 +128,24 @@
 
         mp_aframe_set_size(out, out_samples);
         mp_aframe_mul_speed(out, p->speed);
+        if (pts != MP_NOPTS_VALUE && p->sent_final &&
+            !mp_scaletempo2_frames_available(&p->data, p->speed) &&
+            pts > mp_aframe_end_pts(out)) {
+                int add = round((pts - mp_aframe_end_pts(out)) * mp_aframe_get_effective_rate(out));
+                if (add > 0) {
+                    struct mp_aframe *new = mp_aframe_new_ref(p->cur_format);
+                    if (mp_aframe_pool_allocate(p->out_pool, new, mp_aframe_get_size(out) + add) < 0) {
+                        talloc_free(out);
+                        talloc_free(new);
+                        goto error;
+                    }
+                    mp_aframe_copy_attributes(new, out);
+                    mp_aframe_copy_samples(new, 0, out, 0, mp_aframe_get_size(out));
+                    mp_aframe_set_silence(new, out_samples, add);
+                    talloc_free(out);
+                    out = new;
+                }
+        }
         mp_pin_in_write(f->ppins[1], MAKE_FRAME(MP_FRAME_AUDIO, out));
     }

With that it works fine for me.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 16, 2023

I tested the code and the pts-gaps were fixed. Sadly I can notice a slight difference with the clicks. To confirm it, I changed the default values for search-interval and window-size to 300 and 200 (this can't be modified by config without --af=scaletempo2 AFAIK, but bear with me here). It won't be that excessive with the default values, but it still shows that it makes the audible gaps worse.

I wonder if the removal of pts-gaps (in particular for filter removal) is that important.
Do you have a way to reproduce a problem caused by pts-gaps?

The only problem I can think of is video-sync. If set to audio, mpv has to make the video catch up with the jump in pts. If set to display-resample, it might slow down audio to compensate, not sure. I never noticed either of these happening, so it seems fine in practice.

As I understand it: pts-gaps don't cause clicks per se, because mpv plays the audio seamlessly. The reason clicks happen in that moment is that, when skipping over audio, the frames before and after the gap don't mesh together. Adding silence makes the problem worse, because now it can produce a click at the start and the end of a gap, and play muted audio for a moment.

As I currently stand, I think it's better to avoid producing silence. IMO, feeding the filter with silence after EOF like it currently does is a bit dubious too, for the same reasons. But at least the amount is limited to one window-size and includes a fade out from the overlap-and-add operation.

@DanOscarsson
Copy link
Contributor

Maybe the clicks are depending on what audio out one is using.
I am using pulseaudio.
I cannot get any clicks playing the countdown video. I tested with your code, with my silence add code and my repeat filter audio code.
Are you using a different audio out?

@ferreum
Copy link
Contributor Author

ferreum commented Aug 18, 2023

I'm using pipewire. I get the clicks with native pipewire output, pulseaudio output, or alsa output. They all go through pipewire though.
Is there anything mpv does to the ao when the filter configuration is changed? I don't know of a reason for it to do so. In that case, it would also be irrelevant for this issue.
I don't think it's pipewire. I remember getting clicks for filter changes for as long as I used mpv, before I switched to pipewire.

`output_time` is used to set the center of the search block. Init of
both `search_block_index` and `output_time` with 0 caused inconsistent
search block movement for the first iterations.

Initialize with `search_block_center_offset` for consistency with initial
`search_block_index`.
`read_input_buffer` needs to respect the `target_block_index`, otherwise
the audio resumes at the wrong position.
The first WSOLA iteration overlapped audio with whatever was in the
`wsola_output` buffer. This was either silence (if not run before), or
old frames (if switching to 1x and back to a different speed).

Track the state of the output buffer and memcpy the whole window for the
first iteration instead.
@DanOscarsson
Copy link
Contributor

I have done some more tests, starting from mpv master.
I cannot hear clear clicks, but I can sometimes hear slightly distorted audio.
And I can hear it with standard code, with your changes and with my additions.
I cannot say if any code is worse than any other.
I have been running code which fixes scaletempo for many years (form before scaletempo2 was added) which includes repeating last audio to fill hole.
Yes, listening carefully I can hear slight distortions with it too (sometimes).
Over the years there have been clicks a few times. But nothing I normally is disturbed by.
To really fix the problem the scaletempo2 internals (and other filters) probably need special "final" handling that outputs audio in a way that it will fill last missing audio and match the ending of input audio to reduce possibility of clicks/distortions. I guess that will have to be for some other time.
I have not looked at how the audio samples look in the last audio packet to see if there could be done something outside the filters to smooth over the audio into the next audio packet.
Do like you want. I will add my additions to my version of mpv to fill missing audio to avoid gaps as that makes playing video during speed change smoother.

@ferreum
Copy link
Contributor Author

ferreum commented Aug 19, 2023

Ok, thank you, I think we might be talking about the same thing with distortions/clicks.

To really fix the problem the scaletempo2 internals (and other filters) probably need special "final" handling that outputs audio in a way that it will fill last missing audio and match the ending of input audio to reduce possibility of clicks/distortions.

I agree 100%. I've thought about this as well, but it would be too complicated here. We could create a separate issue and I might look into it another time though. I need to take a break from this for a bit :)

Currently cleaning stuff up and testing everything in my daily usage. Will probably update tomorrow.

The internal time update function involved multiple problems:

- Time was updated after WSOLA iteration. The means speed was updated
  one iteration later than it could be.
- The update functions caused spikes of too many or too few samples
  advanced, leading to audio glitches on speed changes.
- The inconsistent updates made it very difficult to produce gapless
  audio packets.
- The `output_time` update function involved complicated feedback:
  `search_block_index` influenced how many frames from `input_buffer`
  are retained, which influenced how much `output_time` is changed,
  which influenced `search_block_index`.

With these changes:

- Time is updated before WSOLA iterations. Speed changes are effective
  instantly.
- There are no spikes in playback speed during speed changes.
- No significant gaps are introduced in output packets.
- The time update function becomes (function calls omitted for brevity)

    output_time += ola_hop_size * playback_rate

Functions received a `playback_rate` parameter to check how many samples
are needed before iteration. Internal state is only updated when the
iteration is actually run, so the speed is allowed to change until
enough data is received.
Target block can be anywhere in the previous search-block, varying by
`search-interval` while the filter is active. This resulted in constant
audio offset when returning to 1x playback speed.

- Move the search block to the target block to sync up exactly.
- Drop old frames to minimize input_buffer usage.
This changes the emitted pts values from the start of the search block
to the center of the search block. Change initial `output_time`
accordingly. Initial `search_block_index` is irrelevant, because it's
overwritten before the first iteration.

Using the `output_time` removes the rounding of `search_block_index`,
which also fixes the <20 microsecond gaps in timestamps between output
packets.

Rationale:

The variance in audio position was in the range `0..search-interval`.

With this change, the range is

    (- search-interval / 2)..(search-interval / 2)`

which ensures lower maximum offset.
@ferreum
Copy link
Contributor Author

ferreum commented Aug 20, 2023

I did my best to clean everything up; here's a (hopefully final) summary of all the changes in this PR:

  • Fixed Speed changes with scaletempo2 can cause audio/video desynchronization #12028 (a2c39d3)
  • Fixed audio delay of half a window-size. (a2c39d3)
  • Fixed missing dereference after EOF and on format change, overwriting wrong data and potentially 2d-array pointers with NULL. (520e6b6)
  • Fixed gaps and overlaps between pts-ranges of produced audio packets during speed changes. (multiple commits)
  • Fixed speed changes being applied one iteration later than requested. (c8d2674)
  • Fixed jumps in audio during speed changes. (c8d2674)
  • Fixed audio artifact when switching from 1x to non-1x speed (6978195)
  • Fixed audio position when switching from non-1x to 1x speed, which caused audio artifact and slight desync. (b0932c8)
  • Fixed buffered audio not being fully processed on EOF or format change. (8055e81)

@ferreum ferreum marked this pull request as ready for review August 20, 2023 09:10
@ferreum
Copy link
Contributor Author

ferreum commented Aug 20, 2023

Concerning 520e6b6, here's a way to reproduce a segfault that has been fixed:

ffmpeg -f lavfi -i "sine=frequency=1000:duration=0.0001" -ac 8 test.flac

This creates an extremely short audio file with 8 channels. When playing this back with master with either --speed=1.1 or --af=scaletempo2, mpv crashed. With the fix it completes fine.

It didn't crash with longer audio files, because the input_buffer was kept full enough that it didn't overwrite the 2d-array pointers at the start of the memory allocation.

@christoph-heinrich
Copy link
Contributor

Can confirm that the desync and the audio artifact when switching from 1x to non-1x speed are fixed by this.

I always thought that audio artifact was because of my conditional profile to switch from display-tempo to display-resample when the speed is close to 1, but apparently it was a problem with scaletempo2 and I'm glad you fixed it 👍

I'm not at all familiar with the relevant code though. Maybe I'll have a look when I find some time, but not sure I can provide any valuable feedback there.

@haasn
Copy link
Member

haasn commented Sep 16, 2023

I did some local testing and everything appears to be working as expected. However, the details of this filters internals are well over my head from a quick glance.

Perhaps @DorianRudolph can give a quick comment, seeing as they initially authored this filter? Other than that, it looks as though @ferreum is the de-facto maintainer of this filter, so I'm inclined to just merge it without further review.

@DorianRudolph
Copy link
Contributor

I think at this point ferreum and christoph have a better understanding of this than me. But it sounds and looks good to me, though I don't notice a difference :)

Copy link
Contributor

@christoph-heinrich christoph-heinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a look at the last two commits tomorrow, the rest looks fine.

Commit 3 could have been part of commit 1.
In mp_scaletempo2_init() search_block_center_offset gets set to 0 at the top and then to something else further down without any reads in between, but that's not your fault.

@@ -702,7 +702,15 @@ int mp_scaletempo2_fill_buffer(struct mp_scaletempo2 *p,
// Optimize the most common |playback_rate| ~= 1 case to use a single copy
// instead of copying frame by frame.
if (p->ola_window_size <= faster_step && slower_step >= p->ola_window_size) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must be missing something here, but if speeds very close to 1 get treated as if speed is exactly 1, doesn't that mean that the output will eventually get out of sync because there is never any overlap happening?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wasn't sure about this check, so I didn't touch it. As I understand it, the whole condition is a complicated way to check if playback_rate rate is within 1 +/- (1/ola_window_size). I have no idea why it's done with all the ceilf() and division/multiplication. It originates from the chrome code.

As to de-synchronization, I just realized that could actually be a problem. Assuming 44100Hz sample rate, the filter becomes active around 1.001134x. A quick calculation (units '100ms*882' time) leads me to merely 1:28 minutes playback for 100ms difference (assuming worst-case speed like 1.00112x). In usual video-sync modes this gets corrected in some way, but I think we should adjust this to improve sync.

I believe there should still be an epsilon to guarantee perfect audio playback near 1x speed. If I use multiply to change speed back and forth with a skewed number and its reciprocal, some error keeps accumulating, but I really have to try to get it above 1e-14. I'd suggest a safe 1e-10 (>31 years for 100ms desync).

I also just realized that the filter's p->speed is still a float, so that needs to be changed to double for this to work.

Copy link
Contributor

@christoph-heinrich christoph-heinrich Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming 44100Hz sample rate, the filter becomes active around 1.001134x

That would be with default parameters right? I use window-size=40 so that would be even worse.

We could leave the epsilon as is, but add a check to to occasionally do an overlap iteration to get back in sync.

Copy link
Contributor Author

@ferreum ferreum Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be with default parameters right?

Exactly. Adjusting the parameters would change the threshold. Actually, increasing the window-size increases the resolution (the epsilon is 1/ola_window_size, not ola_window_size), so a larger value results in earlier activation of the filter.

We could leave the epsilon as is, but add a check to to occasionally do an overlap iteration to get back in sync.

That's effectively what would happen when we completely remove this case. The filter keeps overlapping half-window-sized blocks with the same data, resulting (theoretically) in no audio change. That's until the target_block is not in the search window, which is when the WSOLA actually does its work--essentially moving the target block to the best-matching position within the search window.

I think this speaks in favor of reducing the epsilon to a very small value. If you want to try it, I've pushed a commit to a separate branch for testing (last commit). I've not found a way to even measure the difference though. I presume the hardware introduces more timing difference than the effects we're discussing.

Copy link
Contributor

@christoph-heinrich christoph-heinrich Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's effectively what would happen when we completely remove this case.

Yes exactly, the whole point of that if is that it's an optimization right? But the way it is right now it's not only faster but also changes the behavior.

Instead of checking for speed being close to 1, we could check for if WSOLA would actually do something (target block is within search window, as you said). Maybe even with an epsilon here so that it still takes the shortcut if it's only a little bit outside of that or something. That would technically still be a change in behavior but as long as it doesn't drift off too much (because it's bounded) that shouldn't be a problem.

If we make the epsilon tiny it'll never get to take that shortcut with the usual video-sync modes because the drift compensation puts the speed too far away from 1. (proabably, didn't test)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://chromium.googlesource.com/chromium/src/+/refs/heads/main/media/filters/audio_renderer_algorithm.cc#252
Looks like chromium is still doing the same thing. Too bad, otherwise we could have taken their solution.

Copy link
Contributor Author

@ferreum ferreum Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can vaguely think of a way your suggestion might work as a general optimization, but I'm not sure how much would need to be changed. Essentially I would move/duplicate the target_is_within_search_region(p) check here and only go through the run_one_wsola_iteration if the target block is outside the search region.

I can make some tests to work out the details and see if this is viable, but no guarantees.

The (small) epsilon would still be useful regardless of the optimization, because at the same time the filter reverts any audio offset it introduces.

I'm not sure if the optimization is worth whatever changes it requires though. I didn't profile it, but it seems very efficient as-is. It looks like micro-optimization where I'm standing.
Additionally I think these tiny speed differences are not relevant in practice, because with video-sync=display-*, I usually see much larger magnitudes of audio-speed-correction. And with video-sync=audio there should be no offset, as I understand it; playback speed is only by that miniscule amount slower or faster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave it as is for now. Since it's not a problem with any kind of video-sync that actually matters. It will only drift apart for sync modes that are already expected to drift apart, so even then nobody would actually notice.
It might still be a problem for tiny values of video-sync-max-audio-change, but I don't think anyone touches that anyway.

Either way I'd like to get this merged, and a fix for that drift would then be a separate PR.
(if you care enough to actually do something about it, totally understandable if you don't)

Copy link
Contributor

@christoph-heinrich christoph-heinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While trying to find out why that reset is there I've noticed that target_block_index never gets initialized. The reset function sets it to 0, but that doesn't happen when the filter starts getting used.

} else if (format_change) {
// go on with proper reinit on the next iteration
p->initialized = false;
p->sent_final = false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why leave an empty if? It adds some context to the comment, but that context could be added to the comment itself. for format change go on with...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point; I only saw the redundant lines and removed them.

Btw, should I continue rewriting affected commits that need these improvements, or should I push fixes as new commits? I'm not sure if rewriting would make reviewing difficult at this point.

Copy link
Contributor

@christoph-heinrich christoph-heinrich Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the already existing commits please.
The uninitialized variable thing would be a separate commit.

// truncate final packet to expected length
if (out_samples >= max_samples) {
out_samples = max_samples;
mp_scaletempo2_reset(&p->data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that reset necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the mp_scaletempo2_set_final method is called, the internals may produce more audio than expected, depending on the target block position and which blocks are chosen if WSOLA runs. The reset clears all this internal state to guarantees that mp_scaletempo2_frames_available returns false again.

After that, processing can continue as normal again. E.g. if there's a format change, then the first while loop of process() will reinitialize the internals with the new format (first iteration sets p->initialized to false and the next loop calls init_scaletempo2)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes more sense now. A comment explaining that would be good.

@ferreum
Copy link
Contributor Author

ferreum commented Sep 19, 2023

While trying to find out why that reset is there I've noticed that target_block_index never gets initialized. The reset function sets it to 0, but that doesn't happen when the filter starts getting used.

Good find. Additionally input_buffer and input_buffer_size also aren't initialized here, and all the realloc_2d calls receive an uninitialized pointer the first time. The problem is that mp_scaletempo2_init is also called to change audio format. Because the memory regions are re-used, it's not as simple to fix as to add an assignment.

Things seem to work out because the struct seems to be initialized with zeroes. Is it a problem to rely on that? Simplest fix would probably be to split the init function in two, and NULL all the pointers only the first time.

Edit: on a more serious issue, the input_buffer isn't cleanly resized when the format changes. The size checks only look at the length, but not the number of channels. I have no idea how to get a file with format changes, so don't know how to test. It's probably best to free the buffer on format change and allocate on next use.
Edit2: Scratch that, I just noticed the resize_input_buffer call in init.
Edit3: nothing special about input_buffer here

After the final input packet, the filter padded with silence to allow
one more iteration. That was not enough to process the final frames.

Continue padding the end of `input_buffer` with silence until the final
frames have been processed.

Implementation: Instead of padding when adding final samples, pad before
running WSOLA iteration. Count number of added silent frames and
remaining input frames for time keeping.
Avoid generating too much audio after EOF.

Note: This often has no effect, because less audio is produced than
required.

Usually this comes to effect with the userspeed filter at high speed
(4x) and going back to 1x speed to remove the filter.
@ferreum
Copy link
Contributor Author

ferreum commented Sep 19, 2023

Updated the last 2 comments and added another:

  • added a comment
  • removed an empty else case
  • initialized target_block_index on init
  • removed the redundant search_block_center_offset assignment

Copy link
Contributor

@christoph-heinrich christoph-heinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took so long until I reviewed this.
LGTM

@haasn haasn merged commit 95157bb into mpv-player:master Sep 20, 2023
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed changes with scaletempo2 can cause audio/video desynchronization
6 participants