Is there a limit to the audio duration? #181

JJun-Guo · 2023-06-06T03:18:22Z

Is there a limit to the audio duration?

HarikalarKutusu · 2023-06-06T04:10:42Z

Hey @JJun-Guo, recordings in Common Voice are currently limited to 10 seconds.

Here is a related recent discussion on allowing more:
https://discourse.mozilla.org/t/discussion-relaxation-of-the-10-sec-recording-limitation/114142

JJun-Guo · 2023-06-06T04:46:58Z

Hi, how about the shortest time limit?

HarikalarKutusu · 2023-06-06T04:50:05Z

I need to check the code, but off the top of my head, it was 1 sec but dropped to 0.5...

Actually, since the duration also includes silences, short utterances can easily be recorded by leaving a silence at the start or at the end while recording.

HarikalarKutusu · 2023-06-06T04:55:24Z

I was wrong. It is 1 sec. 0.5 sec is for the benchmark sentences (numbers etc).

https://github.com/common-voice/common-voice/blob/3bccdf446f6acd8a9afda1db7a9a1664457e611d/web/src/components/pages/contribution/speak/speak.tsx#L42

But as I stated in the discussion linked in my previous post, state-of-the-art models work better with longer utterances. E.g., Whisper works best with 5-25 sec recordings...

So, it is better to measure an average character duration and calculate a minimum sentence length from there...
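A rough sketch of that calculation in Python, assuming you have (sentence, duration) pairs from validated clips. The function names and the sample data are hypothetical, and the 5 sec target is just the lower bound mentioned above, not an official value. Note that the measured durations include leading/trailing silence, so the per-character rate will be somewhat overestimated.

```python
import math

def average_char_duration(clips):
    """Return mean seconds per character over (sentence, duration_sec) pairs."""
    total_chars = sum(len(sentence) for sentence, _ in clips)
    total_secs = sum(duration for _, duration in clips)
    if total_chars == 0:
        raise ValueError("no characters in input")
    return total_secs / total_chars

def min_chars_for_duration(target_secs, secs_per_char):
    """Smallest character count expected to reach target_secs."""
    return math.ceil(target_secs / secs_per_char)

# Hypothetical sample data, not real Common Voice measurements.
clips = [("What do you want?", 1.5), ("The weather is very nice today.", 2.6)]
rate = average_char_duration(clips)
min_chars = min_chars_for_duration(5.0, rate)
print(f"{rate:.3f} sec/char, minimum of about {min_chars} characters for 5 sec")
```

For languages whose rule files count words instead of characters, the same idea applies with an average word duration.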

JJun-Guo · 2023-06-06T05:10:19Z

Wouldn't such short durations affect downstream tasks, such as speech recognition? The overall duration range of the dataset is 1-10 s, so in which range is most of the data concentrated?

HarikalarKutusu · 2023-06-06T05:31:43Z

AFAIK, a rule of thumb is to train a model with the kind of data it will see in the wild. For a general-purpose ASR model that is subjected to everyday speech, I think the data should include shorter recordings, because spontaneous speech/conversations include them extensively, such as short answers to questions (yes, no, ok, fine, etc.) or "What do you want?" => "Tea..."

I think it is best to have more-or-less evenly distributed durations (a flat curve), and thus sentence lengths. One could work on improving their Common Voice dataset to remedy peaks in the distribution.

I created webapps where people can examine their datasets in more detail, which also helps in this area - for all CV languages.
For example, this is the duration distribution of CV 13.0 Turkish validated recordings:

And this is the distribution in text corpus:

Because we had few CC0 sentence resources, we had to rely on volunteers writing common everyday sentences, which are short and dropped the average recording duration from around 4 sec to 3.6 sec. We need to remedy this issue...

You can check your language from here:
https://analyzer.cv-toolbox.web.tr/

You can also check the overall changes in time here:
https://metadata.cv-toolbox.web.tr/

HarikalarKutusu · 2023-06-06T05:59:12Z

If you are working on the cv-sentence-extractor rules (first run):

Getting longer sentences is better, I think. It is easier to get shorter sentences from other sources. Once the extractor has taken sentences from an article, that article is done.

Some points on this:

As stated, state-of-the-art models train better with recordings of 5 sec and longer.
Most languages in CV have an average of 4-6 sec.
A longer sentence will result in a longer recording.
And finally, what matters is the total duration of the training/fine-tuning set.
Instead of getting 3000 sentences from 1000 articles with an average of 4 sec, if you select them so that the average is 8 sec, the total duration will double.
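
To put numbers on the last point, here is the arithmetic as a small Python snippet. It assumes each sentence is recorded exactly once, which is a simplification; the helper name is made up for illustration.

```python
def total_hours(num_sentences, avg_secs):
    """Total recorded duration in hours if each sentence is recorded once."""
    return num_sentences * avg_secs / 3600

shorter = total_hours(3000, 4)  # 3000 sentences at a 4 sec average
longer = total_hours(3000, 8)   # same sentence count at an 8 sec average
print(f"{shorter:.2f} h vs {longer:.2f} h (x{longer / shorter:.1f})")
```

So the same 1000 articles would yield roughly 3.3 hours versus 6.7 hours of audio.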

MichaelKohler · 2023-06-07T21:41:00Z

> Instead of getting 3000 sentences from 1000 articles with an average of 4 sec, if you select them so that the average is 8 sec, the total duration will double.

Not wrong, but it might be risky without proper testing. Note that if the Sentence Extractor can't find 3 sentences with the required length, it will not keep trying with fewer words; it will just use what it got and continue on to the next article. Of course, with proper analysis of the source it would be possible to fully optimize this.

HarikalarKutusu · 2023-06-07T21:46:24Z

@MichaelKohler, can this be made adaptive? I mean, not to put an absolute minimum, but set a "requested_minimum", if the 3 sentences are not found, fill it with shorter ones...

MichaelKohler · 2023-06-07T21:50:06Z

Yes, that would certainly be an option, but it would need to be implemented. Overall this would mean going over the sentences multiple times in the case where it doesn't find enough sentences the first time, but that is probably not a big hit on performance overall. In the end, it won't matter for development purposes, and it's fine for the final run as well, since that runs in the GitHub Action.
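
A minimal sketch of the two-pass idea in Python, just to illustrate the logic. This is not how the Sentence Extractor is implemented (it is a Rust tool that picks randomly and applies many more rules); `pick_sentences`, `requested_min_words` and `absolute_min_words` are hypothetical names, and the sketch simply keeps the input order.

```python
def pick_sentences(sentences, requested_min_words, absolute_min_words, limit=3):
    """Pick up to `limit` sentences, preferring the requested minimum length.

    Shorter sentences (still >= absolute_min_words) are used only when
    the first pass does not fill the quota.
    """
    def word_count(sentence):
        return len(sentence.split())

    # First pass: sentences that satisfy the requested minimum.
    picked = [s for s in sentences if word_count(s) >= requested_min_words][:limit]

    # Second pass: fill the remaining slots with shorter sentences.
    if len(picked) < limit:
        fallback = [
            s for s in sentences
            if absolute_min_words <= word_count(s) < requested_min_words
        ]
        picked.extend(fallback[: limit - len(picked)])
    return picked

article = ["a b c d e f", "a b", "a b c", "a", "a b c d e f g"]
print(pick_sentences(article, requested_min_words=5, absolute_min_words=2))
```

The second pass only runs for articles that come up short, which matches the expectation above that the extra cost should be small.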

HarikalarKutusu · 2023-06-07T21:54:02Z

As you know, working on this was on my to-do list, if only I can get really good results... I'll look into this. E.g., sorting sentences by length can help performance.

MichaelKohler · 2023-06-07T22:01:25Z

> sorting sentences by length can help performance.

Mh, this made me think. Now I wonder if the legal requirement is just "maximum 3 sentences per article" or if there could be issues if we always pick the 3 longest sentences. In some articles the longest 3 sentences might be the majority of content. Probably something that would need to be verified just to make sure. To be clear: I only ever knew about the "maximum 3 sentences per article" without any further restrictions, but I can't guarantee that this is exactly what the lawyers said.

HarikalarKutusu · 2023-06-07T22:08:18Z

Very good point... But this is how it works now, isn't it? So, as of now, if an article has 3 sentences, they are taken if the rules match.
One could add a check for it so that the char count of the selected (verified by rules) sentences is at most (say) 50% of the total for example.
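
Something like the following Python sketch, where the 50% threshold is only the example value from above and `within_share` is a made-up helper, not an existing rule in the extractor. Whether such a limit is legally needed at all is still an open question.

```python
def within_share(selected, article_text, max_share=0.5):
    """True if the selected sentences cover at most max_share of the article's characters."""
    if not article_text:
        return False
    selected_chars = sum(len(sentence) for sentence in selected)
    return selected_chars / len(article_text) <= max_share

article_text = "Short one. A much longer second sentence follows here. End."
print(within_share(["Short one."], article_text))
print(within_share(["A much longer second sentence follows here."], article_text))
```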

MichaelKohler · 2023-06-08T21:06:13Z

Right now it's fully random, but rejects whatever does not fit the rules. So generally, by analyzing the full Wikipedia dump, you could optimize the minimum words rule to get the most words out. But that would be different from always taking the longest sentences.

Of course depending on the requirements additional rules can be added. At this point I don't even know if it would be a problem or not to do it that way.

HarikalarKutusu · 2023-06-09T03:43:08Z

As I mentioned above, with the state-of-the-art models and HW advancements, it is better to get longer audio, thus longer texts. A change in this repo towards this goal would be awesome. Especially because there is no going back once 3-4 word sentences are taken...

With longer sentences, duplicates/similarities will also drop substantially, and more possible vocabulary will go into the text-corpus. I think more common words are already in the corpora or can easily be added from other sources, but less frequent ones will be needed by everyone (if too-technical/problematic/hard-to-read ones got correctly ruled out).

If it is legally possible of course...

MichaelKohler · 2023-06-14T19:54:13Z

@jessicarose Analogous to the other question I tagged you in, could you also check here whether we would, in theory, be allowed to always take the 3 longest sentences per article? Thanks!

HarikalarKutusu · 2023-06-22T14:14:03Z

Sorry to ping the issue...

I'm nearly finalizing my work and I need to ask if taking the longest three sentences will ever be possible - because there is no going back.

HarikalarKutusu · 2024-03-16T18:24:20Z

> Is there a limit to the audio duration?

@JJun-Guo, the recording limit was increased to 15 seconds in Common Voice v1.114.2.

@MichaelKohler: Probably all rule files should adapt to this change, including the defaults.

MichaelKohler · 2024-03-16T22:01:34Z

@HarikalarKutusu Thanks for keeping track of this. I agree. Do you know what the correct value for EN would be, so that we can set it as the default? And do you have time to reach out to all language contributors to get a new estimate? I'd be fine with one PR updating all the values, as I think it's a rather low-risk change. One thing to note is that some languages use characters and some use words.

HarikalarKutusu · 2024-03-17T01:48:35Z

@MichaelKohler I think a 50% increase should be fine for both max words and characters.
The problem is with the minimums. The new Common Voice "guideline" suggests 10-15 sec recording times on par with the newer model architectures.

With the new v17.0, I can add some character speed measurements, possibly per user, and their distribution to the Analyzer, so that one can, for example, see the 95th percentile coverage from those values. But that part should be handled by the communities, as you suggest.

Languages that have already run the cv-sentence-extractor most probably got shorter sentences and might like to increase that limit for re-runs, also taking into account the recently introduced rules.

I have time for PRs and posts in Discourse, but you might need to point to them in case somebody decides on a re-run...

MichaelKohler · 2024-03-17T08:49:22Z

> but you might need to point to them in case somebody decides on a re-run...

I can try to keep this in mind :)

HarikalarKutusu · 2024-04-03T18:36:24Z

I opened a discussion here, your input would be very valuable:
https://discourse.mozilla.org/t/discussion-best-practices-steps-for-increased-recording-duration/129205

MichaelKohler added the "help wanted", "question", and "extract-improvements" labels on Jun 13, 2023.

MichaelKohler removed the "help wanted" label on Jun 19, 2023.
