CTC decoder (number of alignments, speech to text basics) #2223

p0p4k · 2022-05-25T02:34:42Z


This post assumes the reader already knows the basics of speech-to-text (STT).

When reading about CTC loss on Distill, I encountered the following sentence:

Here, U is the length of the STT output text Y and T is the length of the audio input X.
U counts text characters and T counts audio frames. Usually, U <= T.
With ε being the blank symbol that represents silence in the aligned text, find the total number of length-T alignments of X that collapse to Y.
Take the example sequence Y = {a,b} (U = |Y| = 2) with T = |X| = 6.
The problem can be visualized as shown in the picture below.
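
For a concrete feel of what is being counted, here is a minimal brute-force sketch in Python. It assumes the usual CTC collapsing rule (merge consecutive repeats, then drop ε); the names `collapse` and `brute_force_alignments` are made up for this illustration and are not from any library.

```python
from itertools import product

BLANK = "ε"

def collapse(alignment):
    """Merge consecutive repeats, then drop blanks (assumed CTC collapsing rule)."""
    out = []
    prev = None
    for sym in alignment:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

def brute_force_alignments(y, t):
    """Try every length-t sequence over the symbols of y plus blank
    and keep the ones that collapse to y."""
    alphabet = sorted(set(y)) + [BLANK]
    return [a for a in product(alphabet, repeat=t) if collapse(a) == y]

alignments = brute_force_alignments("ab", 6)
print(len(alignments))  # 70
```

This checks all 3^6 = 729 candidate sequences, so it is only practical for toy sizes like the example here.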

To simplify the problem further, notice that the final alignment for X will always look something like
[{...ε...} {...a...} {...ε...} {...b...} {...ε...}] .
Sometimes there might not be any ε, and sometimes 'a' and/or 'b' fills only one box (but at least one 'a' and one 'b' are required to get 'ab').
The hint is that we just need to find 4 boxes: where 'a' begins, where 'a' ends, where 'b' begins, and where 'b' ends. Once 'a' ends we don't have to worry about 'a' anymore, and the same goes for 'b'. Fill up the rest of the boxes with 'ε'.
At first, I thought we could add 4 boxes (i.e., 2U) to T, pick 4 boxes to represent the start and end points of 'a' and 'b' in order, and fill the rest with 'ε'. However, what if we choose 4 adjacent boxes? Then there is no room to fill in an 'a' or a 'b'. The trick is to add just 2 boxes (i.e., U) to T and still pick 4 boxes; now picked boxes 1 and 3 themselves hold 'a' and 'b' as their start points, while picked boxes 2 and 4 only mark where 'a' and 'b' stop and are dropped afterwards, leaving exactly T boxes. This ensures that whichever 4 boxes we choose, we always end up with at least one 'a' and one 'b'.
So, the final answer is C(T+U, 2U), which is the same as C(T+U, T-U). Choose 2U boxes from T+U boxes, fill from each of the U start boxes up to its end marker with the U characters respectively, and fill the rest with 'ε'. For the example, C(8, 4) = 70. Note that this count assumes Y has no repeated adjacent characters, since a repeat such as 'aa' needs at least one ε between the two characters.
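
The box-picking argument can be sketched in Python as follows. This is only an illustration under the assumptions above (Y has no repeated adjacent characters; boxes are numbered from 0); `alignment_from_boxes` and `alignments_by_boxes` are hypothetical helper names.

```python
from itertools import combinations
from math import comb

BLANK = "ε"

def alignment_from_boxes(y, t, chosen):
    """Turn 2*len(y) picked boxes (out of t + len(y)) into a length-t alignment.

    The 1st, 3rd, ... picks are where a character starts; the 2nd, 4th, ...
    picks only mark where it stops and are dropped at the end.
    """
    row = [BLANK] * (t + len(y))
    end_markers = set()
    for i, ch in enumerate(y):
        start, end = chosen[2 * i], chosen[2 * i + 1]
        for pos in range(start, end):  # the character fills start .. end-1
            row[pos] = ch
        end_markers.add(end)
    return "".join(sym for pos, sym in enumerate(row) if pos not in end_markers)

def alignments_by_boxes(y, t):
    """Build one alignment per way of picking 2U boxes from T+U boxes."""
    n = t + len(y)
    picks = combinations(range(n), 2 * len(y))  # each pick is a sorted tuple
    return {alignment_from_boxes(y, t, c) for c in picks}

y, t = "ab", 6
found = alignments_by_boxes(y, t)
print(len(found), comb(t + len(y), 2 * len(y)))  # 70 70
```

Collecting the results in a set shows that different picks never produce the same alignment, which is why the number of picks equals the number of alignments.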

What is the significance of this calculation?

The number of different alignments is what a brute-force search would have to go through to find the most probable alignment for a given audio file. CTC aims to reduce this search region by using dynamic programming. I will try to explain that next time.
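
As a small preview of why dynamic programming helps, the sketch below counts the alignments without listing them. It assumes the standard CTC lattice layout (ε interleaved around every character of Y, with a skip over ε allowed only between two different characters). It is a counting illustration only, not the CTC loss itself, and `count_alignments` is a made-up name.

```python
BLANK = "ε"

def count_alignments(y, t):
    """Count length-t alignments collapsing to y with a CTC-style lattice (t >= 1)."""
    # Extended target: ε y1 ε y2 ... ε
    z = [BLANK]
    for ch in y:
        z += [ch, BLANK]

    # counts[s] = number of partial alignments so far that end in state s
    counts = [0] * len(z)
    counts[0] = 1              # start with ε ...
    if len(z) > 1:
        counts[1] = 1          # ... or directly with the first character

    for _ in range(t - 1):
        nxt = [0] * len(z)
        for s in range(len(z)):
            total = counts[s]                      # stay in the same state
            if s >= 1:
                total += counts[s - 1]             # advance by one state
            if s >= 2 and z[s] != BLANK and z[s] != z[s - 2]:
                total += counts[s - 2]             # skip the ε between two different characters
            nxt[s] = total
        counts = nxt

    # A valid alignment ends on the last character or on the trailing ε.
    return counts[-1] + (counts[-2] if len(z) > 1 else 0)

print(count_alignments("ab", 6))  # 70
```

The loop does roughly T * (2U + 1) additions, compared with the 3^6 candidates a brute-force enumeration would check for the same example.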
If you have a simpler way of understanding the combination result, please share it. Thanks and good day! 🐸
