Question on stream_asr.end() function for streaming asr #76
Comments
Hi @espnetUser, |
Hi @Masao-Someki, my question only concerns the beam search call in stream_asr.end(). Let me explain with an example. Here is a list of debug logs that show the times and the beam searches used, as well as the position (output index) in the beam search, plus some comments about when the stream_asr.start()/end() calls were made:
As you can see, the asr_stream.end() function, which calls the batch beam search, restarts decoding from position 0. So I am wondering if the following line https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151
could be changed to call self.beam_search() instead,
making the final decoding incremental and avoiding the delay. Hope this makes my question clearer? |
Hi @espnetUser and @Masao-Someki , |
@Masao-Someki @espnetUser @ShigekiKarita @Fhrozen can any one of you please tell me if the batch beam search online is equivalent to the espnet bin inference streaming? Or at least help me by giving a way to find that out. |
Hello, my output from the pth model is not coming out the same as from the onnx streaming asr model. Please help. |
What do you mean by not the same? Could you share the logs so we could look for any error. |
The code is running fine, but for some sentences in my dataset the onnx model inference output is not the same as the pth model output. Here are some examples below; approximately 15% of the output does not match.
PTH_op: βCλə ζJ @bθəμə OCWCJ
PTH_op: ζB&ə ∞əF ∇əΩə ∇əΩə ψF∞θə ∞əF B!ə ζF∞C ζF∞C ψF∞θə OL λC@Bθəμə OəλL
PTH_op: OθB Bαə ∞əF VBλə ∞əF VBλə ψF∞θə ψF∞θə αB⊃Və ζαC∞ə JOə πL OL @bθəμə Oəλə ζəOə&J ΩK⊃
I will share the logs ASAP. |
Hello @Fhrozen, any idea? Actually I had switched off the logs. Also there are no errors in the code; only the ONNX output mismatches the pth model output, as in the examples I gave before. |
Mismatches between pth and onnx models are common, and can be larger depending on the language. Just in case, try changing the hyperparameters for decoding, such as the beam size, ctc-weight, and similar. You may find the config file in the same folder where the onnx model is located. |
Hi @sanjuktasr, sorry for the late reply. |
Hey, thank you. I have verified the configurations several times. Some of the outputs are not matching w.r.t. the pth model inference. I will get back to you ASAP on the other things like the ONNX version. We can connect if possible to understand what's going wrong. |
@Masao-Someki @Fhrozen |
I do not think that kind of detail is enough.
You may see the details about |
@Fhrozen This part is for beam search, and I have checked the configuration for the other modules also. They look fine. What could be the probable causes of error other than the configuration? |
Hello @Fhrozen @Masao-Someki, I have checked the configs thoroughly several times, but there are no issues there. Can you tell me what the possible reasons for this issue could be? I am using the original available code base. |
@sanjuktasr |
@Masao-Someki |
@Masao-Someki @Fhrozen |
Would you check if the stft configuration is using the correct padding mode, as follows, in stft.py:
stft_kwargs = dict(
    n_fft=self.config.n_fft,
    win_length=self.config.win_length,
    hop_length=self.config.hop_length,
    center=self.config.center,
    window=self.config.window,
    pad_mode="reflect",  # <- check this line
)
I've found an index issue during the inference, and I'm working on this.
offset = (
    self.config.block_size
    - self.config.look_ahead
    - self.config.hop_size
)  # delete +1 here
The model output would be the same with this bugfix, but the resulting sentence might differ. I have changed the beam search in the |
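As a purely illustrative arithmetic check of the offset fix above (the values here are the common espnet2 contextual-block defaults, not numbers taken from this thread): with block_size=40, hop_size=16 and look_ahead=16, the corrected expression gives offset = 40 - 16 - 16 = 8 frames, whereas the old expression with the extra +1 gave 9.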
@Masao-Someki ok thanks will check and let you know. Thanks for the update. |
I made 2 changes: |
@sanjuktasr |
Thanks @Masao-Someki, will try that and update. |
@Masao-Someki |
Hi @sanjuktasr and @espnetUser, thank you for your reports; I fixed streaming-related bugs in #83. |
Hi @Masao-Someki @espnetUser, |
@sanjuktasr Change espnet_onnx/espnet_onnx/asr/asr_streaming.py, lines 132 to 136 (at 46b06f1),
to
process_num = (len(speech) - self.initial_wav_length + look_ahead_wav_len) // self.hop_size + 1
where
|
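To make the proposed formula concrete, here is a toy calculation; none of these numbers come from the thread or from an actual config, they only illustrate how process_num counts the blocks that can be processed from the audio received so far.
# illustrative values only -- not taken from any real model config
initial_wav_length = 19200
look_ahead_wav_len = 10240
hop_size = 10240
len_speech = 48000  # samples received so far

process_num = (len_speech - initial_wav_length + look_ahead_wav_len) // hop_size + 1
print(process_num)  # (48000 - 19200 + 10240) // 10240 + 1 = 3 + 1 = 4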
Hi @Masao-Someki,
pth : αμCρə OMμə ∞BC∞ə ζJφJ∞ə ⊂λC ζCOζə ∞BC∞ə βBCφə J!ə !F ⊂λC ζJφJ∞ə
pth : OF@ə θF αμCρə Oə∞JO!ə εC !F ζCOə ζJφJ∞ə J!ə βLλə J!J@ə ρCλLζCOζə ζJφJ∞ə βLλə
The last character is still an issue, and some characters are substituted. Ideally speaking, there is a degradation of accuracy in this model. Please let me know if there is anything that can be done to resolve this issue. |
Hi @Masao-Someki , |
Okay, so the missing words/characters, especially at the end, are caused by some issue in the look-ahead tensor not getting processed through the encoder? So is streaming.py in the encoder the code to be fixed here, or something else too? |
Hi @Masao-Someki,
There must be some issue in the block processing related to the buffer or look-ahead. Could you please let me know if I am right about this? |
Hi @sanjuktasr, I noticed that the contextual block is not correctly padded to the tensor for the final inference. I fixed this issue in #85. |
Hey @Masao-Someki, thanks for your help again. |
@sanjuktasr,
import librosa
import numpy as np
import torch
stft_kwargs_librosa = dict(
n_fft=512,
win_length=512,
hop_length=160,
center=False,
window='hann',
pad_mode="reflect",
)
stft_kwargs_torch = dict(
n_fft=512,
win_length=512,
hop_length=160,
center=False,
window=torch.hann_window(512),
pad_mode="reflect",
)
a = np.random.random(32000)
ol = librosa.stft(a[:16000], **stft_kwargs_librosa)
olp = ol.real ** 2 + ol.imag ** 2
ot = torch.stft(torch.from_numpy(a), **stft_kwargs_torch)  # note: newer torch versions may require return_complex to be passed explicitly
otp = ot[..., 0] ** 2 + ot[..., 1] ** 2
((olp - otp.numpy()[:, :olp.shape[1]]) ** 2).mean()
# 3.1480377431290635e-09
ol = librosa.stft(a[16000:], **stft_kwargs_librosa)
olp = ol.real ** 2 + ol.imag ** 2
((olp - otp.numpy()[:, -olp.shape[1]:]) ** 2).mean()
# 3.0487974021161266e-09
However, since the stft function will add padding to both the beginning and the end of the wav, center=True will get a different result.
stft_kwargs_librosa['center'] = True
stft_kwargs_torch['center'] = True
a = np.random.random(32000)
ol = librosa.stft(a[:16000], **stft_kwargs_librosa)
olp = ol.real ** 2 + ol.imag ** 2
ot = torch.stft(torch.from_numpy(a), **stft_kwargs_torch)
otp = ot[..., 0] ** 2 + ot[..., 1] ** 2
((olp - otp.numpy()[:, :olp.shape[1]]) ** 2).mean()
# 35.21981594009537
ol = librosa.stft(a[16000:], **stft_kwargs_librosa)
olp = ol.real ** 2 + ol.imag ** 2
((olp - otp.numpy()[:, -olp.shape[1]:]) ** 2).mean()
# 29.573234833762136
In the streaming context, we need to incrementally apply stft to the wav, so we need to set center=False. |
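Building on the snippet above, one way to chunk the signal without dropping the frames that span a chunk boundary (an illustrative sketch, not the espnet_onnx implementation) is to start the next chunk exactly at the first frame position the previous chunk did not produce, i.e. at n_frames * hop_length:
import librosa
import numpy as np

n_fft, hop = 512, 160
kwargs = dict(n_fft=n_fft, win_length=n_fft, hop_length=hop, center=False, window="hann")

a = np.random.random(32000)
full = librosa.stft(a, **kwargs)

first = librosa.stft(a[:16000], **kwargs)
next_start = first.shape[1] * hop          # start sample of the first frame not yet produced
second = librosa.stft(a[next_start:], **kwargs)

chunked = np.concatenate([first, second], axis=1)
print(chunked.shape[1] == full.shape[1])   # True: no boundary frames are lost
print(np.abs(full - chunked).max())        # ~0: framing is identical to the full-signal stft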
Hey @Masao-Someki, I did the training after changing the stft config; no improvements. |
@sanjuktasr Sorry for the late reply,
Then I think we need step-by-step debugging. Please check the following.
Yes, the state variables for the decoder are not the same. Since we cannot export the onnx model with |
thanks @Masao-Someki |
@Masao-Someki, I realised the beam search that torch is using is not the same as the one onnx uses. |
@Masao-Someki, while implementing the forward/call function of beam_search.py there is a 100% match between ONNX and PTH, but the absolute model accuracy is dropping. |
@Masao-Someki , |
@sanjuktasr
Does that mean that the input to the decoder is the same as in the torch implementation? If the decoder inputs, including the states, are the same, but the output is different, then we may have a bug somewhere in the decoder process... |
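One low-tech way to verify that (a sketch with hypothetical file names, not part of espnet_onnx): dump the decoder log-probabilities from the PyTorch run and from the ONNX run at the same decoding step, then compare them numerically.
import numpy as np

# hypothetical dump files, written beforehand from the two decoders at the same step
logp_torch = np.load("logp_torch_step3.npy")
logp_onnx = np.load("logp_onnx_step3.npy")

print("max abs diff:", np.abs(logp_torch - logp_onnx).max())
# small float differences are expected; large ones point at the decoder itself
np.testing.assert_allclose(logp_torch, logp_onnx, atol=1e-4)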
@sanjuktasr Am I right that your current model is trained with center=False? The simulation script in espnet does not support frontend with center=False since it applies stft to the full audio, so I think we need to modify the espnet script a little.
|
@Masao-Someki, one of the issues I have seen in the case of the decoder batch_score function is that it is appending zeros as the 0th element of the states variable. |
@Masao-Someki I found an issue with the states variable in the decoder batch_score function: there is an array of zeros getting appended to the original states. |
@sanjuktasr espnet_onnx/espnet_onnx/export/asr/models/decoders/xformer.py Lines 52 to 55 in c074393
|
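For readers following along, the behaviour being described (zeros appearing as the first entry of states) is consistent with a common export pattern: an ONNX graph cannot be fed None as the initial decoder cache, so a fixed-shape zero placeholder is used at the first step and has to be replaced, not kept, once real cache entries exist. A purely hypothetical illustration (the sizes and names below are made up and are not the code at the lines referenced above):
import numpy as np

n_layers, batch, d_model = 6, 1, 256  # made-up sizes, for illustration only
# zero placeholder standing in for "no cache yet" at the first decoding step
states = [np.zeros((batch, 1, d_model), dtype=np.float32) for _ in range(n_layers)]
# if later steps keep this zero entry and append the real cache after it,
# the decoder scores will diverge from the PyTorch run
print(states[0].shape)  # (1, 1, 256)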
@Masao-Someki, so what could be the issue that the batch beam search online call function is not reproducing the same inference results? I am getting blank inference output. Please help. |
@Masao-Someki the issue is resolved. Thanks a lot for your help. |
Hi @Masao-Someki, I am seeing a similar problem to @sanjuktasr's, with poor performance of my espnet_onnx model when compared to the espnet2 pytorch version. I am focusing only on the streaming encoder part, though, and noticed that the encoder outputs are quite different between the onnx and pytorch models. I went through this issue and followed some suggestions, but so far nothing has helped to resolve the problem. Based on your discussion with @sanjuktasr I started debugging this in more detail following your list of points above and found the following:
Interestingly, the onnx and pytorch outputs do match exactly for the first iteration.
Because the encoder outputs for the first chunk match between espnet2 pytorch and onnx, stft can be ruled out as a cause here, right?
I am using the streaming encoder together with the espnet2 script (https://github.com/espnet/espnet/blob/master/espnet2/bin/asr_inference_streaming.py) to extract the encoder output for the pytorch model. From my understanding this script does not compute the outputs all at once; processing is done block-wise, so stft is not applied to the entire waveform at once but chunk-wise. There is some trimming code to handle stft padding effects (https://github.com/espnet/espnet/blob/master/espnet2/bin/asr_inference_streaming.py#L256-L281) which I don't see in the espnet_onnx streaming code. Because the first-iteration outputs match between espnet_onnx and espnet2, I am thinking the differences must somehow come from the different audio chunking/buffering/trimming code between the espnet_onnx and espnet streaming scripts. I would welcome and appreciate any pointers on how to match the encoder outputs between the espnet_onnx and espnet2 pytorch models. Thanks! |
Hi @espnetUser, thank you for your comment. |
Hi @Masao-Someki, thank you very much for your prompt reply.
Over the last few days I have been working on comparing different frontends from espnet and espnet-onnx in order to determine whether the differences can be explained by the padding/trimming parts. Here is a snapshot of filterbank channel 3 over time (frames) for the original streaming espnet (PYTORCH-FBANK) and the original espnet_onnx (ONNX_ORG_FBANK), as well as a modified espnet_onnx (ONNX_MOD_FBANK) where I replaced The figure shows that there is indeed a difference between the frontend features of espnet and espnet_onnx after However, there is still the shift in the encoder outputs after the first iteration, even when the filterbank input features to the encoder match exactly: So I am thinking there is some internal mismatch in how the encoder buffers/processes the chunks that leads to a mismatch/shift in the encoder outputs ... |
Check the encoder states. As far as I remember there was an issue with the next_state variable's value. The enc out mismatch was therefore starting from the 2nd instant. |
@sanjuktasr: Thanks for your reply.
Would this be the right place to look for computation of |
@sanjuktasr: Do you remember which entry in |
@Masao-Someki, @sanjuktasr: I checked the streaming espnet_onnx encoder code for anything that looks suspicious and found there is a "-1" in the which is not part of the espnet2 streaming encoder code: After removing the "-1" the encoder outputs looked much more in line with my espnet2 pytorch model: @Masao-Someki: Could you please double-check the "-1" for the |
@espnetUser |
Thanks for confirming, @Masao-Someki. Two follow-up questions:
Thanks! |
Sorry for the late reply, @espnetUser
|
@Masao-Someki: Thank you for replying and the update on #96. Looking forward trying it out soon! :) |
Hi @Masao-Someki,
In the readme the example for streaming asr shows the use of start() and end() methods:
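(The README snippet being referred to is roughly like the sketch below; the StreamingSpeech2Text class name and the hop_size attribute are as I recall them from the espnet_onnx README, while the tag name and the dummy audio are placeholders, not from the original post.)
import numpy as np
from espnet_onnx import StreamingSpeech2Text

stream_asr = StreamingSpeech2Text("your_model_tag")  # placeholder tag name

stream_asr.start()                                   # open the stream, e.g. when the microphone opens
for _ in range(10):                                  # one call per incoming chunk of hop_size samples
    chunk = np.zeros(stream_asr.hop_size, dtype=np.float32)  # stand-in for real microphone audio
    partial_result = stream_asr(chunk)
final_result = stream_asr.end()                      # flush the buffer and finalize the hypothesis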
In a real streaming scenario should the start() and end() methods be called whenever the microphone is opened and closed?
I am asking because I noticed that the end() function in https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151 calls the self.batch_beam_search() function, which will restart decoding from position 0 again, causing a rather large delay for longer speech inputs. If I change https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151 to use the self.beam_search() method instead, it avoids decoding the entire utterance at the end again and thus the delay.
Could you please clarify why self.batch_beam_search() is used in the stream_asr.end() function?
Thanks!