Question on stream_asr.end() function for streaming asr #76
Comments
Hi @espnetUser, |
Hi @Masao-Someki, my question only concerns the beam search call in stream_asr.end(). Let me explain with an example. Here is a list of debug logs that show the times and the beam searches used, as well as the position (output index) in the beam search, plus some comments about when the stream_asr.start()/end() calls were made:
As you can see, the asr_stream.end() function, which calls the batch beam search, restarts decoding from position 0. So I am wondering if the following line https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151
could be changed to call self.beam_search() instead,
making the final decoding incremental and avoiding the delay. Hope this makes my question clearer? |
Hi @espnetUser and @Masao-Someki , |
@Masao-Someki @espnetUser @ShigekiKarita @Fhrozen can any one of you please tell me if the batch beam search online is equivalent to the espnet bin inference streaming? Or at least help me by giving a way to find that out. |
Hello, my output from the pth model is not coming out the same as from the onnx streaming asr model. Please help. |
What do you mean by not the same? Could you share the logs so we could look for any error. |
The code is running fine, but for some sentences in my dataset the onnx model inference output is not the same as the pth model output. Here are some examples below; approximately 15% of the output does not match.
PTH_op: βCλə ζJ @bθəμə OCWCJ
PTH_op: ζB&ə ∞əF ∇əΩə ∇əΩə ψF∞θə ∞əF B!ə ζF∞C ζF∞C ψF∞θə OL λC@Bθəμə OəλL
PTH_op: OθB Bαə ∞əF VBλə ∞əF VBλə ψF∞θə ψF∞θə αB⊃Və ζαC∞ə JOə πL OL @bθəμə Oəλə ζəOə&J ΩK⊃
I will share the logs ASAP. |
Hello @Fhrozen, any idea? Actually I had switched off the logs. Also there are no errors in the code; only the ONNX output mismatches the pth model output, as in the examples I gave before. |
Mismatches between pth and onnx models are common, and can be larger depending on the language. Just in case, try changing the hyperparameters for decoding, such as the beam size, ctc-weight, and similar. You may find the config file in the same folder where the onnx model is located. |
Hi @sanjuktasr, sorry for the late reply. |
Hey, thank you. I have verified the configurations several times. Some of the outputs are not matching w.r.t. the pth model inference. I will get back to you ASAP on the other things like the ONNX version. We can connect if possible to understand what's going wrong. |
@Masao-Someki @Fhrozen |
I do not think that kind of detail is enough.
You may see the details about |
@Fhrozen This part is for beam search, and I have checked the configuration for the other modules also. They look fine. What could be the probable causes of error other than the configuration? |
Hello @Fhrozen @Masao-Someki, I have checked the configs thoroughly several times, but there are no issues there. Can you tell me what the possible reasons for this issue could be? I am using the original available code base. |
@sanjuktasr |
@Masao-Someki |
@Masao-Someki @Fhrozen |
Would you check if the stft configuration is using the correct padding mode, as follows, in stft.py:
stft_kwargs = dict(
    n_fft=self.config.n_fft,
    win_length=self.config.win_length,
    hop_length=self.config.hop_length,
    center=self.config.center,
    window=self.config.window,
    pad_mode="reflect",  # <- check this line
)
I've found an index issue during the inference, and I'm working on this.
offset = (
    self.config.block_size
    - self.config.look_ahead
    - self.config.hop_size
)  # delete +1 here
The model output would be the same with this bugfix, but the resulting sentence might differ. I have changed the beam search in the |
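As a purely illustrative arithmetic check of the offset fix above (the values here are the common espnet2 contextual-block defaults, not numbers taken from this thread): with block_size=40, hop_size=16 and look_ahead=16, the corrected expression gives offset = 40 - 16 - 16 = 8 frames, whereas the old expression with the extra +1 gave 9.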
@Masao-Someki ok thanks will check and let you know. Thanks for the update. |
I made 2 changes: |
@sanjuktasr |
Thanks @Masao-Someki, will try that and update. |
@Masao-Someki |
Hi @sanjuktasr and @espnetUser, thank you for your reports; I fixed streaming-related bugs in #83. |
Hi @Masao-Someki @espnetUser, |
@sanjuktasr Change espnet_onnx/espnet_onnx/asr/asr_streaming.py, lines 132 to 136 (at 46b06f1),
to
process_num = (len(speech) - self.initial_wav_length + look_ahead_wav_len) // self.hop_size + 1
where
|
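To make the proposed formula concrete, here is a toy calculation; none of these numbers come from the thread or from an actual config, they only illustrate how process_num counts the blocks that can be processed from the audio received so far.
# illustrative values only -- not taken from any real model config
initial_wav_length = 19200
look_ahead_wav_len = 10240
hop_size = 10240
len_speech = 48000  # samples received so far

process_num = (len_speech - initial_wav_length + look_ahead_wav_len) // hop_size + 1
print(process_num)  # (48000 - 19200 + 10240) // 10240 + 1 = 3 + 1 = 4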
Hi @Masao-Someki,
pth : αμCρə OMμə ∞BC∞ə ζJφJ∞ə ⊂λC ζCOζə ∞BC∞ə βBCφə J!ə !F ⊂λC ζJφJ∞ə
pth : OF@ə θF αμCρə Oə∞JO!ə εC !F ζCOə ζJφJ∞ə J!ə βLλə J!J@ə ρCλLζCOζə ζJφJ∞ə βLλə
The last character is still an issue, and some characters are substituted. Ideally speaking, there is a degradation of accuracy in this model. Please let me know if there is anything that can be done to resolve this issue. |
Hi @Masao-Someki , |
Okay, so the missing words/characters, especially at the end, are caused by some issue in the look-ahead tensor not getting processed through the encoder? So is streaming.py in the encoder the code to be fixed here, or something else too? |
Hi @Masao-Someki,
There must be some issue in the block processing related to the buffer or look-ahead. Could you please let me know if I am right about this? |
Hi @sanjuktasr, I noticed that the contextual block is not correctly padded to the tensor for the final inference. I fixed this issue in #85. |
Hey @Masao-Someki, thanks for your help again. |
@sanjuktasr,
import librosa
import numpy as np
import torch
stft_kwargs_librosa = dict(
n_fft=512,
win_length=512,
hop_length=160,
center=False,
window='hann',
pad_mode="reflect",
)
stft_kwargs_torch = dict(
n_fft=512,
win_length=512,
hop_length=160,
center=False,
window=torch.hann_window(512),
pad_mode="reflect",
)
a = np.random.random(32000)
ol = librosa.stft(a[:16000], **stft_kwargs_librosa)
olp = ol.real ** 2 + ol.imag ** 2
ot = torch.stft(torch.from_numpy(a), **stft_kwargs_torch)  # note: newer torch versions may require return_complex to be passed explicitly
otp = ot[..., 0] ** 2 + ot[..., 1] ** 2
((olp - otp.numpy()[:, :olp.shape[1]]) ** 2).mean()
# 3.1480377431290635e-09
ol = librosa.stft(a[16000:], **stft_kwargs_librosa)
olp = ol.real ** 2 + ol.imag ** 2
((olp - otp.numpy()[:, -olp.shape[1]:]) ** 2).mean()
# 3.0487974021161266e-09
However, since the stft function will add padding to both the beginning and the end of the wav, center=True will get a different result.
stft_kwargs_librosa['center'] = True
stft_kwargs_torch['center'] = True
a = np.random.random(32000)
ol = librosa.stft(a[:16000], **stft_kwargs_librosa)
olp = ol.real ** 2 + ol.imag ** 2
ot = torch.stft(torch.from_numpy(a), **stft_kwargs_torch)
otp = ot[..., 0] ** 2 + ot[..., 1] ** 2
((olp - otp.numpy()[:, :olp.shape[1]]) ** 2).mean()
# 35.21981594009537
ol = librosa.stft(a[16000:], **stft_kwargs_librosa)
olp = ol.real ** 2 + ol.imag ** 2
((olp - otp.numpy()[:, -olp.shape[1]:]) ** 2).mean()
# 29.573234833762136
In the streaming context, we need to incrementally apply stft to the wav, so we need to set center=False. |
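Building on the snippet above, one way to chunk the signal without dropping the frames that span a chunk boundary (an illustrative sketch, not the espnet_onnx implementation) is to start the next chunk exactly at the first frame position the previous chunk did not produce, i.e. at n_frames * hop_length:
import librosa
import numpy as np

n_fft, hop = 512, 160
kwargs = dict(n_fft=n_fft, win_length=n_fft, hop_length=hop, center=False, window="hann")

a = np.random.random(32000)
full = librosa.stft(a, **kwargs)

first = librosa.stft(a[:16000], **kwargs)
next_start = first.shape[1] * hop          # start sample of the first frame not yet produced
second = librosa.stft(a[next_start:], **kwargs)

chunked = np.concatenate([first, second], axis=1)
print(chunked.shape[1] == full.shape[1])   # True: no boundary frames are lost
print(np.abs(full - chunked).max())        # ~0: framing is identical to the full-signal stft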
Hey @Masao-Someki, I did the training after changing the stft config; no improvements. |
@sanjuktasr Sorry for the late reply,
Then I think we need step-by-step debugging. Please check the following.
Yes, the state variables for the decoder are not the same. Since we cannot export the onnx model with |
thanks @Masao-Someki |
@Masao-Someki, I realised the beam search that torch is using is not the same as the one onnx uses. |
@Masao-Someki, while implementing the forward/call function of beam_search.py there is a 100% match between ONNX and PTH, but the absolute model accuracy is dropping. |
@Masao-Someki , |
@sanjuktasr
Does that mean that the input to the decoder is the same as in the torch implementation? If the decoder inputs, including the states, are the same, but the output is different, then we may have a bug somewhere in the decoder process... |
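One low-tech way to verify that (a sketch with hypothetical file names, not part of espnet_onnx): dump the decoder log-probabilities from the PyTorch run and from the ONNX run at the same decoding step, then compare them numerically.
import numpy as np

# hypothetical dump files, written beforehand from the two decoders at the same step
logp_torch = np.load("logp_torch_step3.npy")
logp_onnx = np.load("logp_onnx_step3.npy")

print("max abs diff:", np.abs(logp_torch - logp_onnx).max())
# small float differences are expected; large ones point at the decoder itself
np.testing.assert_allclose(logp_torch, logp_onnx, atol=1e-4)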
@sanjuktasr Am I right that your current model is trained with center=False? The simulation script in espnet does not support frontend with center=False since it applies stft to the full audio, so I think we need to modify the espnet script a little.
|
@Masao-Someki, one of the issues I have seen in the case of the decoder batch_score function is that it is appending zeros as the 0th element of the states variable. |
@Masao-Someki I found an issue with the states variable in the decoder batch_score function: there is an array of zeros getting appended to the original states. |
@sanjuktasr espnet_onnx/espnet_onnx/export/asr/models/decoders/xformer.py Lines 52 to 55 in c074393
|
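For readers following along, the behaviour being described (zeros appearing as the first entry of states) is consistent with a common export pattern: an ONNX graph cannot be fed None as the initial decoder cache, so a fixed-shape zero placeholder is used at the first step and has to be replaced, not kept, once real cache entries exist. A purely hypothetical illustration (the sizes and names below are made up and are not the code at the lines referenced above):
import numpy as np

n_layers, batch, d_model = 6, 1, 256  # made-up sizes, for illustration only
# zero placeholder standing in for "no cache yet" at the first decoding step
states = [np.zeros((batch, 1, d_model), dtype=np.float32) for _ in range(n_layers)]
# if later steps keep this zero entry and append the real cache after it,
# the decoder scores will diverge from the PyTorch run
print(states[0].shape)  # (1, 1, 256)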
@Masao-Someki, so what could be the issue that the batch beam search online call function is not reproducing the same inference results? I am getting blank inference output. Please help. |
@Masao-Someki the issue is resolved. Thanks a lot for your help. |
Hi @Masao-Someki, I am seeing a similar problem to @sanjuktasr's, with poor performance of my espnet_onnx model when compared to the espnet2 pytorch version. I am focusing only on the streaming encoder part, though, and noticed that the encoder outputs are quite different between the onnx and pytorch models. I went through this issue and followed some suggestions, but so far nothing has helped to resolve the problem. Based on your discussion with @sanjuktasr I started debugging this in more detail following your list of points above and found the following:
Interestingly, the onnx and pytorch outputs do match exactly for the first iteration.
Because the encoder outputs for the first chunk match between espnet2 pytorch and onnx, stft can be ruled out as a cause here, right?
I am using the streaming encoder together with the espnet2 script (https://github.com/espnet/espnet/blob/master/espnet2/bin/asr_inference_streaming.py) to extract the encoder output for the pytorch model. From my understanding this script does not compute the outputs all at once; processing is done block-wise, so stft is not applied to the entire waveform at once but chunk-wise. There is some trimming code to handle stft padding effects (https://github.com/espnet/espnet/blob/master/espnet2/bin/asr_inference_streaming.py#L256-L281) which I don't see in the espnet_onnx streaming code. Because the first-iteration outputs match between espnet_onnx and espnet2, I am thinking the differences must somehow come from the different audio chunking/buffering/trimming code between the espnet_onnx and espnet streaming scripts. I would welcome and appreciate any pointers on how to match the encoder outputs between the espnet_onnx and espnet2 pytorch models. Thanks! |
Hi @espnetUser, thank you for your comment. |
Hi @Masao-Someki, thank you very much for your prompt reply.
Over the last few days I have been working on comparing different frontends from espnet and espnet-onnx in order to determine whether the differences can be explained by the padding/trimming parts. Here is a snapshot of filterbank channel 3 over time (frames) for the original streaming espnet (PYTORCH-FBANK) and the original espnet_onnx (ONNX_ORG_FBANK), as well as a modified espnet_onnx (ONNX_MOD_FBANK) where I replaced The figure shows that there is indeed a difference between the frontend features of espnet and espnet_onnx after However, there is still the shift in the encoder outputs after the first iteration, even when the filterbank input features to the encoder match exactly: So I am thinking there is some internal mismatch in how the encoder buffers/processes the chunks that leads to a mismatch/shift in the encoder outputs ... |
Check the encoder states. As far as I remember there was an issue with the next_state variable's value. The enc out mismatch was therefore starting from the 2nd instant. |
@sanjuktasr: Thanks for your reply.
Would this be the right place to look for computation of |
@sanjuktasr: Do you remember which entry in |
@Masao-Someki, @sanjuktasr: I checked the streaming espnet_onnx encoder code for anything that looks suspicious and found there is a "-1" in the which is not part of the espnet2 streaming encoder code: After removing the "-1" the encoder outputs looked much more in line with my espnet2 pytorch model: @Masao-Someki: Could you please double-check the "-1" for the |
@espnetUser |
Thanks for confirming, @Masao-Someki. Two follow-up questions:
Thanks! |
Sorry for the late reply, @espnetUser
|
@Masao-Someki: Thank you for replying and the update on #96. Looking forward trying it out soon! :) |
Hi @Masao-Someki,
In the readme the example for streaming asr shows the use of start() and end() methods:
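(The README snippet being referred to is roughly like the sketch below; the StreamingSpeech2Text class name and the hop_size attribute are as I recall them from the espnet_onnx README, while the tag name and the dummy audio are placeholders, not from the original post.)
import numpy as np
from espnet_onnx import StreamingSpeech2Text

stream_asr = StreamingSpeech2Text("your_model_tag")  # placeholder tag name

stream_asr.start()                                   # open the stream, e.g. when the microphone opens
for _ in range(10):                                  # one call per incoming chunk of hop_size samples
    chunk = np.zeros(stream_asr.hop_size, dtype=np.float32)  # stand-in for real microphone audio
    partial_result = stream_asr(chunk)
final_result = stream_asr.end()                      # flush the buffer and finalize the hypothesis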
In a real streaming scenario should the start() and end() methods be called whenever the microphone is opened and closed?
I am asking because I noticed that the end() function in https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151 calls the self.batch_beam_search() function, which will restart decoding from position 0 again, causing a rather large delay for longer speech inputs. If I change https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151 to use the self.beam_search() method instead, it avoids decoding the entire utterance at the end again and thus the delay.
Could you please clarify why self.batch_beam_search() is used in the stream_asr.end() function?
Thanks!