ONNX model takes too long to run #7129
I have exported a NeMo ASR model to ONNX. When I run the ONNX model for inference, it takes much longer than the original .nemo model. I am running both on the same machine in the same conda environment.

Comments
What model are you trying? If it's RNNT, then ONNX is indeed a bit slower.
I am using stt_en_fastconformer_transducer_xlarge. Yes, it's RNNT.
Also, does the model performance change after exporting to ONNX?
ONNX inference does not have bound inputs on GPU, so there is too much CPU-GPU transfer, which slows it down. The exported model itself is not slower by any significant measure. PyTorch/TorchScript can keep all tensors on the GPU, so it appears faster.
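For reference, ONNX Runtime's IO binding API can remove most of that CPU-GPU transfer. A minimal sketch in Python, assuming a NeMo-exported encoder with inputs named `audio_signal` and `length` and an output named `outputs` (check `sess.get_inputs()` / `sess.get_outputs()` for the actual names in your export):

```python
import numpy as np
import onnxruntime as ort
import torch

sess = ort.InferenceSession(
    "encoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Tensors already resident on the GPU (e.g. produced by a GPU preprocessor).
audio_signal = torch.randn(1, 80, 1000, device="cuda")   # shape is illustrative only
length = torch.tensor([1000], dtype=torch.int64, device="cuda")

binding = sess.io_binding()
binding.bind_input("audio_signal", device_type="cuda", device_id=0,
                   element_type=np.float32, shape=tuple(audio_signal.shape),
                   buffer_ptr=audio_signal.data_ptr())
binding.bind_input("length", device_type="cuda", device_id=0,
                   element_type=np.int64, shape=tuple(length.shape),
                   buffer_ptr=length.data_ptr())
# Let ORT allocate the output on the GPU instead of copying it back to host memory.
binding.bind_output("outputs", device_type="cuda", device_id=0)

sess.run_with_iobinding(binding)
encoded = binding.get_outputs()[0]   # OrtValue that stays on the GPU
```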
Okay, that makes sense. Also, I found that the TensorRT provider was making inference slower, so I changed it to CUDA. When I export a large model from NeMo to ONNX, I get separate files for the encoder, decoder, and weights. Is there any way to get a single file? I asked this question on another issue (#6759) as well, but I didn't figure out how to resolve it.
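Selecting the CUDA execution provider explicitly (with a CPU fallback) looks like this; `get_providers()` confirms which providers were actually registered:

```python
import onnxruntime as ort

# Prefer the CUDA execution provider and fall back to CPU if it is unavailable.
sess = ort.InferenceSession(
    "encoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())
```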
For RNNT, it's not possible to get a single ONNX file. The RNNT encoder needs to run only once for a given sample; the decoder and joint need to run autoregressively, multiple times, to produce tokens.
Oh okay, so to use the ONNX models of the encoder and decoder for inference, how do we figure out the number of times the decoder should run? Also, exporting NeMo to ONNX generates weight and bias files as well; how are they to be used with onnxruntime?
A NeMo model, when exported, should generate two ONNX files, not weights (PT) files. As for how many times to run: it's dynamic and takes logic to figure out the stopping condition. So you'll need to use the code in the export script to find out what token to stop at.
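To make the stopping condition concrete, here is a rough sketch of greedy RNNT decoding. `run_decoder_joint` is a hypothetical wrapper around the exported decoder/joint ONNX session(s), and the blank token id is an assumption; both depend on your particular export.

```python
import numpy as np

BLANK_ID = 1024            # assumption: blank is the last id (vocab size); check your export
MAX_SYMBOLS_PER_STEP = 5   # safety cap so one frame cannot emit tokens forever

def greedy_rnnt_decode(encoder_out, run_decoder_joint):
    """encoder_out: (T, D) encoder frames for one utterance (encoder runs only once)."""
    hypothesis = []
    state = None            # decoder (prediction network) hidden state
    last_token = BLANK_ID   # start-of-sequence behaves like blank
    for t in range(encoder_out.shape[0]):
        emitted = 0
        while emitted < MAX_SYMBOLS_PER_STEP:
            # Hypothetical call into the decoder/joint ONNX session(s).
            logits, new_state = run_decoder_joint(encoder_out[t], last_token, state)
            token = int(np.argmax(logits))
            if token == BLANK_ID:
                break       # blank -> advance to the next encoder frame
            hypothesis.append(token)
            last_token, state = token, new_state
            emitted += 1
    return hypothesis
```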
RNNT models aren't exported with that script. You should use the one in the examples ASR export.
Actually, this script should also work. @borisfom any idea what's up here? Either ONNX or transducer should work, but not hybrid models; we have to figure out how to export hybrid models properly.
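For the programmatic route, NeMo models also expose an `export()` method. A minimal sketch, assuming `nemo_toolkit[asr]` is installed; for an RNNT model this is expected to write separate encoder and decoder/joint ONNX files rather than a single file:

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_xlarge")
model.eval()
# For RNNT, expect multiple output files (encoder plus decoder/joint), not one "model.onnx".
model.export("model.onnx")
```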
Thank you for your help @titu1994! I have another question: can I fine-tune an existing model to recognize a set of specific words/terms? If yes, how much data is needed? I want the model to recognize some terms related to healthcare/medicine that are not used as frequently as others.
You can do that with very low LR fine-tuning, or with adapters. See the adapter tutorial in the ASR tutorials section.
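As a very rough illustration of the low-LR route (not the adapter route): the config keys and manifest path below are assumptions, and the tutorials mentioned above remain the authoritative reference.

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_xlarge")

model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # hypothetical manifest with your domain terms
    "sample_rate": 16000,
    "batch_size": 8,
    "shuffle": True,
}))
# Very low learning rate so the pretrained weights are only nudged, not overwritten.
model.setup_optimization(OmegaConf.create({"name": "adamw", "lr": 1e-5}))

trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=10)
model.set_trainer(trainer)
trainer.fit(model)
```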
I will check it out. Thank you!
Hello @titu1994, does NeMo have a C++ API or something similar? I want to deploy NeMo models in a C++-based production environment for inference. I have tried converting a NeMo model into ONNX and then using it for inference, but in this case I had to convert the preprocessing/postprocessing classes/code stubs from the NeMo source code into C++, and it wasn't very efficient.
Actually, the NeMo preprocessor can be exported to ONNX via the torchaudio backend. It requires a few extra steps, but it should be supported quite well, with 1:1 input-output correspondence. It was contributed by a user here: #5512. Seems we never documented this in the NeMo docs for some reason; I will fix that soon. Since the preprocessor is exportable, you can then simply run the full pipeline in C++.
Ah, it's not ONNX but a TorchScript export. Still, there is a C++ API for that, so it should be okay, I think?
You mean converting the preprocessor from TorchScript to ONNX and then using it? Or is it something else?
Okay, this makes sense, I'll check it out. Thank you!
I meant using the TorchScript C++ backend for the preprocessor and then the ONNX or TorchScript backend for the model.
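Putting that together, a sketch of the intended pipeline in Python (the TorchScript C++ API mirrors the same calls), assuming the preprocessor has already been exported to TorchScript as `preprocessor.ts` and takes `(signal, length)`; the ONNX input names are likewise assumptions:

```python
import torch
import onnxruntime as ort

waveform = torch.randn(1, 16000)                      # placeholder: 1 s of 16 kHz mono audio
length = torch.tensor([waveform.shape[1]], dtype=torch.int64)

preprocessor = torch.jit.load("preprocessor.ts")      # TorchScript preprocessor export
features, feat_len = preprocessor(waveform, length)   # mel features + feature lengths (assumed signature)

encoder = ort.InferenceSession("encoder.onnx", providers=["CPUExecutionProvider"])
enc_out = encoder.run(None, {
    "audio_signal": features.numpy(),
    "length": feat_len.numpy(),
})
```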
I used the TorchScript C++ backend for the preprocessor and the ONNX backend for inference. The preprocessor outputs are signal data and sample rate. In 50% of cases, the signal data is all 'NaN', but the preprocessor runs perfectly when used from Python. I am loading audio using torchaudio in Python and using AudioFile (https://github.com/adamstark/AudioFile) in C++. I tried using torchaudio::sox::load_audio_file (https://github.com/pytorch/audio/blob/main/torchaudio/csrc/sox/io.cpp), but this gives 0's in the preprocessor output (when reading the signal data as a 2D vector, alternate rows have all 0 elements). Is this error related to the preprocessor?
Hmm, I don't know about that. From the tests in that PR, the preprocessor gives the exact same output as the torchaudio processor.
Okay, I'll figure something out. Thank you!
I found the issue: the output of torchaudio.load / torchaudio::sox::load_audio_file is (signal data, sample rate). I had interpreted it as (signal data, signal length). It works properly now. Thank you.
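For anyone hitting the same thing, the fix amounts to treating the second return value as the sample rate and computing the length from the waveform itself:

```python
import torch
import torchaudio

# torchaudio.load returns (waveform, sample_rate); the length fed to the
# preprocessor must be the number of samples, not the sample rate.
waveform, sample_rate = torchaudio.load("sample.wav")           # waveform: (channels, samples)
length = torch.tensor([waveform.shape[1]], dtype=torch.int64)   # e.g. 16000 for 1 s at 16 kHz
```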
@divyansh2681 Can you please suggest which files you needed to convert to C++? If you have any steps, that would greatly help me. Thank you.
Hey @nabil6391, NeMo has some Python scripts to extract the preprocessor and vocabulary files for the model. Once you have the preprocessor and postprocessor, you can use the ONNX Runtime C++ library to run inference.
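One possible way to dump the vocabulary from Python for use on the C++ side; the `model.tokenizer.tokenizer` attribute chain (an underlying SentencePiece processor) is an assumption that applies to BPE-based models only, and character-based models expose their vocabulary differently:

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_xlarge")
sp = model.tokenizer.tokenizer   # assumed: sentencepiece.SentencePieceProcessor

with open("vocab.txt", "w", encoding="utf-8") as f:
    for i in range(sp.get_piece_size()):
        f.write(sp.id_to_piece(i) + "\n")
```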