ONNX model takes too long to run #7129

Closed
divyansh2681 opened this issue Jul 28, 2023 · 26 comments

Comments

@divyansh2681

I have exported a NeMo ASR model to ONNX. When I run the ONNX model for inference, it takes much more time compared to the original .nemo model. I am running both of them on the same machine in the same conda environment.

@titu1994
Collaborator

What model are you trying? If it's RNNT, then ONNX is indeed a bit slower.

@divyansh2681
Author

divyansh2681 commented Jul 31, 2023

I am using stt_en_fastconformer_transducer_xlarge. Yes, it's RNNT.

@divyansh2681
Author

> What model are you trying? If it's RNNT, then ONNX is indeed a bit slower.

Also, does the model performance change after exporting to ONNX?

@titu1994
Collaborator

ONNX inference does not have bound inputs on the GPU, so there is a lot of CPU-GPU transfer, which slows it down. The exported model itself is not slower by any significant measure.

PyTorch/TorchScript can keep all tensors on the GPU, so it appears faster.
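
For anyone hitting this: ONNX Runtime's IOBinding API lets you keep inputs and outputs on the GPU between calls, which avoids most of that transfer. A minimal sketch, assuming the exported encoder file and the tensor names below (check session.get_inputs() / get_outputs() for the real ones):

```python
import numpy as np
import onnxruntime as ort

# Keep I/O on the GPU with IOBinding so tensors don't bounce between host and
# device on every call. File and tensor names here are placeholders.
sess = ort.InferenceSession("encoder-model.onnx", providers=["CUDAExecutionProvider"])

features = np.random.randn(1, 80, 1600).astype(np.float32)  # dummy mel features
length = np.array([1600], dtype=np.int64)

binding = sess.io_binding()
# Copy the inputs to the GPU once, up front.
binding.bind_ortvalue_input("audio_signal", ort.OrtValue.ortvalue_from_numpy(features, "cuda", 0))
binding.bind_ortvalue_input("length", ort.OrtValue.ortvalue_from_numpy(length, "cuda", 0))
# Let ORT allocate every output on the GPU as well.
for out in sess.get_outputs():
    binding.bind_output(out.name, "cuda")

sess.run_with_iobinding(binding)
encoded = binding.get_outputs()[0].numpy()  # copy back to host only when needed
```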

@divyansh2681
Author

Okay, that makes sense. Also, I found out that using the TensorRT provider was making inference slower, so I switched to the CUDA provider. When I export a large model from NeMo to ONNX, I get separate files for the encoder, decoder, and weights. Is there any way to get a single file? I asked this question on another issue (#6759) as well, but I couldn't figure out how to resolve it.
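
For reference, the provider switch is just the providers argument when building the session; a minimal sketch (the file name is a placeholder):

```python
import onnxruntime as ort

# Use the CUDA execution provider explicitly, with a CPU fallback,
# instead of the TensorRT provider. "encoder-model.onnx" is a placeholder.
sess = ort.InferenceSession(
    "encoder-model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # confirm which providers are actually active
```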

@titu1994
Collaborator

For RNNT, it's not possible to get a single ONNX file. The RNNT encoder needs to run only once for a given sample, while the decoder and joint network need to run autoregressively multiple times to produce tokens.

@divyansh2681
Author

Oh okay, so to use the ONNX models of the encoder and decoder for inference, how do we figure out the number of times the decoder should run? Also, exporting NeMo to ONNX generates weight and bias files as well; how are they to be used with onnxruntime?

@titu1994
Collaborator

A NeMo model, when exported, should generate two ONNX files, not weight (PT) files. As for how many times to run the decoder: it's dynamic, and it takes some logic to figure out the stopping condition. You'll need to look at the code in the export script to find out which token to stop at.
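
To make the stopping condition concrete, here is a rough sketch of the greedy RNNT loop over the two exported graphs. It is not the NeMo script itself; the file names, tensor names, blank index, output ordering, and state shapes are assumptions, so read them off the actual ONNX graphs and the transducer export example before relying on this:

```python
import numpy as np
import onnxruntime as ort

enc_sess = ort.InferenceSession("encoder-model.onnx")
dec_sess = ort.InferenceSession("decoder_joint-model.onnx")

BLANK_ID = 1024            # assumption: blank is the last token in the vocab
MAX_SYMBOLS_PER_FRAME = 5  # safety cap on the inner autoregressive loop

def greedy_decode(features: np.ndarray, feature_len: np.ndarray) -> list[int]:
    # Encoder runs exactly once per sample.
    encoded, encoded_len = enc_sess.run(
        None, {"audio_signal": features, "length": feature_len}
    )
    # Initial prediction-network state; shape is model dependent (assumed here).
    state1 = np.zeros((1, 1, 640), dtype=np.float32)
    state2 = np.zeros((1, 1, 640), dtype=np.float32)
    last_token = np.array([[BLANK_ID]], dtype=np.int32)
    hypothesis: list[int] = []

    for t in range(int(encoded_len[0])):          # loop over encoder frames
        frame = encoded[:, :, t : t + 1]
        for _ in range(MAX_SYMBOLS_PER_FRAME):    # decoder + joint run repeatedly
            # Output count/order below is an assumption; inspect the graph.
            logits, new_s1, new_s2 = dec_sess.run(
                None,
                {
                    "encoder_outputs": frame,
                    "targets": last_token,
                    "target_length": np.array([1], dtype=np.int32),
                    "input_states_1": state1,
                    "input_states_2": state2,
                },
            )
            token = int(logits.argmax())
            if token == BLANK_ID:                 # blank token = stop, move to next frame
                break
            hypothesis.append(token)
            last_token = np.array([[token]], dtype=np.int32)
            state1, state2 = new_s1, new_s2       # advance state only on non-blank
    return hypothesis
```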

@divyansh2681
Author

I used the export script on the NeMo model. I got around 370 files in total; the encoder and decoder are two of them. I have attached a picture of how my file directory looks after exporting.
[screenshots of the exported file directory]
Do you think I am doing something wrong while exporting?

@titu1994
Collaborator

RNNT models aren't exported with that script. You should use the one in examples/asr/export.

@titu1994
Collaborator

Actually, this script should also work. @borisfom, any idea what's going on here?
Still, for RNNT you should use only the following script: https://github.com/NVIDIA/NeMo/tree/main/examples/asr/export/transducer

Either script (onnx or transducer) should work, but not for hybrid models; we still have to figure out how to export hybrid models properly.
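
For completeness, the Python-side export amounts to something like the sketch below (model name taken from earlier in the thread; the exact output file names may differ, so treat them as assumptions):

```python
import nemo.collections.asr as nemo_asr

# Export an RNNT/transducer model; for RNNT this produces two ONNX graphs
# (roughly encoder-model.onnx and decoder_joint-model.onnx), never a single file.
model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_xlarge")
model.export("model.onnx")
```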

@divyansh2681
Author

Thank you for your help @titu1994!

I have another question: can I fine-tune an existing model to recognize a set of specific words/terms? If so, how much data is needed? I want the model to recognize some healthcare/medical terms that are not used as frequently as others.

@titu1994
Collaborator

titu1994 commented Aug 5, 2023

You can do that with very low LR fine-tuning, or with adapters. See the adapter tutorial in the ASR tutorials section.
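
A rough sketch of the low-LR route, assuming standard NeMo config fields; the manifest paths, epochs, and LR value are placeholders, and the adapter tutorial covers the lighter-weight alternative:

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from omegaconf import open_dict

# Low-LR fine-tuning sketch: start from the pretrained checkpoint and nudge it
# on domain audio. Manifests and hyperparameters below are placeholders.
model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_xlarge")

trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=10)
model.set_trainer(trainer)

with open_dict(model.cfg):
    model.cfg.train_ds.manifest_filepath = "medical_train_manifest.json"
    model.cfg.validation_ds.manifest_filepath = "medical_val_manifest.json"
    model.cfg.optim.lr = 1e-5  # very low LR so the pretrained weights move only slightly

model.setup_training_data(model.cfg.train_ds)
model.setup_validation_data(model.cfg.validation_ds)
model.setup_optimization(model.cfg.optim)

trainer.fit(model)
```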

@divyansh2681
Author

divyansh2681 commented Aug 7, 2023

I will check it out. Thank you!

@divyansh2681
Author

Hello @titu1994, does NeMo have a C++ API or something similar? I want to deploy NeMo models in a C++-based production environment for inference. I have tried converting a NeMo model to ONNX and then using it for inference, but in that case I had to reimplement the preprocessing/postprocessing classes/code stubs from the NeMo source code in C++, and it wasn't very efficient.

@titu1994
Collaborator

titu1994 commented Aug 23, 2023

Actually, the NeMo preprocessor can be exported to ONNX via the Torchaudio backend. It requires a few extra steps, but it should be supported quite well, with 1:1 input/output correspondence.

It was contributed by a user here: #5512

It seems we never documented this in the NeMo docs for some reason. I will fix that soon.

Once the preprocessor is exportable, you can simply run the full pipeline in C++.

@titu1994
Collaborator

Ah, it's not ONNX but TorchScript export. Still, there is a C++ API for that, so it should be OK, I think?

@divyansh2681
Author

divyansh2681 commented Aug 23, 2023

> Ah, it's not ONNX but TorchScript export. Still, there is a C++ API for that, so it should be OK, I think?

You mean converting the preprocessor from TorchScript to ONNX and then using it? Or is it something else?

@divyansh2681
Author

Okay, this makes sense, I'll check it out. Thank you!

@titu1994
Collaborator

I meant using the TorchScript C++ backend for the preprocessor, and then the ONNX or TorchScript backend for the model.
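
A sketch of the preprocessor half of that split, assuming the torchaudio-based featurizer from #5512 is in use (the default featurizer may not script cleanly); the parity check at the end is just a sanity test before moving to the C++ side:

```python
import torch
import nemo.collections.asr as nemo_asr

# Export the preprocessor to TorchScript for use from the libtorch C++ API.
# Scriptability depends on the torchaudio-based featurizer from #5512.
model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_xlarge")
model.preprocessor.eval()

scripted = torch.jit.script(model.preprocessor)
scripted.save("preprocessor.ts")

# Quick parity check against the eager module with a dummy 1-second waveform.
waveform = torch.randn(1, 16000)
length = torch.tensor([16000])
feats_ts, feat_len_ts = scripted(input_signal=waveform, length=length)
feats_py, feat_len_py = model.preprocessor(input_signal=waveform, length=length)
print(torch.allclose(feats_ts, feats_py, atol=1e-5))
```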

@divyansh2681
Author

divyansh2681 commented Aug 31, 2023

I used the TorchScript C++ backend for the preprocessor and the ONNX backend for inference. The preprocessor outputs are the signal data and the sample rate. In 50% of cases, the signal data is all NaN, but the preprocessor runs perfectly when used from Python.

I am loading audio using torchaudio in Python and using AudioFile (https://github.com/adamstark/AudioFile) in C++.

I tried using torchaudio::sox::load_audio_file (https://github.com/pytorch/audio/blob/main/torchaudio/csrc/sox/io.cpp), but this gives 0s in the preprocessor output (when reading the signal data as a 2D vector, alternate rows have all 0 elements).

Is this error related to the preprocessor?

@titu1994
Collaborator

Hmm, I don't know about that. From the tests in that PR, the preprocessor gives exactly the same output as the Torchaudio preprocessor.

@divyansh2681
Author

Okay, I'll figure something out. Thank you!

@divyansh2681
Author

I found the issue: the output of torchaudio.load / torchaudio::sox::load_audio_file is (signal data, sample rate). I had interpreted it as (signal data, signal length). It works properly now. Thank you.
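
For anyone who hits the same thing, the return convention is easy to check from Python ("sample.wav" is a placeholder path):

```python
import torch
import torchaudio

# torchaudio.load returns (waveform, sample_rate), not (waveform, length);
# the signal length has to be derived from the waveform tensor itself.
waveform, sample_rate = torchaudio.load("sample.wav")
length = torch.tensor([waveform.shape[1]])  # samples per channel, for the preprocessor
```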

titu1994 closed this as completed Sep 1, 2023
@nabil6391

@divyansh2681 Could you please suggest which files you needed to convert to C++, or share the steps you followed? That would greatly help me. Thank you.

@divyansh2681
Author

Hey @nabil6391, NeMo has some Python scripts to extract the preprocessor and vocabulary files for the model. Once you have the preprocessor and postprocessor, you can use the ONNX Runtime C++ library to run inference.
