[Performance] Python inference runs faster than C++ #22328
Labels
performance, platform:jetson, stale
Describe the issue
I'm observing that an ONNX model (FP32) executed through the C++ API runs slower than through Python, and the gap is much worse with the TensorRT execution provider.
I've also tried exporting an FP16 model with `keep_io_types` set to `True`. That isn't any better; in fact it's worse. But first I want to focus on the FP32 model in question.
To reproduce
Python script:
C++:
Both the Python and C++ versions run the same model, with inference executed 1000 times in C++ just as in Python.
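For reference, the C++ side follows roughly the pattern below (a simplified sketch only: the model path, input/output names, and input shape are placeholders, and pre/post-processing is omitted):

```cpp
// Simplified sketch of the C++ benchmark loop. "model.onnx", "images",
// "output0" and the 1x3x640x640 shape are placeholders, not the real values.
#include <onnxruntime_cxx_api.h>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "bench");
  Ort::SessionOptions opts;
  OrtCUDAProviderOptions cuda_opts{};              // device 0
  opts.AppendExecutionProvider_CUDA(cuda_opts);
  Ort::Session session(env, "model.onnx", opts);   // placeholder path

  std::vector<int64_t> shape{1, 3, 640, 640};      // placeholder shape
  std::vector<float> input(1 * 3 * 640 * 640, 0.0f);
  const char* input_names[]  = {"images"};         // placeholder names
  const char* output_names[] = {"output0"};

  Ort::MemoryInfo cpu_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < 1000; ++i) {
    // A new Ort::Value is created on every iteration, wrapping the same CPU
    // buffer, so the input is copied host -> device on each Run().
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
        cpu_info, input.data(), input.size(), shape.data(), shape.size());
    auto outputs = session.Run(Ort::RunOptions{nullptr},
                               input_names, &input_tensor, 1,
                               output_names, 1);
  }
  auto t1 = std::chrono::steady_clock::now();
  std::cout << "avg ms per run: "
            << std::chrono::duration<double, std::milli>(t1 - t0).count() / 1000.0
            << std::endl;
  return 0;
}
```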
I suspect the large input copy between CPU and GPU on every run is what's slowing things down. I also dislike that I construct a new `Ort::Value` input tensor on every single iteration. Can someone help me with I/O binding? I've searched quite extensively online; there are plenty of examples for Python, but the C++ ones I've found either don't compile, segfault, or give me 0 detections. The output node is dynamic, so its shape/size is not known in advance. Here are a few questions that I have in particular:
For some context, all of the runtimes reported here are measured on a Jetson Orin, running inside a Docker container.
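This is roughly what I've been attempting with `Ort::IoBinding`, pieced together from the examples I found (again a sketch, not working code from my project: the input/output names and shape are placeholders, and since the output shape is dynamic I bind the output to a device rather than to a pre-allocated tensor):

```cpp
// IoBinding sketch: bind the input once, let ORT allocate the dynamic-shaped
// output on the device, and reuse the same binding across all 1000 runs.
// Names, path, and shape are placeholders.
#include <onnxruntime_cxx_api.h>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "iobinding-test");
  Ort::SessionOptions opts;
  OrtCUDAProviderOptions cuda_opts{};
  opts.AppendExecutionProvider_CUDA(cuda_opts);
  Ort::Session session(env, "model.onnx", opts);   // placeholder path

  std::vector<int64_t> shape{1, 3, 640, 640};      // placeholder shape
  std::vector<float> input(1 * 3 * 640 * 640, 0.0f);

  // Input tensor wraps a CPU buffer here; for true zero-copy the buffer would
  // need to live in CUDA memory and be created against the CUDA MemoryInfo.
  Ort::MemoryInfo cpu_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
      cpu_info, input.data(), input.size(), shape.data(), shape.size());

  Ort::IoBinding binding(session);
  binding.BindInput("images", input_tensor);        // assumed input name

  // Dynamic output: bind to a device (MemoryInfo) so ORT allocates the output
  // with whatever shape the run produces.
  Ort::MemoryInfo cuda_info("Cuda", OrtDeviceAllocator, 0, OrtMemTypeDefault);
  binding.BindOutput("output0", cuda_info);         // assumed output name

  Ort::RunOptions run_opts;
  for (int i = 0; i < 1000; ++i) {
    session.Run(run_opts, binding);                 // reuse the same binding
  }
  // Outputs are allocated by ORT; copy back / inspect after the loop.
  std::vector<Ort::Value> outputs = binding.GetOutputValues();
  return 0;
}
```

Is this the intended way to handle a dynamic output node with I/O binding, or am I holding it wrong?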
Urgency
No response
Platform
Linux
OS Version
22.04
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.20.0, commit id: e91ff94
ONNX Runtime API
Python
Architecture
ARM64
Execution Provider
Default CPU, CUDA, TensorRT
Execution Provider Library Version
CUDA Version: 12.2
Model File
No response
Is this a quantized model?
No