[Performance] 1.14RC1 TensorRT Regression #14484
If possible, can you give us steps to reproduce the performance regression you're seeing?
Hi, I am sorry but I cannot provide the models, and I am executing via Rust bindings over the C API, so it is a bit difficult to share. I have been running profiling (which I can provide) to understand where the bottleneck is and have found some interesting things. As I mentioned above, I have three ONNX sessions running in parallel (overlapping) that:
Individually, each model's 50th-percentile latency is faster on 1.14.0 vs 1.13.1, yet some sort of scheduling delay is causing the end-to-end time to increase. The ~2% speedup of 1.14.0 over 1.13.1 can be measured when executing a single model.
FYI, the SQL query used for the calculation (skipping the initial 100 runs to allow for warmup):

WITH p AS (
  SELECT dur, NTILE(10) OVER (ORDER BY dur) AS percentile
  FROM slice
  WHERE name = 'model_run'
    AND id > 100
)
SELECT percentile, MAX(dur) AS dur
FROM p
GROUP BY percentile;

I will continue investigating.
Another update. I have run NVIDIA Nsight profiling on both 1.13.1 and 1.14.0. The biggest change I can see is that the Memcpy HtoD runs in parallel on 1.13.1 but sequentially on 1.14.0. Also, that Memcpy runs on a non-default stream on 1.13.1 but on the default stream in 1.14.0. I have timed the two parts that were previously running in parallel (the six red parts and the green bit) and they sum to ~2 ms, which aligns perfectly with my initial findings.
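To illustrate why the stream placement matters (a minimal CUDA sketch, not code from the application): a copy issued on the legacy default stream implicitly synchronizes with work on other streams, while a copy issued on a dedicated stream can overlap with kernels running elsewhere, provided the host buffer is pinned.

```cpp
#include <cuda_runtime.h>

// Illustrative only: the stream argument of the async copy decides whether the
// HtoD transfer can overlap with compute running on other streams.

// Roughly the 1.14.0 behaviour observed in the profile: the copy lands on the
// legacy default stream (0), which implicitly synchronizes with other streams.
void copy_on_default_stream(float* dst_dev, const float* src_host, size_t bytes) {
  cudaMemcpyAsync(dst_dev, src_host, bytes, cudaMemcpyHostToDevice, /*stream=*/0);
}

// Roughly the 1.13.1 behaviour: the copy lands on a non-default stream and can
// overlap with kernels on other streams (src_host must be pinned memory).
void copy_on_side_stream(float* dst_dev, const float* src_host, size_t bytes,
                         cudaStream_t copy_stream) {
  cudaMemcpyAsync(dst_dev, src_host, bytes, cudaMemcpyHostToDevice, copy_stream);
}
```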
I wanted to confirm all the variables here and ask some follow-up questions. I assume the application code used to run both ORT versions is exactly the same? (No change in API usage?)
Hi. I will push my updated Rust bindings soon and can link them here. I am executing this per frame of a video, so the parallelism seen must be for the next frame. For clarity, I have three independent sessions running in parallel in their own threads, passing the OrtValue of the IoBinding to the next model (roughly as sketched below). We are doing nothing to explicitly set up any CUDA streams. It looks to me like the CUDA option do_copy_in_default_stream is effectively active now? I guess I can try to reproduce a minimal example, but it is quite hard.
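Roughly, the per-frame flow looks like the following. This is a sketch against the ONNX Runtime C++ API rather than my actual Rust bindings; the session and tensor names are placeholders, and only two of the three sessions are shown for brevity.

```cpp
#include <onnxruntime_cxx_api.h>
#include <vector>

// Sketch: model_a's output stays on the GPU and is bound directly as model_b's
// input via IoBinding, so no host round trip happens between sessions.
void run_frame(Ort::Session& model_a, Ort::Session& model_b, Ort::Value& frame_input) {
  Ort::MemoryInfo cuda_info("Cuda", OrtDeviceAllocator, /*device_id=*/0, OrtMemTypeDefault);

  Ort::IoBinding binding_a(model_a);
  binding_a.BindInput("input", frame_input);   // placeholder tensor name
  binding_a.BindOutput("output", cuda_info);   // keep the result on the GPU
  model_a.Run(Ort::RunOptions{nullptr}, binding_a);

  std::vector<Ort::Value> a_outputs = binding_a.GetOutputValues();

  Ort::IoBinding binding_b(model_b);
  binding_b.BindInput("input", a_outputs[0]);  // GPU-resident OrtValue from model A
  binding_b.BindOutput("output", cuda_info);
  model_b.Run(Ort::RunOptions{nullptr}, binding_b);
}
```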
I don't believe there are any changes in the TensorRT Execution Provider between 1.13 and 1.14 that could have resulted in this behavior.
Hi. I have built both f4cd35f and a81faee, which are the commits directly before and after #13495. I think these results conclusively identify #13495 as the cause of the performance regression. BEFORE #13495:
AFTER #13495:
For clarity these are the versions involved:
Thanks a lot for running these experiments and helping narrow it down!
+@souptc FYI
Hi @seddonm1, could you please share how you copied the input to the GPU? Based on my understanding, in onnxruntime we never put the input memory copy on a separate stream. I am not sure if there is an advanced feature that we missed during the refactoring.
From the profiling you shared, it seems that in ORT 1.13 the MemCpy HtoD is put on a stream with other ORT kernels. I'd like to confirm the following things:
Hi @souptc.
I have exported more detailed views of the Nsight profile showing the individual threads.

For the 1.13.1 (f4cd35f) build, the streams are:
For the 1.14.0 (a81faee) build, the streams are:
I guess the big questions are:
The model execution in step 2 (memcpy HtoD + a small kernel) is not on the default stream; you can see in your ORT 1.14 profiling that the small kernel is executed on stream 16. It is just the memcpy that has been put on the default stream. From the implementation, we only put the graph input copy on the default stream for two cases: but I don't see how your case could fall into either of those two categories. We will try to reproduce what happened with IoBinding, but could you share what you do in that small kernel?
Thanks @souptc - you are better at interpreting these Nsight reports than I am. The operation that you are referring to on stream 16 after the memcpy is a Scale operation. Step 2 does three things:
Thanks @seddonm1. Could you help me with two more experiments:
Yes, you can see the issue happens with just this one model. I am using the C API via Rust bindings but not doing anything exciting. I can post my bindings tomorrow, but I am creating an OrtEnvironment with global inter/intra-op thread counts of 16, then an OrtSession with execution_mode parallel, graph_optimization all, mem_pattern true, and disable_per_session_threads (roughly as sketched below). Do you have an email address I can send the model to? I will need to verify with my employer before I can send it. In the meantime I can easily add some
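For reference, the configuration above corresponds roughly to the following, sketched with the ONNX Runtime C++ API (my actual code goes through Rust bindings over the C API, so treat this as illustrative):

```cpp
#include <onnxruntime_cxx_api.h>

// Environment with global (shared) thread pools, 16 inter- and intra-op threads.
Ort::Env make_env() {
  Ort::ThreadingOptions tp;
  tp.SetGlobalIntraOpNumThreads(16);
  tp.SetGlobalInterOpNumThreads(16);
  return Ort::Env(tp, ORT_LOGGING_LEVEL_WARNING, "app");
}

// Session with parallel execution, full graph optimization, memory pattern
// enabled, per-session threads disabled, and the TensorRT execution provider.
Ort::Session make_session(Ort::Env& env, const char* model_path) {
  Ort::SessionOptions so;
  so.SetExecutionMode(ExecutionMode::ORT_PARALLEL);
  so.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
  so.EnableMemPattern();
  so.DisablePerSessionThreads();  // use the env's global thread pools

  OrtTensorRTProviderOptions trt_options{};  // defaults; device_id 0
  so.AppendExecutionProvider_TensorRT(trt_options);

  return Ort::Session(env, model_path, so);
}
```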
Sure, could you mail it to [email protected]? Not only the model: I am more interested in how you configure the session and the IoBinding, since there is probably some usage I previously missed. Do you mind sharing that script as well? If you want to debug it locally, I would suggest checking here what stream and "stream->handle" you get underneath; if it is the default stream, you will get a NULL here.
Hi @souptc, I have sent the ONNX model to your email address and aim to do some local debugging today.
Any updates? We have the same problem.
Hi,
My results show that the #14719 PR does help reduce the regression, but the runtime is still slower than the baseline. Are these results in line with your testing, @noujaimc?
@seddonm1 Thanks for the testing result. Is the memcpy op still launching on the default stream?
Are you testing on the small model shared with us or the original model? @seddonm1
### Description
Create a new stream for the data copy in the IOBinding input scenario.

### Motivation and Context
Previously in bindInput(), a nullptr Stream was passed when copying data across devices. This caused the default stream to be used, which hurt performance. This PR fixes #14484.

Co-authored-by: Lei Cao <[email protected]>
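Conceptually (a simplified sketch, not the actual onnxruntime source; names and structure are hypothetical), the change amounts to issuing the IOBinding input copy on a dedicated stream instead of falling back to the legacy default stream, and ordering it against the compute stream with an event:

```cpp
#include <cuda_runtime.h>

// Simplified illustration of the idea behind the fix; does not mirror the
// actual ORT implementation.
void bind_input_copy(void* dst_dev, const void* src_host, size_t bytes,
                     cudaStream_t compute_stream) {
  cudaStream_t copy_stream;
  cudaStreamCreateWithFlags(&copy_stream, cudaStreamNonBlocking);

  // The HtoD copy no longer lands on the default stream, so it can overlap
  // with kernels that are already running on other streams.
  cudaMemcpyAsync(dst_dev, src_host, bytes, cudaMemcpyHostToDevice, copy_stream);

  // Make the compute stream wait for the copy without blocking the host and
  // without serializing against the default stream.
  cudaEvent_t copied;
  cudaEventCreateWithFlags(&copied, cudaEventDisableTiming);
  cudaEventRecord(copied, copy_stream);
  cudaStreamWaitEvent(compute_stream, copied, 0);

  // Safe to release immediately; CUDA defers destruction until work completes.
  cudaEventDestroy(copied);
  cudaStreamDestroy(copy_stream);
}
```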
@jslhcl Here is an updated profile that shows the memcpy HtoD is no longer on the default stream (#14719):
Thanks for the result. I didn't close the issue manually; it was automatically closed by the PR. Let me talk to Cheng tomorrow about this issue.
@seddonm1, do you mind sharing the nsys profiling reports for both the feature branch and the main branch? Thank you very much!
@jslhcl I have sent the profiles for 1.13.1, 1.14.0, and 1.14.x to you via email. If anyone reading this is able to provide a reproducible example using public models, that would really help.
It should be fixed in the latest main branch code: the created stream will now work concurrently with the default stream without any explicit synchronization.
Hi, thanks everyone 👍
Describe the issue
I have been testing 1.14.0RC1 and am seeing quite a significant performance regression vs 1.13.1 using the C API. You can see that GPU utilization is lower (both wattage and volatile RAM), suggesting some bottlenecking has been introduced.
To reproduce
This process is doing three things in parallel:
Urgency
No response
Platform
Linux
OS Version
Ubuntu 20.04.5LTS
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
rel-1.14.0
ONNX Runtime API
C
Architecture
X64
Execution Provider
TensorRT
Execution Provider Library Version
CUDA 12.0
Model File
No response
Is this a quantized model?
No