
[Performance] InferenceSession with enable_profiling runs much slower #20238

Closed
MalekItani opened this issue Apr 9, 2024 · 1 comment

MalekItani commented Apr 9, 2024

Describe the issue

I have a network that runs fast enough for CPU deployment.
I would like to get reasonably accurate measurements of the runtime of the individual kernel calls within the network. However, when I enable profiling in the session options, the runtime of the entire network increases significantly.

For example, if I run the code below with the provided model without profiling, the network runs in about 810 µs on average. With profiling enabled, the average runtime of the overall network increases to about 2330 µs. The exact numbers differ between runs and platforms, but the increase is always there.

The increase also shows up when I look only at the sum of the durations of the individual kernel calls in the output JSON file. Profiling should obviously add some overhead, but I wouldn't expect it to be this drastic, and it makes me hesitant to draw any conclusions from the traced output. Is there a better way to measure the runtime of individual kernel calls without introducing such a significant delay?

To reproduce

import onnxruntime as ort
import time
import numpy as np

RUNS = 100

# Without profiling
sess_options = ort.SessionOptions()
ort_sess = ort.InferenceSession('my_model.onnx', providers=['CPUExecutionProvider'], sess_options=sess_options)

inputs = {}

# Build zero-filled inputs matching the model's declared input shapes
# (this assumes all input dimensions are static).
metadata = ort_sess.get_inputs()
for node in metadata:
    inputs[node.name] = np.zeros(node.shape).astype(np.float32)

# Warmup
ort_sess.run(None, inputs)

t1 = time.time()
for i in range(RUNS):
    output_list = ort_sess.run(None, inputs)
t2 = time.time()

t_no_prof = (t2 - t1) / RUNS

# With profiling
sess_options.enable_profiling = True
ort_sess = ort.InferenceSession('my_model.onnx', providers=['CPUExecutionProvider'], sess_options=sess_options)

# Warmup
ort_sess.run(None, inputs)

t1 = time.time()
for i in range(RUNS):
    output_list = ort_sess.run(None, inputs)
t2 = time.time()

t_prof = (t2 - t1) / RUNS


print(f"Time without profiling: {t_no_prof:.05f}s")
print(f"Time with profiling: {t_prof:.05f}s")

Urgency

No response

Platform

Linux

OS Version

6.5.0-15-generic #15~22.04.1-Ubuntu

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.2

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

my_model.onnx.zip

Is this a quantized model?

No

@skottmckay
Contributor

This is completely expected. Measuring individual kernel latency and overall latency in the same run isn't meaningful: the overhead of the former will always invalidate the latter.

Measure each separately.
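In practice that means something like the sketch below (a rough outline only, reusing the model path from the report and assuming static input shapes): time end-to-end latency on a session created without profiling, and use a second, profiling-enabled session only for the relative per-kernel breakdown, treating its absolute numbers as inflated.

import time
import numpy as np
import onnxruntime as ort

RUNS = 100

def make_session(profile):
    opts = ort.SessionOptions()
    opts.enable_profiling = profile
    return ort.InferenceSession('my_model.onnx',
                                providers=['CPUExecutionProvider'],
                                sess_options=opts)

# 1) End-to-end latency: no profiling, so the measurement is undisturbed.
latency_sess = make_session(profile=False)
inputs = {n.name: np.zeros(n.shape, dtype=np.float32)
          for n in latency_sess.get_inputs()}
latency_sess.run(None, inputs)  # warmup
t0 = time.time()
for _ in range(RUNS):
    latency_sess.run(None, inputs)
print(f"End-to-end: {(time.time() - t0) / RUNS * 1e6:.0f} us per run")

# 2) Per-kernel breakdown: a separate profiling session, used only to see
#    which ops dominate, not to reconstruct the end-to-end time.
profile_sess = make_session(profile=True)
for _ in range(RUNS):
    profile_sess.run(None, inputs)
print("Trace written to:", profile_sess.end_profiling())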
