
performance is poor when onnxruntime C++ run in intel cpu #12489

Open
allen20200111 opened this issue Aug 5, 2022 · 10 comments
Labels
core runtime issues related to core runtime

Comments

@allen20200111

I have two ONNX Runtime sessions running on an Intel CPU:
(1) At first, the total time is 200 ms.
(2) After many test iterations, it slows down to 10 s.
(3) After the process sits idle for several minutes, it is back to 200 ms.

Why does it change so much? Thanks!

Things I have already tried:
(1) the multithreading options
(2) session_options.AddConfigEntry("session.set_denormal_as_zero", "1");

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): centos 7
  • ONNX Runtime installed from (source or binary): binary
  • ONNX Runtime version: C++ 12.0
  • Python version: --
  • Visual Studio version (if applicable): ---
  • GCC/Compiler version (if compiling from source): --
  • CUDA/cuDNN version: --
  • GPU model and memory: ----

To Reproduce

(1) At first, the total time is 200 ms.
(2) After many test iterations, it slows down to 10 s.
(3) After the process sits idle for several minutes, it is back to 200 ms.

Expected behavior
The first and later runs should cost roughly the same amount of time.


@yuslepukhin
Member

An ONNX Runtime session will never match steady-state performance on its first cold run. You always need a couple of warmup runs after the session is first created.

After you stop the activity, CPU caches grow cold, but they recover quickly. Do you have a real-time scenario where incoming requests depend on user activity? We have work to do in this area, but ONNX Runtime was originally optimized for continuous processing, so no suggestion will fully deliver the desired results at this time.

A few things to try out depending on your model.

  • Since you are running on CPU, disable the memory arena; it does not help in CPU scenarios:
  Ort::SessionOptions sessionOptions;
  sessionOptions.DisableCpuMemArena();
  • Experiment with the number of intra-op threads in the session options and see what gives you the best performance, using sessionOptions.SetIntraOpNumThreads(options.IntraThreadCount);
  • Try overriding the default allocator with mimalloc. You can use LD_PRELOAD for a quick try.
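Taken together, the suggestions above might look like the following configuration sketch (the intra-op thread count of 4 is a placeholder to tune per machine and model, and the mimalloc preload happens outside the program):

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch only: combines the suggestions above for a CPU-only scenario.
Ort::SessionOptions MakeCpuSessionOptions() {
    Ort::SessionOptions opts;
    opts.DisableCpuMemArena();     // the arena does not help pure-CPU scenarios
    opts.SetIntraOpNumThreads(4);  // placeholder; tune per machine/model
    return opts;
}

// For mimalloc, no code change is needed for a quick test:
//   LD_PRELOAD=/path/to/libmimalloc.so ./your_app
```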

@allen20200111
Author

1. sessionOptions.DisableCpuMemArena();
2. sessionOptions.SetIntraOpNumThreads(options.IntraThreadCount);
3. LD_PRELOAD with mimalloc

I have tried all three of these; unfortunately, performance is the same as before.

@skottmckay
Contributor

Please provide the full code to reproduce and show how you are measuring performance. As you say you have two onnxruntime sessions it's not clear how/when you are creating those sessions.

@allen20200111
Author

allen20200111 commented Aug 10, 2022

In the initial iterations the time cost is small, but later it becomes very large. For example: for the first several loops (each loop runs step1 and step2) in the main function, total3 is about 200 ms, but after 10-20 s of looping a single loop costs 10 s or even more. Most of the time turns out to be spent in total1 or total2.

#include <onnxruntime/core/session/experimental_onnxruntime_cxx_api.h>

thread_pool1 = std::make_unique<ThreadPool>(1);
thread_pool2 = std::make_unique<ThreadPool>(1);

session_options.AddConfigEntry("session.set_denormal_as_zero", "1");
session_options.DisableCpuMemArena();
session_options.SetIntraOpNumThreads(4);
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

session1 = new Ort::Experimental::Session(env, model_path1, session_options);
session2 = new Ort::Experimental::Session(env, model_path2, session_options);

std::vector<float> step1(const Image &image) {
    auto task = [&, this] {
        Timer t1;
        auto ort_outputs = session1->Run(session1->GetInputNames(), input, output_names);
        Timer t2;
        total1 = t2 - t1;
        cout << "total1 use time: " << total1;
    };
    auto result = thread_pool1->enqueue(task);
    return result.get();
}

std::vector<float> step2(const std::vector<float> &image) {
    auto task = [&, this] {
        Timer t1;
        auto ort_outputs = session2->Run(session2->GetInputNames(), input, output_names);
        Timer t2;
        total2 = t2 - t1;
        cout << "total2 use time: " << total2;
    };
    auto result = thread_pool2->enqueue(task);
    return result.get();
}

int main() {
    for (const auto &image : images) {
        Timer t1;
        auto image1 = step1(image);
        auto ret1 = step2(image1);
        Timer t2;
        total3 = t2 - t1;
        cout << "total3 use time: " << total3;
    }
}
@skottmckay

@skottmckay
Contributor

It would be best to measure the ORT performance separately with no thread pools, and without the inline call to GetInputNames(). That way you're just measuring the cost of the Run and not all the other things going on.

Send one warmup query to each inference session, and measure performance for the following calls.

Also not clear what Timer is. Is that a high resolution timer or not? https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now would be preferable.

@allen20200111
Author

Over a long run, memory and CPU usage do not change much; they are basically the same as before.

@allen20200111
Author

Yes, I used high_resolution_clock::now().

@sophies927 sophies927 added core runtime issues related to core runtime and removed type:performance labels Aug 12, 2022
@allen20200111
Author

I measured the ORT performance separately with no thread pools; the time is the same as before. @skottmckay

@allen20200111
Author

allen20200111 commented Nov 14, 2022

> It would be best to measure the ORT performance separately with no thread pools, and without the inline call to GetInputNames(). That way you're just measuring the cost of the Run and not all the other things going on.
>
> Send one warmup query to each inference session, and measure performance for the following calls.
>
> Also not clear what Timer is. Is that a high resolution timer or not? https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now would be preferable.

reply:
Over a long run, memory and CPU usage do not change much; they are basically the same as before.
I used high_resolution_clock::now().
I measured the ORT performance separately with no thread pools; the time is the same as before.

Can you give me more suggestions? Thank you!

@allen20200111
Author

by gdb:
0x00007efd417a01a9 in onnxruntime::concurrency::ThreadPool::RunInParallel(std::function<void (unsigned int)>, unsigned int) () from /usr/local/lib64/libonnxruntime.so.1.6.0
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 libuuid-2.23.2-65.el7.x86_64
(gdb) bt
#0 0x00007efd417a01a9 in onnxruntime::concurrency::ThreadPool::RunInParallel(std::function<void (unsigned int)>, unsigned int) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#1 0x00007efd417a05ce in onnxruntime::concurrency::ThreadPool::ParallelForFixedBlockSizeScheduling(long, long, std::function<void (long, long)> const&) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#2 0x00007efd417a06a5 in onnxruntime::concurrency::ThreadPool::SimpleParallelFor(long, std::function<void (long)> const&) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#3 0x00007efd417ef558 in MlasExecuteThreaded(void (*)(void*, int), void*, int, onnxruntime::concurrency::ThreadPool*) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#4 0x00007efd417b98fc in MlasNchwcConv(long const*, long const*, long const*, long const*, long const*, long const*, unsigned long, float const*, float const*, float const*, float*, MLAS_ACTIVATION const*, bool, onnxruntime::concurrency::ThreadPool*) () from /usr/local/lib64/libonnxruntime.so.1.6.0
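The frames above show the intra-op thread pool (ThreadPool::RunInParallel) parallelizing an NCHWc convolution (MlasNchwcConv), so the wait is happening inside ORT's own worker threads; combined with the app's two single-thread pools and two sessions, thread oversubscription or busy-spinning workers could contribute. Newer ONNX Runtime releases expose a config key to disable intra-op spinning; a hedged sketch (the key is "session.intra_op.allow_spinning"; note the backtrace shows libonnxruntime.so.1.6.0, and this option may not exist in that version):

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch: disable intra-op thread spinning (available in newer ORT releases;
// may not be supported by 1.6.0, which the backtrace shows is in use).
Ort::SessionOptions opts;
opts.AddConfigEntry("session.intra_op.allow_spinning", "0");
```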
