
performance is poor when onnxruntime C++ run in intel cpu #12489

Open
allen20200111 opened this issue Aug 5, 2022 · 10 comments
Labels
core runtime issues related to core runtime

Comments

@allen20200111

I have two ONNX Runtime sessions running on an Intel CPU:
(1) At first, the total time is 200 ms.
(2) After many test iterations, it slows down to 10 s.
(3) After the process sits idle for several minutes, it is back to 200 ms.

Why does it change so much? Thanks!

Things I have already tried:
(1) the multithreading options
(2) session_options.AddConfigEntry("session.set_denormal_as_zero", "1");

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): centos 7
  • ONNX Runtime installed from (source or binary): binary
  • ONNX Runtime version: C++ 12.0
  • Python version: --
  • Visual Studio version (if applicable): ---
  • GCC/Compiler version (if compiling from source): --
  • CUDA/cuDNN version: --
  • GPU model and memory: ----

To Reproduce

(1) At first, the total time is 200 ms.
(2) After many test iterations, it slows down to 10 s.
(3) After the process sits idle for several minutes, it is back to 200 ms.

Expected behavior
The first and later runs should cost roughly the same amount of time.


@yuslepukhin
Member

An ONNX Runtime session will never match steady-state performance on its first cold run. You always need a couple of warmup runs after the session is first created.

After you stop the activity, CPU caches grow cold, but they recover quickly. Do you have a real-time scenario where incoming requests depend on user activity? We have work to do in this area, but ONNX Runtime was originally optimized for continuous processing, so no suggestion will fully deliver the desired results at this time.

A few things to try out depending on your model.

  • Since you are running on CPU, disable the memory arena; it does not help in CPU scenarios:
  Ort::SessionOptions sessionOptions;
  sessionOptions.DisableCpuMemArena();
  • Experiment with the number of intra-op threads in the session options and see what gives you the best performance, using sessionOptions.SetIntraOpNumThreads(options.IntraThreadCount);
  • Try overriding the default allocator with mimalloc. You can use LD_PRELOAD for a quick try.
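Taken together, the suggestions above might look like the following configuration sketch (the intra-op thread count of 4 is a placeholder to tune per machine and model, and the mimalloc preload happens outside the program):

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch only: combines the suggestions above for a CPU-only scenario.
Ort::SessionOptions MakeCpuSessionOptions() {
    Ort::SessionOptions opts;
    opts.DisableCpuMemArena();     // the arena does not help pure-CPU scenarios
    opts.SetIntraOpNumThreads(4);  // placeholder; tune per machine/model
    return opts;
}

// For mimalloc, no code change is needed for a quick test:
//   LD_PRELOAD=/path/to/libmimalloc.so ./your_app
```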

@allen20200111
Author

1. sessionOptions.DisableCpuMemArena();
2. sessionOptions.SetIntraOpNumThreads(options.IntraThreadCount);
3. LD_PRELOAD with mimalloc

I have tried all three of these; unfortunately, performance is the same as before.

@skottmckay
Contributor

Please provide the full code to reproduce and show how you are measuring performance. As you say you have two onnxruntime sessions it's not clear how/when you are creating those sessions.

@allen20200111
Author

allen20200111 commented Aug 10, 2022

In the initial iterations the time cost is small, but later it becomes very large. For example: for the first several loops (each loop runs step1 and step2) in the main function, total3 is about 200 ms, but after 10-20 s of looping a single loop costs 10 s or even more. Most of the time turns out to be spent in total1 or total2.

#include <onnxruntime/core/session/experimental_onnxruntime_cxx_api.h>

thread_pool1 = std::make_unique<ThreadPool>(1);
thread_pool2 = std::make_unique<ThreadPool>(1);

session_options.AddConfigEntry("session.set_denormal_as_zero", "1");
session_options.DisableCpuMemArena();
session_options.SetIntraOpNumThreads(4);
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

session1 = new Ort::Experimental::Session(env, model_path1, session_options);
session2 = new Ort::Experimental::Session(env, model_path2, session_options);

std::vector<float> step1(const Image &image) {
    auto task = [&, this] {
        Timer t1;
        auto ort_outputs = session1->Run(session1->GetInputNames(), input, output_names);
        Timer t2;
        total1 = t2 - t1;
        cout << "total1 use time: " << total1;
    };
    auto result = thread_pool1->enqueue(task);
    return result.get();
}

std::vector<float> step2(const std::vector<float> &image) {
    auto task = [&, this] {
        Timer t1;
        auto ort_outputs = session2->Run(session2->GetInputNames(), input, output_names);
        Timer t2;
        total2 = t2 - t1;
        cout << "total2 use time: " << total2;
    };
    auto result = thread_pool2->enqueue(task);
    return result.get();
}

int main() {
    for (const auto &image : images) {
        Timer t1;
        auto image1 = step1(image);
        auto ret1 = step2(image1);
        Timer t2;
        total3 = t2 - t1;
        cout << "total3 use time: " << total3;
    }
}
@skottmckay

@skottmckay
Contributor

It would be best to measure the ORT performance separately with no thread pools, and without the inline call to GetInputNames(). That way you're just measuring the cost of the Run and not all the other things going on.

Send one warmup query to each inference session, and measure performance for the following calls.

Also not clear what Timer is. Is that a high resolution timer or not? https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now would be preferable.

@allen20200111
Author

Over a long run, memory and CPU usage do not change much; they are basically the same as before.

@allen20200111
Author

Yes, I used high_resolution_clock::now().

@sophies927 sophies927 added core runtime issues related to core runtime and removed type:performance labels Aug 12, 2022
@allen20200111
Author

I measured the ORT performance separately with no thread pools; the time is the same as before. @skottmckay

@allen20200111
Author

allen20200111 commented Nov 14, 2022

> It would be best to measure the ORT performance separately with no thread pools, and without the inline call to GetInputNames(). That way you're just measuring the cost of the Run and not all the other things going on.
>
> Send one warmup query to each inference session, and measure performance for the following calls.
>
> Also not clear what Timer is. Is that a high resolution timer or not? https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now would be preferable.

reply:
Over a long run, memory and CPU usage do not change much; they are basically the same as before.
I used high_resolution_clock::now().
I measured the ORT performance separately with no thread pools; the time is the same as before.

Can you give me more suggestions? Thank you!

@allen20200111
Author

by gdb:
0x00007efd417a01a9 in onnxruntime::concurrency::ThreadPool::RunInParallel(std::function<void (unsigned int)>, unsigned int) () from /usr/local/lib64/libonnxruntime.so.1.6.0
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 libuuid-2.23.2-65.el7.x86_64
(gdb) bt
#0 0x00007efd417a01a9 in onnxruntime::concurrency::ThreadPool::RunInParallel(std::function<void (unsigned int)>, unsigned int) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#1 0x00007efd417a05ce in onnxruntime::concurrency::ThreadPool::ParallelForFixedBlockSizeScheduling(long, long, std::function<void (long, long)> const&) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#2 0x00007efd417a06a5 in onnxruntime::concurrency::ThreadPool::SimpleParallelFor(long, std::function<void (long)> const&) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#3 0x00007efd417ef558 in MlasExecuteThreaded(void (*)(void*, int), void*, int, onnxruntime::concurrency::ThreadPool*) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#4 0x00007efd417b98fc in MlasNchwcConv(long const*, long const*, long const*, long const*, long const*, long const*, unsigned long, float const*, float const*, float const*, float*, MLAS_ACTIVATION const*, bool, onnxruntime::concurrency::ThreadPool*) () from /usr/local/lib64/libonnxruntime.so.1.6.0
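The frames above show the intra-op thread pool (ThreadPool::RunInParallel) parallelizing an NCHWc convolution (MlasNchwcConv), so the wait is happening inside ORT's own worker threads; combined with the app's two single-thread pools and two sessions, thread oversubscription or busy-spinning workers could contribute. Newer ONNX Runtime releases expose a config key to disable intra-op spinning; a hedged sketch (the key is "session.intra_op.allow_spinning"; note the backtrace shows libonnxruntime.so.1.6.0, and this option may not exist in that version):

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch: disable intra-op thread spinning (available in newer ORT releases;
// may not be supported by 1.6.0, which the backtrace shows is in use).
Ort::SessionOptions opts;
opts.AddConfigEntry("session.intra_op.allow_spinning", "0");
```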
