Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TRT EP memory leak fix #7415

Merged
merged 3 commits into from
Apr 23, 2021
Merged

TRT EP memory leak fix #7415

merged 3 commits into from
Apr 23, 2021

Conversation

chilo-ms
Copy link
Contributor

@chilo-ms chilo-ms commented Apr 22, 2021

Description: Fix memory leak.

It’s been detected by Address Sanitizer here:
Direct leak of 48 byte(s) in 1 object(s) allocated from:
0 0x7f9124743b in operator new(unsigned long) (/usr/lib/aarch64-linux-gnu/libasan.so.4+0xd243b)
1 0x7f63c5244b in createInferRuntime_INTERNAL (/usr/lib/aarch64-linux-gnu/libnvinfer.so.7+0x35a44b)
2 0x7f70d806a7 ()
3 0x7f70d86a4f ()
4 0x7f70ddd403 ()
5 0x7f70ddcfd3 ()
6 0x7f8c04bb7f in (anonymous namespace)::InitializeSession(OrtSessionOptions const*, std::unique_ptr<onnxruntime::InferenceSession, std::default_deleteonnxruntime::InferenceSession >&) (/home/onnxruntime/repos/onnxruntime/build/Linux/Debug/libonnxruntime.so.1.7.0+0xb1b7f)
7 0x7f8c04bf1f in OrtApis::CreateSession(OrtEnv const*, char const*, OrtSessionOptions const*, OrtSession**) (/home/onnxruntime/repos/onnxruntime/build/Linux/Debug/libonnxruntime.so.1.7.0+0xb1f1f)
8 0x5586028ccb in main /home/onnxruntime/repos/valgrind/onnxrt_trt_memsample/main.cpp:65
9 0x7f8bca86df in __libc_start_main (/lib/aarch64-linux-gnu/libc.so.6+0x206df)
10 0x55860284a3 (/home/onnxruntime/repos/valgrind/onnxrt_trt_memsample/build/onnx_memtest+0x54a3)

as well as by valgrind here for the same leak:
==132002== 72 (48 direct, 24 indirect) bytes in 1 blocks are definitely lost in loss record 3,517 of 5,620
==132002== at 0x483CFE3: operator new(unsigned long) (vg_replace_malloc.c:417)
==132002== by 0x10C5E5BAB: createInferRuntime_INTERNAL (in /usr/lib/libnvinfer.so.7.1.3)
==132002== by 0xFF027C51: nvinfer1::(anonymous namespace)::createInferRuntime(nvinfer1::ILogger&) (NvInferRuntime.h:1906)
==132002== by 0xFF02A191: onnxruntime::TensorrtExecutionProvider::TensorrtExecutionProvider(onnxruntime::TensorrtExecutionProviderInfo const&) (tensorrt_execution_provider.cc:226)

Motivation and Context

  • Why is this change required? What problem does it solve?
  • If it fixes an open issue, please link to the issue here.

@chilo-ms chilo-ms requested review from stevenlix and jywu-msft April 22, 2021 17:04
@chilo-ms chilo-ms requested a review from a team as a code owner April 22, 2021 17:04
@jywu-msft
Copy link
Member

trt_state->runtime is a pointer to tensorrt_ptr::unique_pointernvinfer1::IRuntime, so we need to use get() to get the raw pointer of unique_ptr first and then access the object's member function.

Not quite sure whether there is a more concise way to write it.

that's weird. operator->() for unique_ptr is the same as get()
i.e. unique_ptr-> should be the same as unique_ptr.get()

anyway, my comment was also about not needing to store raw pointer in local variable.
you can still do trt_state->runtime.get()->deserializeCudaEngine()
and remove line 1298 (auto runtime = trt_state->runtime->get() )

@jywu-msft
Copy link
Member

jywu-msft commented Apr 22, 2021

trt_state->runtime is a pointer to tensorrt_ptr::unique_pointernvinfer1::IRuntime, so we need to use get() to get the raw pointer of unique_ptr first and then access the object's member function.
Not quite sure whether there is a more concise way to write it.

that's weird. operator->() for unique_ptr is the same as get()
i.e. unique_ptr-> should be the same as unique_ptr.get()

anyway, my comment was also about not needing to store raw pointer in local variable.
you can still do trt_state->runtime.get()->deserializeCudaEngine()
and remove line 1298 (auto runtime = trt_state->runtime->get() )

oh now I see why. runtime is a pointer to a unique_ptr , that is kind of strange to have pointer to a unique_ptr.

@@ -99,7 +99,7 @@ struct TensorrtFuncState {
std::string trt_node_name_with_precision;
bool engine_cache_enable;
std::string engine_cache_path;
nvinfer1::IRuntime* runtime = nullptr;
tensorrt_ptr::unique_pointer<nvinfer1::IRuntime>* runtime = nullptr;
Copy link
Member

@jywu-msft jywu-msft Apr 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pointer to unique_ptr is kind of strange.
would it be better to just keep it as a raw pointer as before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explained in other comment.

@@ -1243,7 +1243,7 @@ common::Status TensorrtExecutionProvider::Compile(const std::vector<Node*>& fuse
&engines_[context->node_name], &contexts_[context->node_name], &builders_[context->node_name],
&networks_[context->node_name], input_info_[context->node_name], output_info_[context->node_name],
input_shape_ranges_[context->node_name], &tensorrt_mu_, &fp16_enable_, &int8_enable_, &max_workspace_size_,
trt_node_name_with_precision, engine_cache_enable_, cache_path_, runtime_, nullptr,
trt_node_name_with_precision, engine_cache_enable_, cache_path_, &runtime_, nullptr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we just pass the raw pointer here instead, instead of passing the address of the unique_ptr

Copy link
Contributor Author

@chilo-ms chilo-ms Apr 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can pass the raw pointer as before. But we need to call runtime_.destroy() in destructor in order to release IRuntime object.

Previously, I'm thinking about passing unique_ptr, but it will have following compile error:
/home/onnxruntime/repos/onnxruntime/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc:1247:89: error: use of deleted function ‘std::unique_ptr<_Tp, _Dp>::unique_ptr(const std::unique_ptr<_Tp, _Dp>&) [with _Tp = nvinfer1::IRuntime; _Dp = onnxruntime::tensorrt_ptr::TensorrtInferDeleter]’

The reason is that Copy constructor and Copy assignment of unique_ptr are deleted functions. So it will get error on line 1246. I think that's the reason that in struct TensorrtFuncState, we pass pointer to unique_ptr, such as parser, engine, context....

Copy link
Member

@jywu-msft jywu-msft Apr 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime is already wrapped with a unique_ptr in TrtExecutionProvider right?
it will get freed when TrtExecutionProvider gets destroyed.
what you did, passing address of unique_ptr to the TensorrtFuncState is basically like passing in the underlying raw pointer, but in a more indirect way, right?

Copy link
Contributor Author

@chilo-ms chilo-ms Apr 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing raw pointer is a solution, but it will lose code consistency since other members in struct TensorrtFuncState uses the defined unique_pointer which will handle destroy() automacially.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline. need to clean up some of this code.

@chilo-ms chilo-ms force-pushed the trt_ep_mem_leak_fix branch from 7b5333d to 48c86d5 Compare April 23, 2021 06:28
@jywu-msft jywu-msft merged commit f1c3f3f into master Apr 23, 2021
@jywu-msft jywu-msft deleted the trt_ep_mem_leak_fix branch April 23, 2021 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants