Triton terminated with Signal (6) #4566
Comments
Hi @erichtho, Thanks for reporting this issue.
CC @GuanLuo @tanmayv25 if you've seen any TRT or similar backend issues like this before.
Sorry, I can't share my code; it's part of a big project. I'm trying to simplify it, but I can't reproduce the error with the simplified code (still trying). Also, the HTTP client gets a broken pipe, so I can't tell whether that contributes to the bug.
The back trace suggests that the error originates within TensorRT. I don't think the issue is client specific.
I assume the issue only occurs with sufficient request concurrency? What is your instance count? Can you share your model configuration file?
Yes, it's related to request concurrency, and it seems to appear more often when there are lots of requests with close to the maximum shape.
There are two other models in the model repository; the total instance count is 3. By the way, we also tried serving the ONNX model in Triton instead of the TensorRT engine, and it works normally.
The TensorRT team seems to have a fix that resolves this issue. We are working with them to make the fix available to Triton users.
When using the Triton gRPC client for inference, Triton sometimes exits unexpectedly.
For example, when using a client call like the following:
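The original client snippet was not included in the report, so this is only a minimal sketch of a Triton gRPC inference call; the model name, tensor names, data type, and server URL are assumptions.

```python
# Minimal sketch only -- "my_trt_model", "INPUT0"/"OUTPUT0", the dtype, and
# the URL are placeholders, not the original code.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Request close to the engine's maximum shape (maxShapes=1x80x12000).
data = np.random.rand(1, 80, 11000).astype(np.float32)

infer_input = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

result = client.infer(model_name="my_trt_model", inputs=[infer_input])
print(result.as_numpy("OUTPUT0").shape)
```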
and tritonserver outputs:
terminate called after throwing an instance of 'nvinfer1::InternalError'
what(): Assertion mUsedAllocators.find(alloc) != mUsedAllocators.end() && "Myelin free callback called with invalid MyelinAllocator" failed.
Signal (6) received.
0# 0x00005602FC4F21B9 in tritonserver
1# 0x00007FC98736C0C0 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
4# 0x00007FC987725911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
5# 0x00007FC98773138C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007FC987730369 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# __gxx_personality_v0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
8# 0x00007FC98752BBEF in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
9# _Unwind_RaiseException in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
10# __cxa_throw in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
11# nvinfer1::Lobber<nvinfer1::InternalError>::operator()(char const*, char const*, int, int, nvinfer1::ErrorCode, char const*) in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
12# 0x00007FC9020EECBC in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
13# 0x00007FC902A7220F in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
14# 0x00007FC902A2862D in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
15# 0x00007FC902A7F653 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
16# 0x00007FC9020EE715 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
17# 0x00007FC901C8BAD0 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
18# 0x00007FC9020F41F4 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
19# 0x00007FC902913FD8 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
20# 0x00007FC90291478C in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
21# 0x00007FC97A57C6D7 in /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
22# 0x00007FC97A5855FE in /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
23# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
24# 0x00007FC987C1D73A in /opt/tritonserver/bin/../lib/libtritonserver.so
25# 0x00007FC987C1E0F7 in /opt/tritonserver/bin/../lib/libtritonserver.so
26# 0x00007FC987CDB411 in /opt/tritonserver/bin/../lib/libtritonserver.so
27# 0x00007FC987C175C7 in /opt/tritonserver/bin/../lib/libtritonserver.so
28# 0x00007FC98775DDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
29# 0x00007FC98896D609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
30# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
tritonserver version: 22.05-py3 (docker image)
backend: TensorRT
OS: Ubuntu 20.04
How To Reproduce
We use trtexec to convert an ONNX model into a TensorRT engine (with maxShapes=1x80x12000) and then put it into the Triton model repository.
When we send dozens of requests with a shape close to the maximum, e.g. 1x80x11000 (or around 8000 in the last dimension), while other models are also receiving requests at the same time (each gRPC client in a different process, not via multiprocessing, but multiple .py scripts running), Triton exits by chance.
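As a rough, hedged approximation of that reproduction (the real setup runs several independent .py client scripts in parallel), a single script could fire the near-maximum-shape requests concurrently like this; the model name, tensor names, and request/worker counts are assumptions:

```python
# Hedged reproduction sketch: a thread pool is used here only to approximate
# the concurrency of several independent client scripts. "my_trt_model",
# "INPUT0", and the worker/request counts are placeholders.
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import tritonclient.grpc as grpcclient

def send_request(i):
    # One client/connection per request to mimic separate processes.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    data = np.random.rand(1, 80, 11000).astype(np.float32)  # near maxShapes
    inp = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    return client.infer(model_name="my_trt_model", inputs=[inp])

with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(send_request, range(48)))  # dozens of near-max-shape requests
```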