
Assertion bound >= 0 failed in TensorRT 8.6.1 when running build_serialized_network on an NVIDIA Tesla V100 GPU #3639

Closed
elch10 opened this issue Jan 29, 2024 · 30 comments
Labels: internal-bug-tracked (Tracked internally, will be fixed in a future release.), triaged (Issue has been triaged by maintainers)

Comments

@elch10 commented Jan 29, 2024:

Description

I'm trying to convert a slightly modified version of the VITS model (https://github.com/jaywalnut310/vits), but I get an error when running builder.build_serialized_network:

[01/29/2024-13:52:34] [TRT] [I] Graph optimization time: 0.629615 seconds.
[01/29/2024-13:52:34] [TRT] [W] BuilderFlag::kENABLE_TACTIC_HEURISTIC has been ignored in this builder run. This feature is only supported on Ampere and beyond.
[01/29/2024-13:52:34] [TRT] [V] Building graph using backend strategy 0
[01/29/2024-13:52:34] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[01/29/2024-13:52:34] [TRT] [V] Constructing optimization profile number 0 [1/1].
[01/29/2024-13:52:34] [TRT] [E] 2: Assertion bound >= 0 failed. 
[01/29/2024-13:52:34] [TRT] [E] 2: [shapeContext.cpp::checkVolume::2923] Error Code 2: Internal Error (Assertion bound >= 0 failed. )

Environment

TensorRT Version: 8.6.1

NVIDIA GPU: Nvidia Tesla v100

NVIDIA Driver Version: 450.216.04

CUDA Version: 11.6

CUDNN Version: 8.9

Operating System: Ubuntu 22.04.3 inside Docker Container

Python Version (if applicable): 3.11

PyTorch Version (if applicable): 1.13.1

Steps To Reproduce

Have you tried the latest release?: yes

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): yes

@elch10 (Author) commented Jan 29, 2024:

What does that error mean? How can I debug this?

@zerollzeng (Collaborator):

Does it work with onnxruntime? You can check quickly with polygraphy run model.onnx --onnxrt. If yes, could you please provide a repro? Thanks!

zerollzeng self-assigned this on Jan 30, 2024
zerollzeng added the triaged label on Jan 30, 2024
@elch10 (Author) commented Jan 30, 2024:

Of course, it works with onnxruntime and Polygraphy. Polygraphy output:

[I] RUNNING | Command: /home/user/conda/envs/ekerimov-convert/bin/polygraphy run onnx_500k/generator.onnx --onnxrt
[I] onnxrt-runner-N0-01/29/24-15:45:19  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[W] Input tensor: text_emb [shape=BoundedShape(['batch_axis', 'text_axis', 192], min=None, max=None)] | Will generate data of shape: [1, 1, 192].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: q_labels [shape=BoundedShape(['batch_axis', 'text_axis', 5], min=None, max=None)] | Will generate data of shape: [1, 1, 5].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: bert_emb [shape=BoundedShape(['batch_axis', 'token_axis', 768], min=None, max=None)] | Will generate data of shape: [1, 1, 768].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: speaker_ids [shape=BoundedShape(['batch_axis'], min=None, max=None)] | Will generate data of shape: [1].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: length_scale [shape=BoundedShape(['batch_axis', 'text_axis'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: noise_scale [shape=BoundedShape(['batch_axis'], min=None, max=None)] | Will generate data of shape: [1].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: noise_scale_w [shape=BoundedShape(['batch_axis'], min=None, max=None)] | Will generate data of shape: [1].
    If this is incorrect, please provide a custom data loader.
[I] onnxrt-runner-N0-01/29/24-15:45:19 
    ---- Inference Input(s) ----
    {text_emb [dtype=float32, shape=(1, 1, 192)],
     q_labels [dtype=int64, shape=(1, 1, 5)],
     bert_emb [dtype=float32, shape=(1, 1, 768)],
     speaker_ids [dtype=int64, shape=(1,)],
     length_scale [dtype=float32, shape=(1, 1)],
     noise_scale [dtype=float32, shape=(1,)],
     noise_scale_w [dtype=float32, shape=(1,)]}
[I] onnxrt-runner-N0-01/29/24-15:45:19 
    ---- Inference Output(s) ----
    {wav [dtype=float32, shape=(1, 1, 1024)],
     attn [dtype=float32, shape=(1, 4, 1)]}
[I] onnxrt-runner-N0-01/29/24-15:45:19  | Completed 1 iteration(s) in 67.58 ms | Average inference time: 67.58 ms.
[I] PASSED | Runtime: 3.193s | Command: /home/user/conda/envs/ekerimov-convert/bin/polygraphy run onnx_500k/generator.onnx --onnxrt

@zerollzeng (Collaborator):

Could you please provide a repro? Thanks!

@zerollzeng (Collaborator):

It would be great if you could try TRT 9.2/9.3 first.

@elch10 (Author) commented Feb 2, 2024:

Is there a Python wheel for TRT 9.2/9.3, or do I need trtexec?

@zerollzeng (Collaborator):

The Python wheel should be shipped with the tar package.

@elch10 (Author) commented Feb 7, 2024:

I couldn't find a wheel in the tar package of the current repo. I did find the archives at https://developer.nvidia.com/nvidia-tensorrt-8x-download, but the version there is also 8.6.1.

I uploaded the ONNX model to reproduce the issue: https://drive.google.com/file/d/1nlXTliLV9M7_Z1xiQnUXYP_p8UqbEUBk/view?usp=sharing

@elch10 (Author) commented Feb 7, 2024:

And use the following code:

# %%
import tensorrt as trt
import onnx

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)


# %%
success = parser.parse_from_file('generator.onnx')
for idx in range(parser.num_errors):
    err = parser.get_error(idx)
    print(err)

if not success:
    exit(0)

# %%
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1024 * 1024 * 1024)
config.flags |= 1 << int(trt.BuilderFlag.DEBUG)
config.clear_flag(trt.BuilderFlag.TF32)



MIN_TIME_AXIS = 1
MAX_TIME_AXIS = 400

MIN_TIME_AXIS_BERT = 1
MAX_TIME_AXIS_BERT = 50

# test input
TEST_TIME_AXIS = 400
TEST_TIME_AXIS_BERT = 50

TEXT_EMB_SIZE = 192
N_Q_FEATURES = 5
BERT_EMB_DIM = 768


dynamic_shape_config = [
    {"input": "text_emb", "min": (1, MIN_TIME_AXIS, TEXT_EMB_SIZE), "opt": (1, MAX_TIME_AXIS, TEXT_EMB_SIZE), "max": (1, MAX_TIME_AXIS, TEXT_EMB_SIZE)},
    {"input": "q_labels", "min": (1, MIN_TIME_AXIS, N_Q_FEATURES), "opt": (1, MAX_TIME_AXIS, N_Q_FEATURES), "max": (1, MAX_TIME_AXIS, N_Q_FEATURES)},
    {"input": "bert_emb", "min": (1, MIN_TIME_AXIS_BERT, BERT_EMB_DIM), "opt": (1, MAX_TIME_AXIS_BERT, BERT_EMB_DIM), "max": (1, MAX_TIME_AXIS_BERT, BERT_EMB_DIM)},
    {"input": 'speaker_ids', "min": (1,), "opt": (1,), "max": (1,)},
    {"input": 'noise_scale', "min": (1,), "opt": (1,), "max": (1,)},
    {"input": 'noise_scale_w', "min": (1,), "opt": (1,), "max": (1,)},
    {"input": 'length_scale', "min": (1, MIN_TIME_AXIS,), "opt": (1, MAX_TIME_AXIS,), "max": (1, MAX_TIME_AXIS,)},
]

profile = builder.create_optimization_profile()
for s in dynamic_shape_config:
    profile.set_shape(**s)

config.add_optimization_profile(profile)
# config.builder_optimization_level = 0


ser_engine = builder.build_serialized_network(network, config)
if ser_engine is None:
    raise RuntimeError('build_serialized_network failed')

with open('generator.trt', 'wb') as f:
    f.write(ser_engine)

@elch10 (Author) commented Feb 7, 2024:

I found that the error is caused by this line, https://github.com/jaywalnut310/vits/blob/main/models.py#L517, or rather by attn.squeeze(). Because squeeze doesn't work (#2846), I used attn = attn[:, 0] instead, followed by the matmul.
TRT raises the error because of attn[:, 0]. If I comment out this line and everything that uses it afterwards, the conversion works fine.
The shape of attn is (batch_size, 1, t_1, t_2).

@elch10 (Author) commented Feb 8, 2024:

I also tried trtexec versions 8.6 and 7.x, and the same error occurs.

@zerollzeng (Collaborator):

Test with TRT 9.2:

[02/19/2024-08:07:17] [W] [TRT] /dp/flows.7/Reshape_14: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.7/Reshape_16: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.7/Reshape_20: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.7/Reshape_22: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.7/Reshape_24: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.7/Reshape_26: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.5/Reshape_14: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.5/Reshape_16: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.5/Reshape_20: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.5/Reshape_22: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.5/Reshape_24: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.5/Reshape_26: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.3/Reshape_14: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.3/Reshape_16: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.3/Reshape_20: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.3/Reshape_22: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.3/Reshape_24: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [W] [TRT] /dp/flows.3/Reshape_26: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/19/2024-08:07:17] [E] Error[4]: [fillNode.cpp::symbolicExecute::109] Error Code 4: Internal Error (/dp/RandomNormalLike: an IFillLayer can compute a shape tensor only for FillOperation::kLINSPACE.)
[02/19/2024-08:07:17] [E] Engine could not be created from network
[02/19/2024-08:07:17] [E] Building engine failed
[02/19/2024-08:07:17] [E] Failed to create engine from model or file.
[02/19/2024-08:07:17] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v9200] # trtexec --onnx=generator.onnx

Looks like we hit a known limitation. What are the real input shapes?

@elch10 (Author) commented Feb 19, 2024:

I ran with this command:
/usr/src/tensorrt/bin/trtexec --onnx=generator.onnx --minShapes=text_emb:1x1x192,q_labels:1x1x5,bert_emb:1x1x768,speaker_ids:1,noise_scale:1,noise_scale_w:1,length_scale:1x1 --optShapes=text_emb:1x400x192,q_labels:1x400x5,bert_emb:1x50x768,speaker_ids:1,noise_scale:1,noise_scale_w:1,length_scale:1x400 --maxShapes=text_emb:1x400x192,q_labels:1x400x5,bert_emb:1x50x768,speaker_ids:1,noise_scale:1,noise_scale_w:1,length_scale:1x400 --workspace=30000

@elch10 (Author) commented Feb 19, 2024:

I saw something about RandomNormalLike somewhere, but as I remember, the solution was just to update TensorRT.

@elch10 (Author) commented Feb 27, 2024:

Any updates?
I've encountered a similar issue using TRT 9.2.0.5. It is also related to the StochasticDurationPredictor module (https://github.com/jaywalnut310/vits/blob/main/models.py#L17), as in your output with RandomNormalLike:

[02/27/2024-12:21:40] [W] [TRT] /dp/flows.3/Reshape_26: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[02/27/2024-12:21:41] [E] Error[4]: [fillNode.cpp::symbolicExecute::112] Error Code 4: Internal Error (/dp/flows.7/Range: An IFillLayer that computes a shape tensor can have at most one input, and the input must be the first input.)
[02/27/2024-12:21:41] [E] Engine could not be created from network
[02/27/2024-12:21:41] [E] Building engine failed
[02/27/2024-12:21:41] [E] Failed to create engine from model or file.
[02/27/2024-12:21:41] [E] Engine set up failed

@zerollzeng (Collaborator):

Filed internal bug 4535894 for this.

zerollzeng added the internal-bug-tracked label on Feb 28, 2024
@ArchRobison commented:

Just an aside: I noticed the network is using what TensorRT calls "zero as placeholder", which indicates the original ONNX file is not setting the attribute "allowzero=1" for Reshape.

When "allowzero=1" is not present, ONNX treats a 0 in a reshape dimension not as a dimension, but as a placeholder for the corresponding input dimension. With dynamic shapes this is almost never what the author intended, and tends to break networks.

Attached is a zip file with a python script that I sometimes use to repair networks where the author did not intend 0 to be a placeholder.

allowzero.zip
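
For reference, here is a minimal sketch of the same idea (this is not the attached script, just an illustration assuming the standard onnx package and an opset >= 14 model, since allowzero was added to Reshape in opset 14):

import onnx
from onnx import helper

model = onnx.load('generator.onnx')  # example path
for node in model.graph.node:
    if node.op_type != 'Reshape':
        continue
    # Set allowzero=1 so that a 0 in the shape tensor means a real zero-length
    # dimension, not a placeholder for the corresponding input dimension.
    attr = next((a for a in node.attribute if a.name == 'allowzero'), None)
    if attr is not None:
        attr.i = 1
    else:
        node.attribute.append(helper.make_attribute('allowzero', 1))
onnx.save(model, 'generator_allowzero.onnx')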

@elch10 (Author) commented Mar 5, 2024:

It doesn't help; I get the same error:

[03/05/2024-10:41:25] [TRT] [I] Graph optimization time: 0.513168 seconds.
[03/05/2024-10:41:25] [TRT] [W] BuilderFlag::kENABLE_TACTIC_HEURISTIC has been ignored in this builder run. This feature is only supported on Ampere and beyond.
[03/05/2024-10:41:25] [TRT] [V] Building graph using backend strategy 0
[03/05/2024-10:41:25] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[03/05/2024-10:41:25] [TRT] [V] Constructing optimization profile number 0 [1/1].
[03/05/2024-10:41:25] [TRT] [E] 2: Assertion bound >= 0 failed. 
[03/05/2024-10:41:25] [TRT] [E] 2: [shapeContext.cpp::checkVolume::2923] Error Code 2: Internal Error (Assertion bound >= 0 failed. )

@elch10 (Author) commented Mar 5, 2024:

Maybe this will help:
If I split this module into two modules at this line, https://github.com/jaywalnut310/vits/blob/main/models.py#L515,
i.e. the first module contains https://github.com/jaywalnut310/vits/blob/main/models.py#L501-L514
and the second https://github.com/jaywalnut310/vits/blob/main/models.py#L515-L522,
then both modules convert without any errors, and I can run them one after the other.
But when the two modules are inside one big module, the above error occurs.

@ArchRobison commented:

There is an error in TensorRT that affects attempts to use IFillLayer with mode kRANDOM_UNIFORM or kRANDOM_NORMAL to construct a shape tensor. The mistake in TensorRT was that one part of the logic incorrectly claimed "I can deliver a shape tensor" and the other part later said "That's not allowed."

The fill layers come from the nodes /RandomNormalLike and /dp/RandomNormalLike. The first one's output has variable dimensions, which rules it out as a shape tensor, so I think it's /dp/RandomNormalLike_output_0 that is triggering the bug.

The following hack might work. When the output of an IConvolutionLayer is used as a shape tensor, TensorRT deals with it correctly, even though the layer says "I can't deliver a shape tensor". The hack is to feed the output of the IFillLayer through a dummy 1x1 IConvolutionLayer that is just an identity operation, i.e. its weights are an identity matrix. TensorRT should be able to handle that, because the convolution stops TensorRT from asking the IFillLayer to deliver a shape tensor. A complication is that IConvolutionLayer needs 4D input, so you'll need to add some reshaping to compensate.

So at the TensorRT level, the replacement for the IFillLayer looks something like:

IFillLayer --> IShuffleLayer --> IConvolutionLayer --> IShuffleLayer -->

where the first IShuffleLayer does a 3D to 4D reshape and the second IShuffleLayer does a 4D to 3D reshape. E.g., first shuffle can reshape from [1,2,1] to [1,1,2,1] and second shuffle can reshape the other direction. The convolution sees a channel-dimension of length 1, so the identity matrix is just a 1x1 matrix containing 1.
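
For illustration only, a rough sketch of that wrapper with the TensorRT Python API could look like the following (assuming network is the INetworkDefinition and fill_out is the 3D output tensor of the IFillLayer; this is not tested against this particular model):

import numpy as np
import tensorrt as trt

def wrap_fill_output(network, fill_out):
    # 3D -> 4D: insert a channel dimension of size 1 so a Conv can follow.
    to_4d = network.add_shuffle(fill_out)
    to_4d.reshape_dims = (0, 1, 2, -1)   # (b, 2, t) -> (b, 1, 2, t)

    # Dummy 1x1 convolution that is an identity: 1 input/output channel, weight = 1.
    # Note: keep identity_w alive until the engine is built.
    identity_w = np.ones((1, 1, 1, 1), dtype=np.float32)
    conv = network.add_convolution_nd(to_4d.get_output(0), 1, (1, 1), trt.Weights(identity_w))

    # 4D -> 3D: drop the channel dimension again.
    to_3d = network.add_shuffle(conv.get_output(0))
    to_3d.reshape_dims = (0, 2, -1)      # (b, 1, 2, t) -> (b, 2, t)
    return to_3d.get_output(0)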

Of course what I've described is at the TensorRT level. You're probably more interested in an ONNX-level description. At the ONNX level, the hack looks like replacing RandomNormalLike /dp/RandomNormalLike with:

RandomNormalLike --> Reshape --> Conv --> Reshape -->

@elch10 (Author) commented Mar 26, 2024:

The RandomNormalLike came from here: https://github.com/jaywalnut310/vits/blob/main/models.py#L90
I replaced that line with:

      z = torch.randn(x.size(0), 2, x.size(2)).to(device=x.device, dtype=x.dtype) # (b, 2, t)
      z = z.unsqueeze(1) # (b, 1, 2, t)
      z = F.conv2d(z, z.new_ones(1, 1, 1, 1)) # identity
      z = z[:, 0] # (b, 2, t)

      z = z * noise_scale

And it seems to work. Will such a fix be added to TensorRT itself?

I'm testing now; if any other errors occur, I'll let you know.

@zerollzeng (Collaborator):

Hi, this issue cannot be fixed in the short term and is still being tracked internally. To unblock you, we have prepared a WAR (workaround). Could you please try it on your side?

WAR:

  1. Upgrade to TRT 10.0.
  2. Add a Cast operation converting FP32 to INT64 before the /Clip operation, as shown in the following figure.
    (image attached)
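
As a rough illustration of step 2, a hypothetical sketch using onnx-graphsurgeon (not an official recipe; the node and file names come from the logs in this thread and may differ in your model):

import onnx
import onnx_graphsurgeon as gs
import numpy as np

graph = gs.import_onnx(onnx.load('generator.onnx'))
clip = next(n for n in graph.nodes if n.name == '/Clip')

# New intermediate tensor that carries the casted value.
casted = gs.Variable(clip.inputs[0].name + '_casted', dtype=np.int64)
cast = gs.Node(op='Cast', name=clip.name + '_pre_cast',
               attrs={'to': onnx.TensorProto.INT64},
               inputs=[clip.inputs[0]], outputs=[casted])
graph.nodes.append(cast)
clip.inputs[0] = casted  # rewire /Clip to consume the casted tensor

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), 'generator_war.onnx')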

@ttyio (Collaborator) commented Jul 2, 2024:

Closing since there is a WAR. Thanks, all!

ttyio closed this as completed on Jul 2, 2024
@jingzhaoo commented:

I ran into the same issue today. @zerollzeng, thanks a lot for the WAR. Any instructions on how to add that Cast operation? Should I make some changes to the original model? I am eager to try it out.

@clumsyroot commented Oct 10, 2024:

[TRT] [E] [shapeContext.cpp::checkVolume::3570] Error Code 2: Internal Error (Assertion bound >= 0 failed. )

Same issue +1 (VITS), looking forward to some progress.
When I tried to solve it with the WAR, I encountered the following errors:

[10/10/2024-12:42:46] [TRT] [W] IElementWiseLayer with inputs /ReduceSum_output_0_casted and ONNXTRT_Broadcast_12929_output: first input has type Int64 but second input has type Float.
[10/10/2024-12:42:46] [TRT] [E] ITensor::getDimensions: Error Code 4: API Usage Error ((Unnamed Layer* 13578) [ElementWise]: IElementWiseLayer with MAX operation has incompatible input types Int64 and Float type.)
[10/10/2024-12:42:46] [TRT] [E] ModelImporter.cpp:949: While parsing node number 5873 [Clip -> "/Clip_output_0"]:
[10/10/2024-12:42:46] [TRT] [E] ModelImporter.cpp:950: --- Begin node ---
input: "/ReduceSum_output_0_casted"
input: "/Cast_output_0"
input: ""
output: "/Clip_output_0"
name: "/Clip"
op_type: "Clip"

[10/10/2024-12:42:46] [TRT] [E] ModelImporter.cpp:951: --- End node ---
[10/10/2024-12:42:46] [TRT] [E] ModelImporter.cpp:954: ERROR: ModelImporter.cpp:195 In function parseNode:
[6] Invalid Node - /Clip
ITensor::getDimensions: Error Code 4: API Usage Error ((Unnamed Layer* 13578) [ElementWise]: IElementWiseLayer with MAX operation has incompatible input types Int64 and Float type.)
In node 5873 with name: /Clip and operator: Clip (parseNode): INVALID_NODE: Invalid Node - /Clip
ITensor::getDimensions: Error Code 4: API Usage Error ((Unnamed Layer* 13578) [ElementWise]: IElementWiseLayer with MAX operation has incompatible input types Int64 and Float type.)

Any advice? Thanks a lot. @zerollzeng

@clumsyroot commented:

(Quoting my previous comment above.)

OK, after some debugging I found that the error was caused by the following line of code:
https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/models.py#L512
In my use case I don't need batch inference, so I commented out this line and set the mask to all ones, which solved the problem. You could perhaps try other ways to achieve what this line does. It's worth mentioning that I'm still curious why this error occurs in TensorRT. 🤔

@jingzhaoo commented:

I am still blocked by this issue and wonder if anyone can help me out. I do need batch inference to achieve better performance. Regarding the WAR mentioned above, I can see the Cast node is already there (see below) and I got the same error when converting ONNX to TensorRT. Any suggestion would be highly appreciated.

(image attached)

@jingzhaoo commented:

@zerollzeng Possible to reopen this ticket so that we can take a closer look at it? Thanks.

@jingzhaoo commented:

(Quoting @clumsyroot's comment above.)

I added a Cast node as suggested in the WAR as shown below and then encountered the same error in the Clip node. Appreciate your help, @zerollzeng.

(image attached)

@jingzhaoo commented Nov 26, 2024:

I resolved the Invalid Node - /Clip error after adding the extra Cast operator. The following Clip operator has three inputs: after casting its input to int64, we also need to cast its min and max inputs to int64. However, I still ran into the original Assertion bound >= 0 error during TensorRT conversion, so the WAR actually does not work for me. I would appreciate some more help on this issue!
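
For anyone following along, a hypothetical onnx-graphsurgeon sketch of that extra step (my own illustration, continuing the sketch after the WAR above; it only clears the Invalid Node - /Clip parser error, not the underlying assertion):

import onnx
import onnx_graphsurgeon as gs
import numpy as np

graph = gs.import_onnx(onnx.load('generator_war.onnx'))  # model with the first Cast already added
clip = next(n for n in graph.nodes if n.name == '/Clip')

for i, inp in enumerate(list(clip.inputs)):
    if i == 0 or not inp.name:  # data input already casted; skip empty optional inputs
        continue
    casted = gs.Variable(inp.name + '_casted_i64', dtype=np.int64)
    graph.nodes.append(gs.Node(op='Cast', name=clip.name + '_cast_in' + str(i),
                               attrs={'to': onnx.TensorProto.INT64},
                               inputs=[inp], outputs=[casted]))
    clip.inputs[i] = casted  # cast min/max so all Clip inputs share one type

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), 'generator_war2.onnx')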
