
FusedConv Cuda EP invalid argument error. #12321

Open · kiennguyen94 opened this issue Jul 26, 2022 · 5 comments

kiennguyen94 (Contributor) commented Jul 26, 2022

Describe the bug
When running a model containing Conv layers with graph optimizations enabled, ORT throws the following error:

2022-07-26 09:49:04.435452395 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDNN failure 3: CUDNN_STATUS_BAD_PARAM ; GPU=1 ; hostname=tiger09.som.ma ; expr=cudnnAddTensor(Base::CudnnHandle(), &alpha, Base::s_.z_tensor, Base::s_.z_data, &alpha, Base::s_.y_tensor, Base::s_.y_data);
2022-07-26 09:49:04.435491739 [E:onnxruntime:, sequential_executor.cc:368 Execute] Non-zero status code returned while running FusedConv node. Name:'Conv_125' Status Message: CUDNN error executing cudnnAddTensor(Base::CudnnHandle(), &alpha, Base::s_.z_tensor, Base::s_.z_data, &alpha, Base::s_.y_tensor, Base::s_.y_data)
Traceback (most recent call last):
  File "final_repo.py", line 57, in <module>
    output = model.run(None, {"audio_signal": au_sig, "length": length}, run_options=run_opt)
  File "/n/w1-knguyen/conda3/install/envs/py3_cuda/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 200, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running FusedConv node. Name:'Conv_125' Status Message: CUDNN error executing cudnnAddTensor(Base::CudnnHandle(), &alpha, Base::s_.z_tensor, Base::s_.z_data, &alpha, Base::s_.y_tensor, Base::s_.y_data)

Urgency
None

System information

  • OS Platform and Distribution: CentOS 7
  • ONNX Runtime installed from: source
  • ONNX Runtime version: ddb45e9, also observed with 1.12
  • Python version: 3.7.13
  • Visual Studio version (if applicable): None
  • GCC/Compiler version (if compiling from source): 7.5.0
  • CUDA/cuDNN version: 11.3/8.2.1
  • GPU model and memory: 1080 TI - 12GB

To Reproduce

Expected behavior

  • Expect no error.

Screenshots
None

Additional context

  • The code fails at
    CUDNN_RETURN_IF_ERROR(cudnnAddTensor(Base::CudnnHandle(), &alpha, Base::s_.z_tensor, Base::s_.z_data,
    with the error code CUDNN_STATUS_BAD_PARAM, which means that either the dimensions of Z and Y don't match or their datatypes are incompatible (per the cuDNN docs).
  • It turns out that when Z is the output of some previous op, it can have a missing dimension (e.g. Z is [1, 1024, 16] whereas Y is [1, 1024, 16, 1]).
  • So the question is: do we want to support automatic dimension matching in cuda/FusedConv? If so, I think it is enough to add a dimension check (if len(Z.shape) == len(Y.shape) - 1, extend Z.shape by 1) and then call ORT_RETURN_IF_ERROR(s_.z_tensor.Set(new_z_dim, CudnnTensor::GetDataType<CudaT>())); right next to the existing
    ORT_RETURN_IF_ERROR(s_.x_tensor.Set(x_dims_cudnn, CudnnTensor::GetDataType<CudaT>()));
    (see the sketch after this list).
  • Alternatively, we could treat this as an export error, so that the exporter must apply a Reshape to Z before FusedConv. But given that cpu/FusedConv works fine, I don't think we want that.
  • I'm happy to contribute, just need some guidance.
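
A minimal sketch of the first option. This is not the actual patch: the names z_dims, y_dims, and TensorShapeVector are assumptions based on the surrounding code in the CUDA Conv implementation, not verified against the ORT source.

// Sketch only: pad Z's dims with trailing 1s so that its rank matches
// Y's before the cudnnAddTensor call runs.
// z_dims and y_dims are assumed to hold the shapes of Z and Y.
TensorShapeVector z_dims_cudnn(z_dims.begin(), z_dims.end());
while (z_dims_cudnn.size() < y_dims.size()) {
  z_dims_cudnn.push_back(1);  // e.g. [1, 1024, 16] -> [1, 1024, 16, 1]
}
ORT_RETURN_IF_ERROR(s_.z_tensor.Set(z_dims_cudnn, CudnnTensor::GetDataType<CudaT>()));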
RandySheriffH (Contributor)

@kiennguyen94: thanks for reporting this; we are working on a similar issue, #11548.
And yes, the mismatch between Z and Y is the culprit.

RandySheriffH self-assigned this Jul 26, 2022
RandySheriffH (Contributor)

@kiennguyen94: BTW, for your case, would you mind sharing a model and a sample script with us so we can double-confirm that the fix works?

kiennguyen94 (Contributor, Author) commented Jul 28, 2022

@RandySheriffH Sorry for the late reply; here's the repro: https://github.com/kiennguyen94/ort_load_repro.
Just extract the model tarball with tar xvzf ./citrinet.tgz, then run python ./final_repo.py.

I have the fix in a temporary draft PR, #12366. It fixes this particular repro, but I'm not sure whether it introduces unwanted behavior.

RandySheriffH (Contributor)

@kiennguyen94, thanks! For this issue your fix in #12366 should work; I plan to bring it in along with other fixes via https://github.com/microsoft/onnxruntime/tree/FuseConvShapeMismatch.

xfbs commented Feb 27, 2024

For anyone else who runs into this issue: we were able to work around it by lowering the graph optimization level.

import onnxruntime as ort

# Basic optimizations skip the extended fusions (e.g. FusedConv) that trigger the error.
sess_opt = ort.SessionOptions()
sess_opt.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
return ort.InferenceSession(onnx_path, providers=providers, sess_options=sess_opt)

The theory is that some of the optimizations produce a graph that cannot be executed with CUDA, while the ONNX file itself is fine.
