
'ValueError: only one element tensors can be converted to Python scalars' failure of TensorRT 8.2 when running NVIDIA DALI on GPU V100 #3597

Closed
juinshell opened this issue Jan 13, 2024 · 2 comments
Labels: triaged (Issue has been triaged by maintainers)

Comments


juinshell commented Jan 13, 2024

Description

Hi Nvidia Team,

I'm testing a DNN workflow (SSD + ResNet-50) with TensorRT and NVIDIA DALI (for data preprocessing). I convert a pretrained PyTorch SSD model to ONNX format and load it to build a TensorRT engine.

However, when I run SSD inference, I do not know how to adapt another framework's data to TensorRT's input. For example, how do I convert NVIDIA DALI's nvidia.dali.backend_impl.TensorGPU into an input for tensorrt.tensorrt.IExecutionContext.execute()? If that is difficult, is it possible to pass a 'torch.Tensor' as input data to TensorRT?

Environment

TensorRT Version: 8.2.5.1

NVIDIA GPU: V100

NVIDIA Driver Version: 520.61.05

CUDA Version: 11.8

CUDNN Version: 8401 (cuDNN 8.4.1, as reported by torch.backends.cudnn.version())

Operating System: Ubuntu (kernel 4.15.0-45-generic #48-Ubuntu)

Python Version (if applicable): 3.8.13

Tensorflow Version (if applicable): N/A

PyTorch Version (if applicable): 1.13.0

Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:22.06-py3

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
import torch
import tensorrt as trt
import numpy as np
import time

import pycuda.driver as cuda  

import pycuda.autoinit

print("PyCUDA Version:", cuda.get_version())
print("CUDNN Version:", torch.backends.cudnn.version())
print("Torch Version:", torch.__version__)
print("TensorRT Version:", trt.__version__)

batch = 4
# Step 1: Load pretrained model from PyTorch Hub
ssd_model = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd')
ssd_model.eval()
ssd_model = ssd_model.cuda()

# Step 2: Export PyTorch model to ONNX
ssd_dummy_input = torch.randn(batch, 3, 300, 300).cuda()
ssd_onnx_file_path = 'ssd.onnx'
torch.onnx.export(ssd_model, ssd_dummy_input, ssd_onnx_file_path)

# Step 3: Create a TensorRT engine from the ONNX model
TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def make_rt_engine(onnx_file_path):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 34
        builder.max_batch_size = batch
        
        if builder.platform_has_fast_fp16:
            config.set_flag(trt.BuilderFlag.FP16)

        # Parse the ONNX model
        print("Parsing onnx file {}...".format(onnx_file_path))
        with open(onnx_file_path, 'rb') as model_file:
            if not parser.parse(model_file.read()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))

        # Build and serialize the TensorRT engine
        profile = builder.create_optimization_profile()
        config.add_optimization_profile(profile)

        print("Building an engine...")
        engine = builder.build_engine(network, config)
    return engine

def execute_ssd_engine(engine, input_data):
    with engine.create_execution_context() as context:
        max_nboxes = 8732
        h_ploc = np.zeros((batch, 4, max_nboxes), dtype=np.float32)
        h_plabel = np.zeros((batch, 81, max_nboxes), dtype=np.float32)

        d_ploc = cuda.mem_alloc(h_ploc.nbytes) 
        d_plabel = cuda.mem_alloc(h_plabel.nbytes)

        print("d_ploc: ", d_ploc)
        print("d_plabel: ", d_plabel)
        print("context type: ", type(context))
        input("Press Enter to continue...")
        
        # Execute inference
        # warmup
        print("[SSD]warmup...")
        for i in range(50):
            context.execute(batch, [int(input_data), int(d_ploc), int(d_plabel)])
        
        print("[SSD]start inference...")
        times = []
        for i in range(100):
            T1=time.perf_counter()   
            context.execute(batch, [int(input_data), int(d_ploc), int(d_plabel)])
            T2=time.perf_counter()
            times.append(T2-T1)
            if (i + 1) % 10 == 0:
                print('[SSD]TensorRT Inference {:d}/{:d}: {:.3f}ms'.format(i + 1, 100, np.mean(times) * 1000))

        # Transfer output data to host
        cuda.memcpy_dtoh(h_ploc, d_ploc)
        cuda.memcpy_dtoh(h_plabel, d_plabel)
        print("[SSD]finish!")
    return h_ploc, h_plabel

ssd_engine = make_rt_engine(ssd_onnx_file_path)

# DALI
import nvidia.dali.fn as fn
from nvidia.dali.pipeline.experimental import pipeline_def
import nvidia.dali.types as types
from nvidia.dali.plugin.pytorch import feed_ndarray

@pipeline_def()
def simple_pipeline(resize_flag=True, resize_x=300, resize_y=300):
    jpegs, labels = fn.readers.file(file_root='./images',
                                    random_shuffle=True,
                                    name="Reader")
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)

    # Resize images
    if resize_flag:
        resized_images = fn.resize(images, resize_x=resize_x, resize_y=resize_y, device="gpu")

    # Normalize images
    mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
    std = [0.229 * 255, 0.224 * 255, 0.225 * 255]

    normalized_images = fn.crop_mirror_normalize(resized_images,
                                    mean=mean,
                                    std=std,
                                    output_dtype=types.FLOAT,
                                    device="gpu")

    return normalized_images, resized_images

pipe = simple_pipeline(batch_size=4, num_threads=3, device_id=0)
pipe.build()

_images, resized_images = pipe.run()
print("-----_images type: ", type(_images)) # nvidia.dali.backend_impl.TensorListGPU
print("-----_images.dtype: ", _images.dtype) # DALIDataType.FLOAT
print("-----_images.shape: ", _images.shape()) # [(3, 300, 300), (3, 300, 300), (3, 300, 300), (3, 300, 300)]
print("-----_images.layout:", _images.layout()) # CHW

input("Press Enter to continue...")

_images_dali_tensor = _images.as_tensor()
print("-----_images_dali_tensor type:", type(_images_dali_tensor))  # <class 'nvidia.dali.backend_impl.TensorGPU'>
print("-----_images_dali_tensor dtype:", _images_dali_tensor.dtype())  # =f
print("-----_images_dali_tensor shape:", _images_dali_tensor.shape())  # (4, 3, 300, 300)
print("-----_images_dali_tensor layout:", _images_dali_tensor.layout())  # NCHW

_images_dali_tensor_data_ptr = _images_dali_tensor.data_ptr()
print("-----_images_dali_tensor_data_ptr:", _images_dali_tensor_data_ptr) # 4398129168384
...

Have you tried the latest release?: no

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): not tested

juinshell changed the title from "Segmentation fault (core dumped) failure of TensorRT 8.2 output when running SSD on GPU V100" to "'ValueError: only one element tensors can be converted to Python scalars' failure of TensorRT 8.2 when running NVIDIA DALI on GPU V100" on Jan 13, 2024
zerollzeng (Collaborator) commented:

> is it possible to pass a 'torch.Tensor' as input data to TensorRT?

Check #2506
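
For reference, a minimal sketch of that approach, assuming engine is the already-built SSD engine from the script above and that the binding order is input, ploc, plabel (both are assumptions, not something this thread confirms):

import torch

# Bind CUDA torch.Tensors to TensorRT by raw device pointer.
batch = 4
d_input = torch.randn(batch, 3, 300, 300, dtype=torch.float32, device='cuda')
d_ploc = torch.empty(batch, 4, 8732, dtype=torch.float32, device='cuda')
d_plabel = torch.empty(batch, 81, 8732, dtype=torch.float32, device='cuda')

with engine.create_execution_context() as context:
    # .data_ptr() returns the device address as a plain Python int, which is
    # what TensorRT expects in its bindings list; no host/device copy is needed.
    bindings = [d_input.data_ptr(), d_ploc.data_ptr(), d_plabel.data_ptr()]
    context.execute_v2(bindings)

# d_ploc and d_plabel now hold the network outputs as ordinary CUDA tensors.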

zerollzeng self-assigned this on Jan 15, 2024
zerollzeng added the triaged (Issue has been triaged by maintainers) label on Jan 15, 2024
juinshell (Author) commented:

Hi @zerollzeng,
Thanks for your help! I used .data_ptr() and it works! In my test, TensorRT can take an NVIDIA DALI TensorListGPU's data_ptr as the input.
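
For completeness, a sketch of the working combination, reusing names from the script above (ssd_engine, pipe, batch; the pycuda output buffers d_ploc and d_plabel are assumed to be allocated as in execute_ssd_engine). The ValueError in the title is what int() raises on a multi-element torch.Tensor; data_ptr() avoids it because it already returns the device address as a plain Python int:

_images, _ = pipe.run()
_images_dali_tensor = _images.as_tensor()  # nvidia.dali.backend_impl.TensorGPU, NCHW

# data_ptr() exposes the raw GPU address of the DALI batch as a Python int,
# which TensorRT consumes directly as a binding; no intermediate copy.
input_ptr = _images_dali_tensor.data_ptr()

with ssd_engine.create_execution_context() as context:
    context.execute(batch, [input_ptr, int(d_ploc), int(d_plabel)])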
