
HowTo - Notes to use pinned host buffers for cuda and tensorrt #88

Closed

bharrisau opened this issue Sep 1, 2023 · 0 comments
Labels: documentation (Improvements or additions to documentation), ep: cuda (related to cuda execution provider), ep: tensorrt


I'm using the v2 branch for this; the steps below are what's currently needed to get cudaHostRegister-pinned host buffers working.

  • Link in the CUDA runtime

```rust
// cudaError_t is a #[repr(u32)] enum in the CUDA headers; a bare alias is enough here.
#[allow(non_camel_case_types)]
pub type cudaError_t = ::std::os::raw::c_uint;

#[link(name = "cudart", kind = "dylib")]
extern "C" {
    pub fn cudaHostRegister(ptr: *mut ::std::os::raw::c_void, size: usize, flags: ::std::os::raw::c_uint) -> cudaError_t;
    pub fn cudaHostUnregister(ptr: *mut ::std::os::raw::c_void) -> cudaError_t;
}
```
  • Allocate your buffer and register it with CUDA

```rust
// cudaHostRegisterDefault is 0 in cuda_runtime_api.h
#[allow(non_upper_case_globals)]
pub const cudaHostRegisterDefault: ::std::os::raw::c_uint = 0;

let mut data1 = vec![0_u8; 16 * 1536 * 2048];
// Pin the allocation so host-to-device copies can DMA directly, without staging.
unsafe { cudaHostRegister(data1.as_mut_ptr() as _, data1.len(), cudaHostRegisterDefault) };
```
  • Create the OrtValue - Value::from_array can't be used here, as it clones the data on every call

```rust
// o_mem and o_value wrap calls to CreateMemoryInfo and CreateTensorWithDataAsOrtValue
let shape = vec![16_i64, 1, 1536, 2048];
let mem_ptr = o_mem(ort::AllocationDevice::CPU, 0, ort::AllocatorType::Device, ort::MemType::CPUInput);
let input_tensor = unsafe { Value::from_raw(o_value(&mut data1, &shape, mem_ptr), session.inner()) };
bind.bind_input("images", input_tensor).unwrap();
bind.run().unwrap();
```
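One detail worth guarding against: `cudaHostUnregister` must run before the `Vec` is freed, or the pinned mapping leaks. A minimal RAII sketch of that lifecycle follows - the `PinnedBuf` type is hypothetical (not part of ort), and the register/unregister calls are injected as closures so the pattern can be shown and exercised without a GPU; in real use they would wrap the unsafe `cudaHostRegister`/`cudaHostUnregister` FFI calls above.

```rust
// Hypothetical RAII guard: registers the buffer on construction and
// unregisters it exactly once on drop, even on early return.
struct PinnedBuf<F: FnMut(*mut u8)> {
    data: Vec<u8>,
    unregister: F,
}

impl<F: FnMut(*mut u8)> PinnedBuf<F> {
    fn new(mut data: Vec<u8>, mut register: impl FnMut(*mut u8), unregister: F) -> Self {
        register(data.as_mut_ptr());
        PinnedBuf { data, unregister }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        &mut self.data
    }
}

impl<F: FnMut(*mut u8)> Drop for PinnedBuf<F> {
    fn drop(&mut self) {
        // Runs before the Vec is freed, matching the register call.
        (self.unregister)(self.data.as_mut_ptr());
    }
}

// Exercise the guard with counting stand-ins for the CUDA calls.
fn demo() -> (u32, u32) {
    use std::cell::Cell;
    let registered = Cell::new(0_u32);
    let unregistered = Cell::new(0_u32);
    {
        let mut buf = PinnedBuf::new(
            vec![0_u8; 1024],
            |_p| registered.set(registered.get() + 1),
            |_p| unregistered.set(unregistered.get() + 1),
        );
        buf.as_mut_slice()[0] = 1;
    } // guard dropped here: unregister fires
    (registered.get(), unregistered.get())
}

fn main() {
    let (r, u) = demo();
    assert_eq!((r, u), (1, 1));
    println!("register calls = {r}, unregister calls = {u}");
}
```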

Performance difference

Using 50 MB input buffers: the pinned buffer saves 1 ms (1.95%), and avoiding the extra copy made by ort::from_array saves 19 ms (27%). The model is a yolov8m with a custom starting layer for debayering and resize, running on a Quadro RTX 4000.

nvprof - compare with/without cudaHostRegister - 100 iterations
    Time   Name
643.54ms   [CUDA memcpy HtoD] TensorRT with
747.56ms   [CUDA memcpy HtoD] TensorRT without
659.65ms   [CUDA memcpy HtoD] CUDA with
760.43ms   [CUDA memcpy HtoD] CUDA without
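For reference, the per-kind memcpy totals above come from nvprof's default GPU activity summary; an invocation along these lines reproduces them (`./bench` is a placeholder for your benchmark binary, which should run the 100 iterations itself):

```shell
# Default nvprof output includes a GPU summary with an
# aggregated [CUDA memcpy HtoD] row per run.
nvprof ./bench
```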

nsys analyze reports the async transfers as PAGED when cudaHostRegister is not used

Criterion results - pinned vs ort::from_raw() vs standard ort::from_array()

forward_mymodel_onnx_cuda_pinned
                        time:   [80.781 ms 80.847 ms 80.910 ms]
forward_mymodel_onnx_cuda_ort_fromraw
                        time:   [81.675 ms 81.856 ms 82.093 ms]
forward_mymodel_onnx_cuda_ort_fromarray
                        time:   [100.94 ms 101.06 ms 101.20 ms]

forward_mymodel_onnx_trt_pinned
                        time:   [49.893 ms 49.950 ms 50.007 ms]
forward_mymodel_onnx_trt_ort_fromraw
                        time:   [50.793 ms 50.943 ms 51.175 ms]
forward_mymodel_onnx_trt_ort_fromarray
                        time:   [69.574 ms 69.701 ms 69.833 ms]