
HowTo - Notes to use pinned host buffers for cuda and tensorrt #88

Closed

bharrisau opened this issue Sep 1, 2023 · 0 comments
Labels: documentation (Improvements or additions to documentation), ep: cuda (related to cuda execution provider), ep: tensorrt


I'm using the v2 branch for this; the steps below are what's currently needed to get cudaHostRegister-pinned host buffers working.

  • Link in the CUDA runtime

```rust
// cudaError_t is a #[repr(u32)] enum in the CUDA headers; a bare alias is enough here.
#[allow(non_camel_case_types)]
pub type cudaError_t = ::std::os::raw::c_uint;

#[link(name = "cudart", kind = "dylib")]
extern "C" {
    pub fn cudaHostRegister(ptr: *mut ::std::os::raw::c_void, size: usize, flags: ::std::os::raw::c_uint) -> cudaError_t;
    pub fn cudaHostUnregister(ptr: *mut ::std::os::raw::c_void) -> cudaError_t;
}
```
  • Allocate your buffer and register it with CUDA

```rust
// cudaHostRegisterDefault is 0 in cuda_runtime_api.h
#[allow(non_upper_case_globals)]
pub const cudaHostRegisterDefault: ::std::os::raw::c_uint = 0;

let mut data1 = vec![0_u8; 16 * 1536 * 2048];
// Pin the allocation so host-to-device copies can DMA directly, without staging.
unsafe { cudaHostRegister(data1.as_mut_ptr() as _, data1.len(), cudaHostRegisterDefault) };
```
  • Create the OrtValue - Value::from_array can't be used here, as it clones the data on every call

```rust
// o_mem and o_value wrap calls to CreateMemoryInfo and CreateTensorWithDataAsOrtValue
let shape = vec![16_i64, 1, 1536, 2048];
let mem_ptr = o_mem(ort::AllocationDevice::CPU, 0, ort::AllocatorType::Device, ort::MemType::CPUInput);
let input_tensor = unsafe { Value::from_raw(o_value(&mut data1, &shape, mem_ptr), session.inner()) };
bind.bind_input("images", input_tensor).unwrap();
bind.run().unwrap();
```
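One detail worth guarding against: `cudaHostUnregister` must run before the `Vec` is freed, or the pinned mapping leaks. A minimal RAII sketch of that lifecycle follows - the `PinnedBuf` type is hypothetical (not part of ort), and the register/unregister calls are injected as closures so the pattern can be shown and exercised without a GPU; in real use they would wrap the unsafe `cudaHostRegister`/`cudaHostUnregister` FFI calls above.

```rust
// Hypothetical RAII guard: registers the buffer on construction and
// unregisters it exactly once on drop, even on early return.
struct PinnedBuf<F: FnMut(*mut u8)> {
    data: Vec<u8>,
    unregister: F,
}

impl<F: FnMut(*mut u8)> PinnedBuf<F> {
    fn new(mut data: Vec<u8>, mut register: impl FnMut(*mut u8), unregister: F) -> Self {
        register(data.as_mut_ptr());
        PinnedBuf { data, unregister }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        &mut self.data
    }
}

impl<F: FnMut(*mut u8)> Drop for PinnedBuf<F> {
    fn drop(&mut self) {
        // Runs before the Vec is freed, matching the register call.
        (self.unregister)(self.data.as_mut_ptr());
    }
}

// Exercise the guard with counting stand-ins for the CUDA calls.
fn demo() -> (u32, u32) {
    use std::cell::Cell;
    let registered = Cell::new(0_u32);
    let unregistered = Cell::new(0_u32);
    {
        let mut buf = PinnedBuf::new(
            vec![0_u8; 1024],
            |_p| registered.set(registered.get() + 1),
            |_p| unregistered.set(unregistered.get() + 1),
        );
        buf.as_mut_slice()[0] = 1;
    } // guard dropped here: unregister fires
    (registered.get(), unregistered.get())
}

fn main() {
    let (r, u) = demo();
    assert_eq!((r, u), (1, 1));
    println!("register calls = {r}, unregister calls = {u}");
}
```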

Performance difference

Using 50 MB input buffers: the pinned buffer saves 1 ms (1.95%), and avoiding the extra copy made by ort::from_array saves 19 ms (27%). The model is a yolov8m with a custom starting layer for debayering and resize, running on a Quadro RTX 4000.

nvprof - compare with/without cudaHostRegister - 100 iterations
    Time   Name
643.54ms   [CUDA memcpy HtoD] TensorRT with
747.56ms   [CUDA memcpy HtoD] TensorRT without
659.65ms   [CUDA memcpy HtoD] CUDA with
760.43ms   [CUDA memcpy HtoD] CUDA without
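For reference, the per-kind memcpy totals above come from nvprof's default GPU activity summary; an invocation along these lines reproduces them (`./bench` is a placeholder for your benchmark binary, which should run the 100 iterations itself):

```shell
# Default nvprof output includes a GPU summary with an
# aggregated [CUDA memcpy HtoD] row per run.
nvprof ./bench
```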

nsys analyze reports the async transfers as PAGED when cudaHostRegister is not used

Criterion results - pinned vs ort::from_raw() vs standard ort::from_array()

forward_mymodel_onnx_cuda_pinned
                        time:   [80.781 ms 80.847 ms 80.910 ms]
forward_mymodel_onnx_cuda_ort_fromraw
                        time:   [81.675 ms 81.856 ms 82.093 ms]
forward_mymodel_onnx_cuda_ort_fromarray
                        time:   [100.94 ms 101.06 ms 101.20 ms]

forward_mymodel_onnx_trt_pinned
                        time:   [49.893 ms 49.950 ms 50.007 ms]
forward_mymodel_onnx_trt_ort_fromraw
                        time:   [50.793 ms 50.943 ms 51.175 ms]
forward_mymodel_onnx_trt_ort_fromarray
                        time:   [69.574 ms 69.701 ms 69.833 ms]