[Bug] Memory Allocation Anomaly Across Devices in OrtCUDAProviderOptions #20544
Comments
Ok, I think I understand what's going on there. I had expected the C API's `UpdateCUDAProviderOptions` to update the existing options struct, but it re-initializes the struct to the defaults and then applies only the supplied values.

Thanks for the clarification!

It is possible in the native code to pass multiple options at once, but not with how I've written the Java binding to that native code. The Java object tracks all the options that are set, so I need to modify the binding to pass them all through in a single native call. With respect to the memory allocation on GPU zero, that might be an artifact of how CUDA & ORT work; I think the primary GPU tends to end up with some driver & code related stuff in general, but someone with more CUDA expertise might be able to help there.
I did additional tests regarding the memory allocation problem on gpu `0`. With:

```java
int status = cudart.cudaSetDevice(6); // set cuda device
checkCuda(cudart.CUDA_SUCCESS, status, "cudaSetDevice"); // check for exception
```

the ORT code still allocates on gpu `0`. With:

```java
OrtCUDAProviderOptions cudaOptions = new OrtCUDAProviderOptions();
cudaOptions.add("device_id", String.valueOf(6));
options.addCUDA(cudaOptions);
```

the problem above is solved, and nothing is allocated on gpu `0`. In other words, I assume there is also a bug where ORT starts allocating on gpu `0`.
Additionally, without the cuda code:

```java
int status = cudart.cudaSetDevice(6); // set cuda device
checkCuda(cudart.CUDA_SUCCESS, status, "cudaSetDevice"); // check for exception
```

When gpu …
### Description
I misunderstood how `UpdateCUDAProviderOptions` and `UpdateTensorRTProviderOptions` work in the C API: I had assumed that they updated the options struct, but they re-initialize the struct to the defaults and then apply only the values passed in the update call. I've rewritten the Java bindings for those classes so that they aggregate all the updates and apply them in one go. I also updated the C API documentation to note that these functions have this behaviour. I've not checked whether any of the other providers with an options struct behave the same way; we only expose CUDA's and TensorRT's options in Java.

There's a small unrelated update to add a private constructor to the Fp16Conversions classes to remove a documentation warning (they shouldn't be instantiated anyway, as they are utility classes containing static methods).

### Motivation and Context
Fixes #20544.
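The replace-then-apply semantics described above can be sketched with a toy model. This is a hypothetical illustration, not the actual ORT classes: `ProviderOptionsModel` and `nativeUpdate` are invented names standing in for the native `UpdateCUDAProviderOptions` behaviour, and the two keys shown are the ones discussed in this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the C API's update semantics: each update call RESETS the
// struct to defaults and then applies only the supplied keys, rather than
// merging with previously set values.
class ProviderOptionsModel {
    static final Map<String, String> DEFAULTS = Map.of(
            "device_id", "0",
            "cudnn_conv_algo_search", "EXHAUSTIVE");

    private Map<String, String> struct = new HashMap<>(DEFAULTS);

    // Mimics the native update call.
    void nativeUpdate(Map<String, String> updates) {
        struct = new HashMap<>(DEFAULTS); // re-initialize to defaults
        struct.putAll(updates);           // then apply only these keys
    }

    String get(String key) {
        return struct.get(key);
    }
}

public class Demo {
    public static void main(String[] args) {
        ProviderOptionsModel opts = new ProviderOptionsModel();

        // Naive binding: one native call per option -- the second call
        // silently resets the key set by the first.
        opts.nativeUpdate(Map.of("cudnn_conv_algo_search", "DEFAULT"));
        opts.nativeUpdate(Map.of("device_id", "6"));
        System.out.println(opts.get("cudnn_conv_algo_search")); // EXHAUSTIVE (lost!)

        // Fixed binding: aggregate all pending options, apply in one call.
        Map<String, String> pending = new HashMap<>();
        pending.put("cudnn_conv_algo_search", "DEFAULT");
        pending.put("device_id", "6");
        opts.nativeUpdate(pending);
        System.out.println(opts.get("cudnn_conv_algo_search")); // DEFAULT
        System.out.println(opts.get("device_id"));              // 6
    }
}
```

This also explains the shadowing seen in example 3 of the report: with one native call per `add`, the last call wins and every earlier option silently reverts to its default.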
Describe the issue
I am sharing below my `OrtCUDAProviderOptions`, which I use to set the gpu device to use for computation on a server with multiple GPUs. When setting the `deviceId`, I encounter buggy memory allocations.

**Example 1.** My original `OrtCUDAProviderOptions` setup discards the `deviceId` being set to `6` and takes `0` instead.

**Example 2.** Removing the line `cudaOptions.add("cudnn_conv_algo_search", "DEFAULT");` selects the correct gpu, but results in `545MiB` being allocated on device `0` without utilizing that device.

**Example 3.** Keeping `cudaOptions.add("cudnn_conv_algo_search", "DEFAULT");` but adding `cudaOptions.add("device_id", String.valueOf(6));` to select the device, instead of specifying it directly in the constructor, gives the same result as example 2.

To confirm whether `cudaOptions.add("cudnn_conv_algo_search", "DEFAULT");` is being executed or ignored in example 3, I did some experiments, and it turned out it is not considered any more: it is neglected/shadowed out by the `cudaOptions.add("device_id", String.valueOf(6));` added afterwards.

There are two problems here:
1. `cudaOptions.add("cudnn_conv_algo_search", "DEFAULT");` results in selecting the wrong device, in this case device `0`, all the time.
2. `545MiB` is allocated on device `0`, even though the correct `deviceId` has been utilized for the computation.

As a workaround, I am exporting only one visible cuda device to avoid this problem.
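The workaround mentioned above (exporting only one visible cuda device) would look roughly like this in a shell, assuming the target card is index 6 as in the examples:

```shell
# Expose only GPU 6 to the process; inside the process that card is then
# enumerated as device 0, so ORT's default device_id of 0 maps to it and
# nothing can be allocated on the real device 0.
export CUDA_VISIBLE_DEVICES=6
```

The launched JVM then sees a single CUDA device, which sidesteps both the wrong-device selection and the stray allocation on device 0.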
To reproduce
Unfortunately model cannot be provided, but can write a toy example+model and supply if needed.
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04.4 LTS
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
v1.17.3
ONNX Runtime API
Java
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
cuda: 11.2, cudnn: 8.1.1
Model File
No response
Is this a quantized model?
No