
[java] CUDA & TensorRT options fix #20549

Merged
6 commits merged into microsoft:main on May 5, 2024

Conversation

@Craigacp (Contributor) commented May 3, 2024

Description

I misunderstood how UpdateCUDAProviderOptions and UpdateTensorRTProviderOptions work in the C API: I had assumed that they updated the existing options struct, but they actually re-initialize the struct to its defaults and then apply only the values supplied in that update. I've rewritten the Java bindings for those classes so that they aggregate all the updates and apply them in one go. I've also updated the C API documentation to note this behaviour. I haven't checked whether any of the other providers with an options struct behave the same way; only the CUDA and TensorRT options are exposed in Java.
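
For illustration, here is a minimal C sketch of the pitfall and of the aggregate-then-apply approach described above. It assumes the standard ORT C API entry points (CreateCUDAProviderOptions, UpdateCUDAProviderOptions, ReleaseCUDAProviderOptions) accessed through the OrtApi table; the option keys are just examples and error handling is elided, so this is not the exact code from the PR.

```c
// Sketch only: shows why per-key calls to UpdateCUDAProviderOptions lose
// earlier values, and why the Java bindings now batch every update into a
// single call. Assumes the documented OrtApi signatures; not the PR's code.
#include <stddef.h>
#include "onnxruntime_c_api.h"

void configure_cuda_options(const OrtApi* api) {
  OrtCUDAProviderOptionsV2* opts = NULL;
  api->CreateCUDAProviderOptions(&opts);

  // Problematic pattern: each call re-initializes the struct to defaults
  // before applying its keys, so "device_id" from the first call is lost.
  const char* key1[] = {"device_id"};
  const char* val1[] = {"1"};
  api->UpdateCUDAProviderOptions(opts, key1, val1, 1);
  const char* key2[] = {"gpu_mem_limit"};
  const char* val2[] = {"2147483648"};
  api->UpdateCUDAProviderOptions(opts, key2, val2, 1);

  // Aggregated pattern: collect every key/value pair and apply them in one
  // call, which is what the rewritten Java bindings now do internally.
  const char* keys[] = {"device_id", "gpu_mem_limit"};
  const char* values[] = {"1", "2147483648"};
  api->UpdateCUDAProviderOptions(opts, keys, values, 2);

  api->ReleaseCUDAProviderOptions(opts);
}
```

UpdateTensorRTProviderOptions has the same reset-then-apply behaviour, so the Java TensorRT options class aggregates its updates in the same way.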

There's a small unrelated update to add a private constructor to the Fp16Conversions classes to remove a documentation warning (they shouldn't be instantiated anyway as they are utility classes containing static methods).

Motivation and Context

Fixes #20544.

@jywu-msft (Member):

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@jywu-msft (Member):

/azp run Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 10 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@jywu-msft added the api:Java, ep:TensorRT, ep:CUDA, and release:1.18.0 labels May 4, 2024
@jywu-msft (Member):

FYI: build warnings are treated as errors.

##[error]java\src\main\native\ai_onnxruntime_providers_OrtTensorRTProviderOptions.c(49,71): Error C2220: the following warning is treated as an error
101>C:\a_work\1\s\java\src\main\native\ai_onnxruntime_providers_OrtTensorRTProviderOptions.c(49,71): error C2220: the following warning is treated as an error [C:\a_work\1\b\Debug\onnxruntime4j_jni.vcxproj]
##[warning]java\src\main\native\ai_onnxruntime_providers_OrtTensorRTProviderOptions.c(49,71): Warning C4267: 'function': conversion from 'size_t' to 'jsize', possible loss of data
101>C:\a_work\1\s\java\src\main\native\ai_onnxruntime_providers_OrtTensorRTProviderOptions.c(49,71): warning C4267: 'function': conversion from 'size_t' to 'jsize', possible loss of data [C:\a_work\1\b\Debug\onnxruntime4j_jni.vcxproj]

     C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\bin\HostX64\x64\CL.exe /c /I"C:\a\_work\1\b\Debug\_deps\utf8_range-src" /IC:\a\_work\1\s\include\onnxruntime /IC:\a\_work\1\s\include\onnxruntime\core\session /I"C:\a\_work\1\b\Debug\_deps\pytorch_cpuinfo-src\include" /IC:\a\_work\1\b\Debug /IC:\a\_work\1\s\onnxruntime /I"C:\a\_work\1\b\Debug\_deps\date-src\include" /I"C:\a\_work\1\b\Debug\_deps\flatbuffers-src\include" /IC:\a\_work\1\s\onnxruntime\test\util\include /I"C:\a\_work\1\b\Debug\_deps\eigen-src" /I"C:\a\_work\1\b\Debug\_deps\safeint-src" /I"C:\a\_work\1\b\Debug\_deps\re2-src" /I"C:\a\_work\1\b\Debug\_deps\microsoft_wil-src\include" /I"C:\a\_work\1\b\Debug\_deps\nlohmann_json-src\single_include" /I"C:\a\_work\1\b\Debug\_deps\onnx-src" /I"C:\a\_work\1\b\Debug\_deps\onnx-build" /I"C:\a\_work\1\b\Debug\_deps\mp11-src\include" /I"C:\a\_work\1\b\Debug\_deps\pytorch_clog-src\deps\clog\include" /Zi /nologo /W4 /WX /diagnostics:column /sdl /FS /Od /Ob0 /D _MBCS /D WIN32 /D _WINDOWS /D WINAPI_FAMILY=100 /D WINVER=0x0A00 /D _WIN32_WINNT=0x0A00 /D NTDDI_VERSION=0x0A000000 /D ONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH /D EIGEN_HAS_C99_MATH /D CPUINFO_SUPPORTED /D CPUINFO_SUPPORTED_PLATFORM=1 /D EIGEN_USE_THREADS /D PLATFORM_WINDOWS /D NOGDI /D NOMINMAX /D _USE_MATH_DEFINES /D _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS /D ONLY_C_LOCALE=0 /D SKIP_DEFAULT_LOGGER_TESTS /D ORT_ENABLE_STREAM /D EIGEN_MPL2_ONLY /D EIGEN_HAS_CONSTEXPR /D EIGEN_HAS_VARIADIC_TEMPLATES /D EIGEN_HAS_CXX11_MATH /D EIGEN_HAS_CXX11_ATOMIC /D EIGEN_STRONG_INLINE=inline /D ORT_RUN_EXTERNAL_ONNX_TESTS /D GTEST_HAS_ABSL=1 /D JSON_USE_IMPLICIT_CONVERSIONS=1 /D JSON_DIAGNOSTICS=0 /D ONNX_NAMESPACE=onnx /D ONNX_ML=1 /D ONNX_USE_LITE_PROTO=1 /D __ONNX_NO_DOC_STRINGS /D "CMAKE_INTDIR=\"Debug\"" /Gm- /EHsc /RTC1 /MDd /GS /guard:cf /fp:precise /Qspectre /Zc:wchar_t /Zc:forScope /Zc:inline /GR /std:c++17 /Fo"onnxruntime_test_all.dir\Debug\\" /Fd"onnxruntime_test_all.dir\Debug\vc143.pdb" /external:W0 /external:templates- /Gd /TP /wd26812 /wd4251 /wd4201 /wd4324 /wd5054 /wd6330 /wd4127 /wd6326 /wd26409 /wd26426 /wd26451 /wd4244 /errorReport:queue  /external:I "C:/a/_work/1/b/Debug/_deps/gsl-src/include" /external:I "C:/a/_work/1/b/Debug/installed/include" /external:I "C:/a/_work/1/b/Debug/_deps/googletest-src/googletest/include" /external:I "C:/a/_work/1/b/Debug/_deps/googletest-src/googletest" /external:I "C:/a/_work/1/b/Debug/_deps/googletest-src/googlemock/include" /external:I "C:/a/_work/1/b/Debug/_deps/googletest-src/googlemock" /external:IC:\a\_work\1\b\Debug\_deps\microsoft_wil-src\include /external:IC:\a\_work\1\b\Debug\_deps\nlohmann_json-src\single_include\ /external:IC:\a\_work\1\b\Debug\_deps\onnx-src /external:IC:\a\_work\1\b\Debug\_deps\onnx-build /external:IC:\a\_work\1\b\Debug\installed\include /external:IC:\a\_work\1\b\Debug\_deps\re2-src /external:IC:\a\_work\1\b\Debug\_deps\mp11-src\include /external:IC:\a\_work\1\b\Debug\_deps\safeint-src /external:IC:\a\_work\1\b\Debug\_deps\flatbuffers-src\include /external:IC:\a\_work\1\b\Debug\_deps\gsl-src\include /external:IC:\a\_work\1\b\Debug\_deps\date-src\include /external:IC:\a\_work\1\b\Debug\_deps\pytorch_clog-src\deps\clog\include /external:IC:\a\_work\1\b\Debug\_deps\pytorch_cpuinfo-src\include /utf-8 /experimental:external /external:IC:/a/_work/1/s/cmake /external:IC:/a/_work/1/b/Debug /w15038 "C:\a\_work\1\s\onnxruntime\test\framework\bfc_arena_test.cc"

##[error]java\src\main\native\ai_onnxruntime_providers_OrtCUDAProviderOptions.c(50,71): Error C2220: the following warning is treated as an error
101>C:\a_work\1\s\java\src\main\native\ai_onnxruntime_providers_OrtCUDAProviderOptions.c(50,71): error C2220: the following warning is treated as an error [C:\a_work\1\b\Debug\onnxruntime4j_jni.vcxproj]
##[warning]java\src\main\native\ai_onnxruntime_providers_OrtCUDAProviderOptions.c(50,71): Warning C4267: 'function': conversion from 'size_t' to 'jsize', possible loss of data
101>C:\a_work\1\s\java\src\main\native\ai_onnxruntime_providers_OrtCUDAProviderOptions.c(50,71): warning C4267: 'function': conversion from 'size_t' to 'jsize', possible loss of data [C:\a_work\1\b\Debug\onnxruntime4j_jni.vcxproj]
##[warning]java\src\main\native\ai_onnxruntime_providers_OrtCUDAProviderOptions.c(52,75): Warning C4267: 'function': conversion from 'size_t' to 'jsize', possible loss of data
101>C:\a_work\1\s\java\src\main\native\ai_onnxruntime_providers_OrtCUDAProviderOptions.c(52,75): warning C4267: 'function': conversion from 'size_t' to 'jsize', possible loss of data [C:\a_work\1\b\Debug\onnxruntime4j_jni.vcxproj]
##[warning]java\src\main\native\ai_onnxruntime_providers_OrtCUDAProviderOptions.c(59,71): Warning C4267: 'function': conversion from 'size_t' to 'jsize', possible loss of data
101>C:\a_work\1\s\java\src\main\native\ai_onnxruntime_providers_OrtCUDAProviderOptions.c(59,71): warning C4267: 'function': conversion from 'size_t' to 'jsize', possible loss of data [C:\a_work\1\b\Debug\onnxruntime4j_jni.vcxproj]
##[warning]java\src\main\native\ai_onnxruntime_providers_OrtCUDAProviderOptions.c(61,73): Warning C4267: 'function': conversion from 'size_t' to 'jsize', possible loss of data
101>C:\a_w
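
C4267 here flags an implicit size_t-to-jsize narrowing where the JNI code builds a Java array from a size_t count; the usual remedy is an explicit cast once the count is known to fit. The sketch below illustrates that kind of change with made-up names, not the actual patch:

```c
// Illustrative sketch of a C4267 fix in JNI code: cast the size_t count to
// jsize explicitly. Function and variable names are hypothetical; this is
// not the code from ai_onnxruntime_providers_OrtCUDAProviderOptions.c.
#include <stddef.h>
#include <jni.h>

static jobjectArray makeStringArray(JNIEnv* env, const char** strings, size_t count) {
  jclass stringClass = (*env)->FindClass(env, "java/lang/String");
  // Explicit narrowing cast removes the size_t -> jsize warning.
  jobjectArray array = (*env)->NewObjectArray(env, (jsize)count, stringClass, NULL);
  for (size_t i = 0; i < count; i++) {
    jstring element = (*env)->NewStringUTF(env, strings[i]);
    (*env)->SetObjectArrayElement(env, array, (jsize)i, element);
  }
  return array;
}
```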

@Craigacp (Contributor, Author) commented May 5, 2024

Ok, I've pushed out a fix for the compile error.

@jywu-msft (Member):

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@jywu-msft (Member):

/azp run Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 10 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@jywu-msft (Member):

/azp run Linux OpenVINO CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@jywu-msft merged commit a366920 into microsoft:main May 5, 2024
115 of 119 checks passed
@Craigacp deleted the java-cuda-opts-fix branch May 6, 2024 00:05
@sophies927 added the triage:approved (Approved for cherrypicks for release) label May 6, 2024
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024
@yihonglyu added the cherry-picked (Cherry-picked for a cherrypicks branch) label May 9, 2024
yihonglyu pushed a commit that referenced this pull request May 9, 2024
@yihonglyu added the rel-merged (Cherrypicks merged into release) label May 10, 2024
Labels
api:Java (issues related to the Java API)
cherry-picked (Cherry-picked for a cherrypicks branch)
ep:CUDA (issues related to the CUDA execution provider)
ep:TensorRT (issues related to TensorRT execution provider)
rel-merged (Cherrypicks merged into release)
release:1.18.0
triage:approved (Approved for cherrypicks for release)
Development

Successfully merging this pull request may close these issues.

[Bug] Memory Allocation Anomaly Across Devices in OrtCUDAProviderOptions
4 participants