Dynamically loading `libnvidia-ml.so.1` instead of directly linking #313

mdemoret-nv · 2023-04-12T17:01:51Z

Currently, we install the driver into one of the CI images to allow for stub generation during compilation. This was needed because `device_info.cpp` linked directly against `CUDA::nvml`, which created a link-time dependency on a driver library. That dependency is problematic when building with CPU-only Docker images.

Instead, we now dynamically load `libnvidia-ml.so.1` (the `.so.1` suffix avoids collisions with the stub file `libnvidia-ml.so`) and resolve the necessary functions at runtime. If the library is not found, GPU use is disabled. This allows loading of the library for stub generation without needing a GPU.
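
For readers unfamiliar with the pattern, the runtime-loading approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in `device_info.cpp`: the `NvmlApi` struct and `load_nvml` function are hypothetical names, only two NVML entry points (`nvmlInit_v2` and `nvmlShutdown`) are resolved, and NVML's `nvmlReturn_t` enum is modelled as `int` so the sketch builds without `nvml.h`.

```cpp
#include <dlfcn.h>

#include <string>

// Assumption: nvmlReturn_t is modelled as int so this builds without nvml.h.
using nvml_return_t      = int;
using nvml_init_fn_t     = nvml_return_t (*)();
using nvml_shutdown_fn_t = nvml_return_t (*)();

// Hypothetical holder for the library handle and the resolved symbols.
struct NvmlApi
{
    void* handle                = nullptr;
    nvml_init_fn_t init         = nullptr;
    nvml_shutdown_fn_t shutdown = nullptr;

    // True only if the library was opened and every symbol was resolved.
    bool available() const
    {
        return handle != nullptr && init != nullptr && shutdown != nullptr;
    }
};

// Tries to open the driver library by its versioned soname and resolve the
// functions we need. On any failure an empty NvmlApi is returned, which the
// caller can treat as "no GPU support".
NvmlApi load_nvml(const std::string& soname = "libnvidia-ml.so.1")
{
    NvmlApi api;

    api.handle = dlopen(soname.c_str(), RTLD_LAZY | RTLD_LOCAL);
    if (api.handle == nullptr)
    {
        return api;  // Library not present, e.g. a CPU-only image
    }

    api.init     = reinterpret_cast<nvml_init_fn_t>(dlsym(api.handle, "nvmlInit_v2"));
    api.shutdown = reinterpret_cast<nvml_shutdown_fn_t>(dlsym(api.handle, "nvmlShutdown"));

    if (!api.available())
    {
        // A symbol was missing: close the library and report it as unavailable
        dlclose(api.handle);
        api = NvmlApi{};
    }

    return api;
}
```

A caller would check `load_nvml().available()` once at startup and skip GPU discovery when it returns `false`. On older glibc versions the program also needs to link with `-ldl`.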

dagardner-nv

LGTM. I wonder whether there is any advantage to keeping `container` and `test_container`, in case we ever do need additional packages for either the build or the test.

Similarly, I think we should update the `DOCKER_TARGET` array in the `external/utilities/ci/runner/build_and_push.sh` script to:

DOCKER_TARGET=${DOCKER_TARGET:-"build" "test"}

even if, for MRC, the `build` and `test` targets remain aliases for `base`.

mdemoret-nv · 2023-04-12T22:59:02Z

Good point. Renamed the targets to `build` and `test`, with the matching changes here: nv-morpheus/utilities#30

cpp/mrc/src/internal/system/device_info.cpp

codecov · 2023-04-13T01:58:53Z

Codecov Report

Merging #313 (62e1ca4) into branch-23.07 (c3f67c0) will decrease coverage by 0.04%.
The diff coverage is 56.60%.

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-23.07     #313      +/-   ##
================================================
- Coverage         73.28%   73.24%   -0.04%     
================================================
  Files               390      390              
  Lines             13352    13379      +27     
  Branches           1006     1008       +2     
================================================
+ Hits               9785     9800      +15     
- Misses             3567     3579      +12

| Flag | Coverage Δ | |
|---|---|---|
| cpp | `69.07% <56.60%> (-0.04%)` | ⬇️ |
| py | `42.22% <56.60%> (+0.04%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| cpp/mrc/src/tests/test_topology.cpp | `98.41% <ø> (-0.05%)` | ⬇️ |
| cpp/mrc/src/internal/system/topology.cpp | `83.33% <50.00%> (-0.72%)` | ⬇️ |
| cpp/mrc/src/internal/system/device_info.cpp | `55.55% <56.86%> (+2.61%)` | ⬆️ |

... and 1 file with indirect coverage changes

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c3f67c0...62e1ca4.

mdemoret-nv · 2023-04-13T02:53:33Z

/merge

@jjacobelli

This PR follows the MRC PR nv-morpheus/MRC#313, which removes the explicit dependency on `libnvidia-ml.so`, so the driver no longer needs to be installed in our CI runner. This allows us to use the `gpu-v100-latest-1` tag to stay up to date with the CI images. @jjacobelli for Viz

Authors:
- Michael Demoret (https://github.com/mdemoret-nv)

Approvers:
- David Gardner (https://github.com/dagardner-nv)

URL: #877

Dynamically loading libnvidia-ml.so.1 instead of directly linking

1d17bda

mdemoret-nv added the bug and non-breaking labels Apr 12, 2023

mdemoret-nv requested a review from a team as a code owner April 12, 2023 17:01

mdemoret-nv added 2 commits April 12, 2023 11:15

Updating to remove uses of the driver CI image

6edbd40

IWYU fixes

b500aab

mdemoret-nv requested a review from a team as a code owner April 12, 2023 17:22

Removing references to driver CI image

99abb70

mdemoret-nv added the 3 - Ready for Review label Apr 12, 2023

dagardner-nv approved these changes Apr 12, 2023

mdemoret-nv added 2 commits April 12, 2023 15:18

Reinstating multiple CI runner targets to build/test

e0915d4

Adding additional comments

87d76ff

drobison00 reviewed Apr 13, 2023

cpp/mrc/src/internal/system/device_info.cpp

Adding a simple counter to avoid accidentally calling dlclose too early

62e1ca4
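
The idea behind this commit, counting users of the library so that `dlclose` only runs once the last one is done, might look something like the sketch below. It is an illustration under assumptions, not the code from 62e1ca4: the `SharedLibraryRef` class and its method names are hypothetical.

```cpp
#include <dlfcn.h>

#include <mutex>

// Hypothetical wrapper: opens the library on the first acquire() and closes
// it only when the matching final release() arrives.
class SharedLibraryRef
{
  public:
    // Passing nullptr asks dlopen for a handle to the main program itself.
    explicit SharedLibraryRef(const char* soname) : m_soname(soname) {}

    // Returns false if the library could not be opened.
    bool acquire()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_count == 0)
        {
            m_handle = dlopen(m_soname, RTLD_LAZY | RTLD_LOCAL);
            if (m_handle == nullptr)
            {
                return false;
            }
        }
        ++m_count;
        return true;
    }

    // Unbalanced calls are ignored, so an extra release() cannot close the
    // library while other users still hold it.
    void release()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_count == 0)
        {
            return;
        }
        if (--m_count == 0)
        {
            dlclose(m_handle);
            m_handle = nullptr;
        }
    }

    bool is_open() const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_handle != nullptr;
    }

  private:
    const char* m_soname;
    mutable std::mutex m_mutex;
    void* m_handle = nullptr;
    int m_count    = 0;
};
```

Each component that needs the library pairs one `acquire()` with one `release()`. The handle stays valid until the count drops back to zero.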

drobison00 approved these changes Apr 13, 2023

mdemoret-nv mentioned this pull request Apr 13, 2023

Removing explicit driver install from CI runner nv-morpheus/Morpheus#877

Merged

rapids-bot bot merged commit 25d9ca8 into nv-morpheus:branch-23.07 Apr 13, 2023

mdemoret-nv deleted the mdd_remove-nvml-direct-dep branch April 13, 2023 02:53
