Consider externalizing the nvidia GPU device plugin, to ease targeting alpine and musl-based Linux distributions, and avoid incidental interactions in hosts with old nvidia drivers.
Background
The nvidia plugin depends on https://github.com/NVIDIA/gpu-monitoring-tools, a cgo wrapper library for invoking the nvml C-based API that monitors and manages the nvidia drivers. The nvml C library needs to be a shared dynamic library; it's not appropriate for static linking because it depends on the installed drivers. To avoid requiring the library to be present or installed, the wrapper uses dynamic linking flags that ignore unresolved symbols: -ldl -Wl,--unresolved-symbols=ignore-in-object-files
This presents a few challenges:
First, CGO complicates static compilation and cross-compilation. When targeting Linux we already require CGO for libcontainer, though, so that part is unavoidable.
Second, the dynamic linking technique doesn't work on alpine and musl-based Linux distributions. See #5535 and #5643 for more details.
Third, it makes the nomad binary fragile and susceptible to faults caused by incompatible nvidia drivers being installed or by environment variables being set unexpectedly (e.g. LD_BIND_NOW).
Recommendation
Given that the nvidia driver is not a dominant use case, I'd recommend externalizing the nvidia plugin. This will simplify our compilation flow and make the binary more robust. The cost is that operators who need the nvidia driver will have to download one more binary, which isn't significant.