Nomad 1.0.3 nvml issues. #10012

Closed
remus-corneliu opened this issue Feb 11, 2021 · 8 comments

Comments

@remus-corneliu

I'm trying to run Nomad in a Docker container. Version 0.8.7 starts up correctly, while the latest version, 1.0.3, fails with the following errors:

Error relocating /bin/nomad: nvmlDeviceGetGraphicsRunningProcesses: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlEventSetWait: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetPerformanceState: symbol not found
nomad-server | Error relocating /bin/nomad: __snprintf_chk: symbol not found
nomad-server | Error relocating /bin/nomad: __vfprintf_chk: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlSystemGetDriverVersion: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetPowerManagementLimit: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetMemoryErrorCounter: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetUUID: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlInit_v2: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetMaxClockInfo: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetUtilizationRates: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetMinorNumber: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlEventSetCreate: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetHandleByIndex_v2: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetMemoryInfo: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetAccountingMode: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetPciInfo_v3: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlSystemGetProcessName: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetPersistenceMode: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetClockInfo: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetComputeRunningProcesses: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetMaxPcieLinkWidth: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetAccountingBufferSize: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetCount_v2: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceRegisterEvents: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetPowerUsage: symbol not found
nomad-server | Error relocating /bin/nomad: __vsnprintf_chk: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetDisplayActive: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetEncoderUtilization: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetTemperature: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlEventSetFree: symbol not found
nomad-server | Error relocating /bin/nomad: __longjmp_chk: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetDisplayMode: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetBAR1MemoryInfo: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlShutdown: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetCurrentClocksThrottleReasons: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlErrorString: symbol not found
nomad-server | Error relocating /bin/nomad: __dprintf_chk: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetDecoderUtilization: symbol not found
nomad-server | Error relocating /bin/nomad: __fprintf_chk: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetMaxPcieLinkGeneration: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetPcieThroughput: symbol not found
nomad-server | Error relocating /bin/nomad: nvmlDeviceGetName: symbol not found

A link to the repository https://github.com/remus-corneliu/nomad

Local environment setup:
Docker for Windows v20.10.2 with WSL2 support, running an Ubuntu 18.04 distro

@tgross
Member

tgross commented Feb 11, 2021

Hi @remus-corneliu!

I'm trying to run nomad in a docker container.

This is not a supported environment for running Nomad clients. It's technically possible, but it's fraught with very low-level details that you'll need to get right: running the container with the CAP_NET_ADMIN and CAP_SYS_ADMIN capabilities, providing the iptables/nftables binaries, and satisfying the system library dependencies for CGO:

$ ldd $(which nomad)
        linux-vdso.so.1 =>  (0x00007fff923c4000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f6585515000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6585311000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6584f47000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f6585732000)
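
Whether those glibc dependencies can be satisfied depends on which C library the image ships. The following is a minimal sketch of a check, assuming the usual loader locations (`/lib/ld-musl-*.so.1` for musl, as on Alpine, and `ld-linux-*.so.*` for glibc). The function name `detect_libc` is made up for this example.

```shell
# Report which C library loader a root filesystem provides.
# $1: root directory to inspect (empty or omitted means the running system).
# musl is checked first because Alpine images with a glibc compatibility
# layer can contain both loaders.
detect_libc() {
    root="${1:-}"
    for f in "$root"/lib/ld-musl-*.so.1; do
        if [ -e "$f" ]; then
            echo "musl"
            return 0
        fi
    done
    for f in "$root"/lib64/ld-linux-*.so.* \
             "$root"/lib/ld-linux-*.so.* \
             "$root"/lib/*-linux-gnu/ld-linux-*.so.*; do
        if [ -e "$f" ]; then
            echo "glibc"
            return 0
        fi
    done
    echo "unknown"
}

# Usage inside the container:
#   detect_libc
```

If this prints `musl` while `ldd` on the binary asks for `/lib64/ld-linux-x86-64.so.2`, the binary and the image disagree about the C library.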

Error relocating /bin/nomad: nvmlDeviceGetGraphicsRunningProcesses: symbol not found

This looks like Nomad somehow managed to fingerprint an Nvidia GPU... I'm not sure what PCI pass-through is going to look like in the "turducken" environment of nested virtualization layers you have there. But it also looks like the container doesn't meet the installation requirements. This would have worked in pre-0.9.0 versions of Nomad, which didn't have the device drivers.

You can disable the plugin via the following config:

plugin "nvidia-gpu" {
  config {
    enabled = false
  }
}
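
One way to apply this stanza is to keep it in its own file and point the agent at it with `-config`. The following is a minimal sketch, assuming the agent binary starts far enough to read its configuration. The helper name and the `/etc/nomad.d` path are placeholders for this example, not taken from the thread.

```shell
# Write the plugin stanza that disables the Nvidia device plugin to the file
# named by $1 (overwrites the file if it already exists).
write_nvidia_disable_config() {
    cat > "$1" <<'EOF'
plugin "nvidia-gpu" {
  config {
    enabled = false
  }
}
EOF
}

# Hypothetical usage; /etc/nomad.d is assumed to hold the rest of the
# agent configuration as well:
#   write_nvidia_disable_config /etc/nomad.d/nvidia.hcl
#   nomad agent -config=/etc/nomad.d
```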

@AndrewChubatiuk
Contributor

@remus-corneliu
I used to run Nomad in Docker, and this is how I managed to do it:

FROM alpine:latest

ARG NOMAD_VERSION=1.0.3
ARG HASHICORP_RELEASES=https://releases.hashicorp.com
ARG GLIBC_VERSION=2.32-r0

# Nomad installation
RUN wget -q -O /etc/apk/keys/sgerrand.rsa.pub https://alpine-pkgs.sgerrand.com/sgerrand.rsa.pub && \
    wget https://github.com/sgerrand/alpine-pkg-glibc/releases/download/${GLIBC_VERSION}/glibc-${GLIBC_VERSION}.apk && \
    apk add --update --no-cache glibc-${GLIBC_VERSION}.apk ca-certificates curl dumb-init gnupg libcap openssl su-exec && \
    gpg --keyserver pgp.mit.edu --recv-keys 91A6E7F85D05C65630BEF18951852D87348FFC4C && \
    mkdir -p /tmp/build && \
    cd /tmp/build && \
    wget ${HASHICORP_RELEASES}/nomad/${NOMAD_VERSION}/nomad_${NOMAD_VERSION}_linux_amd64.zip && \
    wget ${HASHICORP_RELEASES}/nomad/${NOMAD_VERSION}/nomad_${NOMAD_VERSION}_SHA256SUMS && \
    wget ${HASHICORP_RELEASES}/nomad/${NOMAD_VERSION}/nomad_${NOMAD_VERSION}_SHA256SUMS.sig && \
    gpg --batch --verify nomad_${NOMAD_VERSION}_SHA256SUMS.sig nomad_${NOMAD_VERSION}_SHA256SUMS && \
    grep nomad_${NOMAD_VERSION}_linux_amd64.zip nomad_${NOMAD_VERSION}_SHA256SUMS | sha256sum -c && \
    unzip -d /bin nomad_${NOMAD_VERSION}_linux_amd64.zip && \
    rm -rf /tmp/build && \
    apk del gnupg openssl && \
    rm -rf /root/.gnupg

RUN mkdir -p /mnt/data/nomad/data && \
    mkdir -p /nomad/config && \
    mkdir -p /etc/docker

VOLUME /mnt/data/nomad/data

EXPOSE 4646 4647 4648 4648/udp

ENTRYPOINT ["nomad", "agent", "-server"]

@tgross
Member

tgross commented Feb 19, 2021

Looks like this one has been answered and we haven't heard back, so I'm going to close this issue.

@tgross tgross closed this as completed Feb 19, 2021
@tgross tgross removed their assignment Feb 19, 2021
@kunalsingthakur

@tgross this is required if you are running the Nomad binary in an Alpine container.
#10012 (comment)

I'm not sure why the consul, terraform, and consul-template binaries work fine but nomad doesn't.

@tgross
Member

tgross commented Sep 16, 2021

@tgross this is required if you are running the Nomad binary in an Alpine container.
#10012 (comment)

I'm not sure why the consul, terraform, and consul-template binaries work fine but nomad doesn't.

Because Consul, Terraform, and consul-template don't run GPU workloads, they don't depend on the Nvidia driver (that's what nvml is). Nvidia in turn depends on glibc, which doesn't really belong in Alpine. (Take that up with Alpine or Nvidia 😁) #10796 externalizes the driver, but that PR hasn't been merged. Also, just a heads up that I'm not at HashiCorp anymore and this issue has been closed since February. You almost certainly shouldn't be running a Nomad client in a container anyway, but if you want to, you'll want to open a new issue and bring it to the attention of the maintainers.
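
The two dependencies mentioned in this reply (nvml and glibc) show up as two families of missing symbols in the relocation errors: `nvml*` names from the Nvidia management library, and `__*_chk` names, which are glibc's fortified wrappers and are not exported by musl. The following is a rough sketch that counts each family in a captured log. The function name is made up, and the input is assumed to have the `Error relocating <path>: <symbol>: symbol not found` shape seen in this issue.

```shell
# Count missing symbols in "Error relocating" lines read on stdin.
# nvml*    -> Nvidia management library symbols
# __*_chk  -> glibc fortify wrappers (absent from musl)
summarize_relocation_errors() {
    awk -F': ' '
        /Error relocating/ {
            sym = $(NF - 1)   # the symbol sits just before "symbol not found"
            if (sym ~ /^nvml/)           nvml++
            else if (sym ~ /^__.*_chk$/) fortify++
            else                         other++
        }
        END { printf "nvml=%d fortify=%d other=%d\n", nvml, fortify, other }
    '
}

# Hypothetical usage (replace <container> with the actual container name):
#   docker logs <container> 2>&1 | summarize_relocation_errors
```

A log with only `fortify` hits points at the glibc/musl mismatch alone, without any involvement of the Nvidia driver.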

@freimer

freimer commented Oct 1, 2022

I believe running a Nomad client in a container is a very valid use case when you consider GitOps and CI, where you may need to run a job in Nomad during a CI run. It's the same reason you may want to run vault, terraform, packer, consul, or any of the other HashiCorp tools. No, you are not going to run the server or agents, but the client is still a valid use case.

Anyway, this is where I'm at: trying to get all of the HashiCorp tools we use, or may want to use, into a Docker container that serves as the CI run container. Since Nomad is written in Go, like all (most?) HashiCorp tools, it seems it should just work as a single statically linked binary.

What I'm getting is:
Error relocating /usr/local/bin/nomad: __dprintf_chk: symbol not found

@freimer

freimer commented Oct 1, 2022

Apparently there is a community package, and you can just run apk add nomad. It is a community version, though, not one straight from HashiCorp.

@tgross
Member

tgross commented Oct 3, 2022

@freimer this issue has been closed for over a year, and your comment doesn't have anything to do with the Nvidia driver. There's an open issue for packaging Nomad in Docker images: #6487.

@hashicorp hashicorp locked as resolved and limited conversation to collaborators Oct 3, 2022