v0.9.0 appears fundamentally broken on musl/Alpine #5535

Closed
the-maldridge opened this issue Apr 10, 2019 · 9 comments · Fixed by #5537

Comments

@the-maldridge

Nomad version

Can't get this far; the resulting binary is not usable. Built from release tag v0.9.0, though.

Operating system and Environment details

Alpine Linux 64-bit as available in the golang:1.12-alpine docker image.

Issue

Setting aside the FTBFS when using the make dev-ui target (which fails because the expected nodejs is very old compared to the current LTS and current stable releases), the resulting binary is unusable due to problems in the nvidia support, which does not appear possible to disable.

Reproduction steps

$ docker run --rm -it golang:1.12-alpine /bin/sh
# apk add git bash linux-headers g++ make
# mkdir -p src/github.com/hashicorp/nomad
# cd src/github.com/hashicorp/nomad
# git clone -b v0.9.0 https://github.com/hashicorp/nomad.git .
# make dev
# ./bin/nomad version

Attempting to run the built binary provides the following errors:

/go/src/github.com/hashicorp/nomad # ./bin/nomad 
Error relocating ./bin/nomad: nvmlDeviceGetMaxPcieLinkGeneration: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPowerManagementLimit: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetUUID: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPowerUsage: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetCurrentClocksThrottleReasons: symbol not found
Error relocating ./bin/nomad: nvmlSystemGetProcessName: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetMemoryInfo: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetMaxClockInfo: symbol not found
Error relocating ./bin/nomad: nvmlInit_v2: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetDisplayMode: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetAccountingBufferSize: symbol not found
Error relocating ./bin/nomad: nvmlEventSetFree: symbol not found
Error relocating ./bin/nomad: nvmlErrorString: symbol not found
Error relocating ./bin/nomad: nvmlShutdown: symbol not found
Error relocating ./bin/nomad: nvmlEventSetCreate: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPerformanceState: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetGraphicsRunningProcesses: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPcieThroughput: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetMemoryErrorCounter: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetName: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPciInfo_v3: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetHandleByIndex_v2: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetComputeRunningProcesses: symbol not found
Error relocating ./bin/nomad: nvmlSystemGetDriverVersion: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetMinorNumber: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetClockInfo: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetDecoderUtilization: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetMaxPcieLinkWidth: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetUtilizationRates: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetPersistenceMode: symbol not found
Error relocating ./bin/nomad: nvmlDeviceRegisterEvents: symbol not found
Error relocating ./bin/nomad: nvmlEventSetWait: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetTemperature: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetEncoderUtilization: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetDisplayActive: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetBAR1MemoryInfo: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetCount_v2: symbol not found
Error relocating ./bin/nomad: nvmlDeviceGetAccountingMode: symbol not found

Given that these all seem related to GPU scheduling which my organization doesn't need, I assumed I could just pass a build tag and shut this off. To my great disappointment this "feature" seems to be impossible to disable!

Is my best bet to stay on 0.8.7 until these bugs can be resolved?

@the-maldridge changed the title from "Build appears fundamentally broken on musl/Alpine" to "v0.9.0 appears fundamentally broken on musl/Alpine" on Apr 10, 2019
@the-maldridge
Author

Thanks!

@notnoop
Contributor

notnoop commented Apr 10, 2019

Thanks @the-maldridge for raising this. I have added a nonvidia tag for disabling nvidia integration and confirmed that binaries work if compiled in a musl/Alpine image.

It's a bit interesting though - I believe the nvidia library is a dynamically shared library, so its presence wasn't required for glibc based OSes. Do you know what might make musl compilation behave differently?
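
A minimal sketch of building with that tag from a shell. It assumes a checkout that already contains the nonvidia tag and that the binary is built from the repository root with plain go build rather than the project's make targets. The function name and the GO override variable are hypothetical, introduced only for illustration.

```shell
# Hypothetical helper: build Nomad with the nonvidia build tag so the
# resulting binary does not reference the nvml* symbols.
# Assumptions: run from the root of a Nomad checkout that includes the
# nonvidia tag; GO may be set to select a specific go binary.
build_nomad_nonvidia() {
    out="${1:-bin/nomad}"
    "${GO:-go}" build -tags nonvidia -o "$out" .
}
```

Usage would look like `build_nomad_nonvidia && ./bin/nomad version`; an optional first argument changes the output path.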

@the-maldridge
Author

I believe it is because it's a shared library that this error occurs. Compilation occurs just fine, but the resulting binary contains dynamically loaded non-relocatable symbols. I would need to conduct more research to figure out why this is the case, but at the moment I'm still trying to determine if the frontend can be built with a modern yarn/nodejs.

If I had to take a shot in the dark at why this is happening, I'd guess that nvml isn't correctly checking if there are build-time constraints that prevent it from working correctly and so it builds "blind" and then happens to work correctly on glibc systems. The proper fix is probably for them to return a nil implementation if the library can't be loaded.

@notnoop
Contributor

notnoop commented Apr 10, 2019

I see - thanks for the background.

> at the moment I'm still trying to determine if the frontend can be built with a modern yarn/nodejs.

This might be handy for you: #5427. You can skip building the frontend and use the frontend assets from the latest release by setting the ui Go build tag.

@danbst

danbst commented Jul 1, 2019

@the-maldridge @notnoop we had this problem even on a glibc-based build:
NixOS/nixpkgs#63854

$ ldd ./result-bin/bin/nomad 
        linux-vdso.so.1 (0x00007ffd607d1000)
        libpthread.so.0 => /nix/store/avr2x43njlq4kyb1a9zgrh6fih5fq4si-glibc-2.27/lib/libpthread.so.0 (0x00007f93de658000)
        libdl.so.2 => /nix/store/avr2x43njlq4kyb1a9zgrh6fih5fq4si-glibc-2.27/lib/libdl.so.2 (0x00007f93de653000)
        libc.so.6 => /nix/store/avr2x43njlq4kyb1a9zgrh6fih5fq4si-glibc-2.27/lib/libc.so.6 (0x00007f93de49d000)
        /nix/store/avr2x43njlq4kyb1a9zgrh6fih5fq4si-glibc-2.27/lib/ld-linux-x86-64.so.2 => /nix/store/avr2x43njlq4kyb1a9zgrh6fih5fq4si-glibc-2.27/lib64/ld-linux-x86-64.so.2 (0x00007f93de67b000)

$ ./result-bin/bin/nomad -v
./result-bin/bin/nomad: symbol lookup error: ./result-bin/bin/nomad: undefined symbol: nvmlDeviceGetGraphicsRunningProcesses

@holtwilkins
Contributor

holtwilkins commented Aug 1, 2019

Hey @notnoop, I don't think this issue should be closed. It may be the case that you can compile Nomad itself on Alpine for it to work on Alpine, but it seems like a bug if the official Nomad release simply can't execute on Alpine, no?

Apologies, I didn't see that #5643 was already open, this is the bug I was looking for :).

@dposton80

We had a similar issue on our RHEL7 distribution:
./nomad: symbol lookup error: ./nomad: undefined symbol: nvmlDeviceGetGraphicsRunningProcesses

I worked around this by building a temporary nvml.so that stubs out the functions, as follows:

objdump -T  nomad  | grep -o "nvml.*" | sort -u | sed 's/\(nvml.*\)/extern int \1(){ return 1; }/g' > nvml.c
gcc -c nvml.c -fpic
gcc -shared -o nvml.so nvml.o
rm nvml.o nvml.c

Then run nomad with LD_PRELOAD:
LD_PRELOAD=./nvml.so ./nomad

@dposton80

In fact, I discovered that the issue was caused by the environment variable LD_BIND_NOW being set. Running
export -n LD_BIND_NOW
fixed it.
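
For context, on glibc a non-empty LD_BIND_NOW makes the dynamic linker resolve all symbols at program startup instead of on first call, so the unresolved nvml* functions fail immediately. That is consistent with the Alpine report above, since musl does not do lazy binding. Because `export -n` is specific to bash, a more portable way to apply the same fix to a single command might look like the sketch below. The function name is hypothetical.

```shell
# Hypothetical helper: run a command with LD_BIND_NOW removed from its
# environment. The subshell keeps the calling shell's environment intact.
run_without_bind_now() {
    (
        unset LD_BIND_NOW
        "$@"
    )
}
```

For example, `run_without_bind_now ./nomad version` would start the binary with lazy binding available again on glibc, while other programs launched from the same shell still see LD_BIND_NOW.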

@github-actions

github-actions bot commented Nov 5, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2022