Port allocation doesn't work for system jobs with docker driver #8934
Comments
When I try a similar job with:
Glad I found this issue; I thought I was going crazy trying different configuration options (network port stanza and docker config). Nomad version: Nomad v0.12.5
Hi folks! I suspect this was fixed by #8822, which hasn't made it into the changelog for the upcoming 0.13.0 yet. Trying this exact same jobspec on current master:
@tgross I don't think #8822 is the complete fix. I patched that into a build and I continue to get the errors described above:
Is there a known good workaround, short of reverting the Nomad version in the cluster to one with known-functional system networking?
Ok, that's interesting. I've just re-verified it on current master. This wasn't an area I worked on, so I'm going to admit I'm not sure which patch it was that landed between 0.12.5 and master.
Sure. I also tested 0.12.6 with the patch as well, so it's somewhere between 0.12.6 and master. As an aside, it would be really useful if the downloads page included a listing of known defects. Had I known this was an issue, I would not have updated the packages in Void.
@tgross any update from your end? I've bashed my head against enough other issues in 0.12.x that I'm pretty close to rolling back, but if I can provide any more debugging information before I do, I'm happy to send logs your way. Looks like this is reasonably well understood, just bafflingly not backported into a released non-beta version.
Hey @the-maldridge, I just took a look and confirmed that the 0.12.7 release contains this bug. I patched it with #8822 locally, retested, and could not reproduce. #8822 didn't make it into the 0.12 line, but I will check with the team about whether we can backport this one. I know you said you tried applying that patch and still got this error; could you recheck that for me once more, just to be sure I didn't do something wrong on my end?
Sure, I'll give it another try. To be clear, I will apply the patch as generated by GitHub (https://patch-diff.githubusercontent.com/raw/hashicorp/nomad/pull/8822.patch) to the release tarball for 0.12.7 as retrieved from GitHub. If there are other steps you think I should be taking to make sure my build matches yours, let me know.
Looks the same to me; here's what my workspace looks like:
No dice. I can send configs, sample jobs, log entries, pretty much whatever you want; this is Void Linux's cluster, which is entirely on GitHub save for the encryption keys.
Actually the job file is short enough I can just paste it:
For completeness, here's a copy of the built binary. The only change between this binary and the one in prod is that this one was stripped and UPX'd to fit within the GitHub upload limits.
@the-maldridge just doing some follow-up here:
# statically linked... but that's probably because it's upx'd?
$ ldd ./pkg/nomad
not a dynamic executable
# blows up
$ sudo strace ./pkg/nomad version
execve("./pkg/nomad", ["./pkg/nomad", "version"], [/* 18 vars */]) = 0
open("/proc/self/exe", O_RDONLY) = 3
mmap(NULL, 21102734, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe77e6b2000
mmap(0x7fe77e6b2000, 21102336, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x7fe77e6b2000
mprotect(0x7fe77fad1000, 4238, PROT_READ|PROT_EXEC) = 0
readlink("/proc/self/exe", "/opt/gopath/src/github.com/hashi"..., 4095) = 52
mmap(0x400000, 72216576, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x400000
mmap(0x400000, 15048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x400000
mprotect(0x400000, 15048, PROT_READ) = 0
mmap(0x404000, 33536141, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0x4000) = 0x404000
mprotect(0x404000, 33536141, PROT_READ|PROT_EXEC) = 0
mmap(0x2400000, 36807552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0x2000000) = 0x2400000
mprotect(0x2400000, 36807552, PROT_READ) = 0
mmap(0x471b000, 1575833, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0x431a000) = 0x471b000
mprotect(0x471b000, 1575833, PROT_READ|PROT_WRITE) = 0
mmap(0x489c000, 274152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x489c000
open("/lib/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
exit(127) = ?
+++ exited with 127 +++
Very spooky. I have not tried any of the 1.0 builds, as this is a production cluster and the official guidance was that 1.0 was not production stable at the time I was hitting this bug. I have since removed my need for this feature by deploying the task that required it (node_exporter) outside of Nomad, since it would have been very fiddly to get all the host volumes working correctly across the fleet.

The build for Nomad uses this file, and the build command should ultimately boil down to what that template runs. Void's Go binaries are not PIE, not UPX'd, and not stripped (Go is such a fun language to try to run standard packaging processes across). The output of ldd is as follows:
I'll bet that whatever distro you run has the linker at /usr/lib/, whereas for path precedence reasons Void has it at /lib/. Void does maintain Vagrant boxen at https://app.vagrantup.com/voidlinux/. If you want it to be truly up to date, run:
Thanks @the-maldridge. We're in the midst of prepping for 1.0 GA and Nick is on leave, but I'll see if I can get a repro working for you in the next week or so.
Hey @the-maldridge, happy new year. I was discussing this today with the team and @drewbailey pointed me to #9736, which fixes a case where ports aren't persisted correctly on job updates. Were you experiencing this with initial job deployment (what Tim and I tested), or is this when you're updating an existing job in the cluster?
Happy new year to you as well @nickethier! This was being experienced with brand new jobs to the cluster. I'm afraid I don't have a good way to test this anymore: since it wasn't working, I abandoned Docker driver networking entirely and now use CNI plugins, which I found to be a more reliable path.
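For context, "group-level networking" here means declaring the network stanza on the group rather than on the task, with mode = "bridge" handing the allocation's network namespace to the CNI bridge plugins. The sketch below is illustrative only, assuming a node_exporter-style system job; the exact configuration used on the Void cluster isn't shown in this thread.

```hcl
job "node-exporter" {
  datacenters = ["dc1"]
  type        = "system"

  group "metrics" {
    # Group-level network: the allocation gets its own network namespace
    # via the CNI bridge plugins, and port publishing is handled by Nomad
    # rather than by the docker driver.
    network {
      mode = "bridge"

      port "http" {
        static = 9100 # host port (assumption: node_exporter's default)
        to     = 9100 # port inside the allocation's namespace
      }
    }

    task "node-exporter" {
      driver = "docker"

      config {
        # Hypothetical image; the container joins the group's network
        # namespace, so no ports/port_map is needed in the docker config.
        image = "prom/node-exporter:v1.0.1"
      }
    }
  }
}
```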
I'm sorry we weren't able to reproduce on our end @the-maldridge, but I'm glad you've found a solution that's working for you. @sickill, are you still seeing this with the latest release? I believe the above fixes should have solved your original problem, and I'm inclined to close this issue as we're no longer able to reproduce on our end.
I would also be fine with closing; it seems pretty clear that driver-level networking is no longer a well-trodden path, and that the intended mechanisms are CNI with group-level networking.
I tested with the latest release and I confirm it works properly now, so we can close it 👍 Thx!
Nomad version
Nomad v0.12.5 (514b0d6)
Operating system and Environment details
Ubuntu 20.04
Issue
Deploying a system job with the Docker driver and a port specified in the network stanza results in a
Port "http" not found, check network stanza
error reported by the Docker driver.
Reproduction steps
Run the following job file and observe that allocations fail with the above error.
Job file (if appropriate)
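The original job file was not preserved in this copy of the issue. Below is a minimal sketch of the kind of job that triggers the error, assuming a node_exporter-style system job using the 0.12 group network stanza and the docker driver's ports list (the image name and port are illustrative, not taken from the report):

```hcl
job "node-exporter" {
  datacenters = ["dc1"]
  type        = "system"

  group "metrics" {
    network {
      # Port declared at the group level using the 0.12+ network stanza.
      port "http" {
        static = 9100
      }
    }

    task "node-exporter" {
      driver = "docker"

      config {
        image = "prom/node-exporter:v1.0.1"
        # The docker driver fails here with:
        #   Port "http" not found, check network stanza
        ports = ["http"]
      }
    }
  }
}
```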
I also tried the deprecated syntax, with no luck either:
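The deprecated form isn't preserved here either; as a sketch, the pre-0.12 task-level syntax for the same hypothetical task would use a resources.network port plus a docker port_map:

```hcl
job "node-exporter" {
  datacenters = ["dc1"]
  type        = "system"

  group "metrics" {
    task "node-exporter" {
      driver = "docker"

      config {
        image = "prom/node-exporter:v1.0.1"

        # Deprecated docker driver port mapping keyed by the port label.
        port_map {
          http = 9100
        }
      }

      resources {
        # Deprecated task-level network stanza.
        network {
          port "http" {
            static = 9100
          }
        }
      }
    }
  }
}
```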
Nomad Client logs (if appropriate)