Nomad v1.5.0-beta.1 panic: SetServer called twice. #16239

Closed
jdoss opened this issue Feb 22, 2023 · 3 comments · Fixed by #16250

jdoss commented Feb 22, 2023

Nomad version

Nomad v1.5.0-beta.1 (3d735e7)

Operating system and Environment details

PRETTY_NAME="Fedora CoreOS 37.20230122.3.0"

Issue

I use smallstep's Certificate Manager to automate the internal PKI for my Nomad cluster. While testing Nomad v1.5.0-beta.1, I noticed that Nomad panicked after I renewed my short-lived TLS cert and systemd reloaded Nomad.

Reproduction steps

Renew the PKI TLS cert, then HUP Nomad via systemd.

Expected Result

No panic.

Actual Result

A cute panic.

Nomad Server logs (if appropriate)

Feb 22 05:05:39 host kernel: audit: type=1131 audit(1677042339.526:8902): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=step-issue-cert@nomad comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 22 05:05:39 host systemd[1]: Starting [email protected] - Renew smallstep TLS cert...
Feb 22 05:05:39 host step[86588]: The root certificate has been saved in /etc/nomad/tls/nomad-ca.crt.
Feb 22 05:05:39 host step[86607]: Your certificate has been saved in /etc/nomad/tls/nomad.crt.
Feb 22 05:05:39 host systemd[1]: Reloading nomad.service - Nomad...
Feb 22 05:05:39 host nomad[15974]: ==> Caught signal: hangup
Feb 22 05:05:39 host nomad[15974]: ==> Reloading configuration...
Feb 22 05:05:39 host systemd[1]: Reloaded nomad.service - Nomad.
Feb 22 05:05:39 host systemd[1]: [email protected]: Deactivated successfully.
Feb 22 05:05:39 host nomad[15974]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.838Z [INFO]  nomad: reloading server connections due to configuration changes
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.838Z [INFO]  nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.838Z [ERROR] nomad.rpc: failed to accept RPC conn: error="accept tcp 100.71.2.20:4647: use of closed network connection" delay=5ms
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.838Z [INFO]  nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.839Z [INFO]  client.fingerprint_mgr: reloading fingerprinter: fingerprinter=cni
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.839Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=100.71.2.23:4647
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.839Z [INFO]  agent: reloading HTTP server with new TLS configuration
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.841Z [INFO]  agent: requesting shutdown
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.841Z [INFO]  client: shutting down
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.841Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.841Z [INFO]  client.plugin: plugin manager finished: plugin-type=device
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.841Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.844Z [WARN]  nomad.rpc: failed TLS handshake: remote_addr=100.71.2.23:63632 error="tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2023-02-22T05:05:39Z is after 2023-02-21T23:53:46Z"
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.844Z [INFO]  client.driver_mgr: plugin process exited: driver=podman path=/etc/nomad/plugins/nomad-driver-podman pid=16042
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.845Z [INFO]  client.plugin: plugin manager finished: plugin-type=driver
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.845Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.845Z [INFO]  client.plugin: plugin manager finished: plugin-type=csi
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [INFO]  nomad: shutting down server
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [WARN]  nomad: serf: Shutdown without a Leave
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [INFO]  nomad: cluster leadership lost
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [INFO]  nomad.raft: aborting pipeline replication: peer="{Nonvoter 8fc919fc-50ad-5f74-cce4-e28e528eab50 100.71.2.23:4647}"
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [INFO]  nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [INFO]  nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [INFO]  nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [INFO]  nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [INFO]  nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [INFO]  nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [INFO]  nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]:     2023-02-22T05:05:39.846Z [INFO]  agent: shutdown complete
Feb 22 05:05:39 host nomad[15974]: panic: SetServer called twice. first=0xc00017f0e0 second=0xc00017ea50
Feb 22 05:05:39 host nomad[15974]: goroutine 1 [running]:
Feb 22 05:05:39 host systemd[1]: Finished [email protected] - Renew smallstep TLS cert.
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.(*builtinAPI).SetServer(0xc0014447c0, 0xc00017ea50)
Feb 22 05:05:39 host nomad[15974]:         github.com/hashicorp/nomad/command/agent/http.go:551 +0xd9
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.NewHTTPServers(0xc000ea0000, 0xc00134d400)
Feb 22 05:05:39 host nomad[15974]:         github.com/hashicorp/nomad/command/agent/http.go:215 +0xebb
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.(*Command).reloadHTTPServer(0xc0002b0000)
Feb 22 05:05:39 host nomad[15974]:         github.com/hashicorp/nomad/command/agent/command.go:963 +0x9b
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.(*Command).handleReload(0xc0002b0000)
Feb 22 05:05:39 host nomad[15974]:         github.com/hashicorp/nomad/command/agent/command.go:1047 +0x695
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.(*Command).handleSignals(0xc0002b0000)
Feb 22 05:05:39 host nomad[15974]:         github.com/hashicorp/nomad/command/agent/command.go:915 +0xfc
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.(*Command).Run(0xc0002b0000, {0xc0000520a0, 0x6, 0x6})
Feb 22 05:05:39 host nomad[15974]:         github.com/hashicorp/nomad/command/agent/command.go:810 +0xf8a
Feb 22 05:05:39 host nomad[15974]: github.com/mitchellh/cli.(*CLI).Run(0xc0002f4f00)
Feb 22 05:05:39 host nomad[15974]:         github.com/mitchellh/[email protected]/cli.go:262 +0x5f8
Feb 22 05:05:39 host nomad[15974]: main.Run({0xc000052090?, 0x7, 0x7})
Feb 22 05:05:39 host nomad[15974]:         github.com/hashicorp/nomad/main.go:107 +0x350
Feb 22 05:05:39 host nomad[15974]: main.main()
Feb 22 05:05:39 host nomad[15974]:         github.com/hashicorp/nomad/main.go:77 +0x4e
Feb 22 05:05:39 host systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 22 05:05:39 host systemd[1]: nomad.service: Failed with result 'exit-code'.
Feb 22 05:05:39 host audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=nomad comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 22 05:05:39 host systemd[1]: nomad.service: Consumed 1h 8min 43.495s CPU time.
Feb 22 05:05:39 host kernel: audit: type=1131 audit(1677042339.875:8905): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=nomad comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 22 05:05:41 host systemd[1]: nomad.service: Scheduled restart job, restart counter is at 1.
Feb 22 05:05:41 host systemd[1]: Stopped nomad.service - Nomad.
@jdoss jdoss added the type/bug label Feb 22, 2023

tgross commented Feb 22, 2023

Thanks for trying out the beta and opening this issue @jdoss! This panic looks like it's in the new Task API HTTP server introduced recently in #15864. I'm going to tag-in @schmichael on this, because I know he's still looking at a few related items for GA.

@tgross tgross added this to the 1.5.0 milestone Feb 22, 2023

jdoss commented Feb 22, 2023

No problem @tgross! Glad I can help. @schmichael this is very reproducible. I was able to get it to happen again on another node too. The only gotcha might be that I am running two nodes, both as servers and clients.

@schmichael

Thanks so much for the report @jdoss! I suspect it's the reloading (SIGHUP) that's triggering it. I'll get right on it and the fix will be in 1.5-rc1 coming out 🔜
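
For anyone following the stack trace, here is a minimal sketch of how a set-once guard can panic when a reload path rebuilds servers against the same holder. It is not Nomad's code: only the `SetServer` name and the panic message come from the trace above, while the `server` type, `newServers`, and `reloadPanics` are hypothetical names made up for illustration.

```go
package main

import "fmt"

// server is a hypothetical stand-in for an HTTP server handle.
type server struct{ name string }

// builtinAPI is a hypothetical holder whose server may be set only once.
type builtinAPI struct {
	srv *server
}

// SetServer stores the server on first use and panics on any later call,
// using the same message format that appears in the trace.
func (b *builtinAPI) SetServer(s *server) {
	if b.srv != nil {
		panic(fmt.Sprintf("SetServer called twice. first=%p second=%p", b.srv, s))
	}
	b.srv = s
}

// newServers models a constructor that registers a fresh server on the
// shared holder every time it runs (assumption: start and reload share it).
func newServers(api *builtinAPI) *server {
	s := &server{name: "http"}
	api.SetServer(s)
	return s
}

// reloadPanics performs an initial start followed by a reload and reports
// whether the reload step panicked.
func reloadPanics() (panicked bool) {
	api := &builtinAPI{}
	newServers(api) // initial start: first SetServer call succeeds

	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	newServers(api) // reload: second SetServer call on the same holder
	return false
}

func main() {
	fmt.Println("reload panicked:", reloadPanics())
}
```

Under these assumptions, any reload that reruns the constructor without creating a new holder (or without skipping the registration) hits the guard.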

schmichael added a commit that referenced this issue Feb 23, 2023
* agent: only reload HTTP servers that use TLS

* shutdown task api before client and improve names

Fixes #16239
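
The first bullet of the commit message ("only reload HTTP servers that use TLS") can be illustrated with a small sketch. This is an assumption-laden illustration of the idea in the commit title, not the actual change in the referenced commit: the `httpServer` type, its fields, and `reloadTLSServers` are all hypothetical.

```go
package main

import "fmt"

// httpServer is a hypothetical stand-in for a running HTTP server.
type httpServer struct {
	name       string
	usesTLS    bool
	generation int // bumped each time the server is rebuilt
}

// reloadTLSServers returns the server list after a reload. Servers that do
// not use TLS are kept as-is, since they have no certificate to pick up;
// servers that use TLS are replaced with a freshly built instance.
func reloadTLSServers(servers []*httpServer) []*httpServer {
	out := make([]*httpServer, 0, len(servers))
	for _, s := range servers {
		if !s.usesTLS {
			out = append(out, s) // untouched: same pointer survives the reload
			continue
		}
		out = append(out, &httpServer{
			name:       s.name,
			usesTLS:    true,
			generation: s.generation + 1,
		})
	}
	return out
}

func main() {
	servers := []*httpServer{
		{name: "plain", usesTLS: false},
		{name: "secure", usesTLS: true},
	}
	for _, s := range reloadTLSServers(servers) {
		fmt.Printf("%s tls=%v generation=%d\n", s.name, s.usesTLS, s.generation)
	}
}
```

Leaving non-TLS servers alone means any set-once registration tied to them is never repeated on SIGHUP, which is one way a reload could avoid the panic reported here.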