I use smallstep's Certificate Manager to automate the internal PKI for my Nomad cluster. While testing Nomad v1.5.0-beta.1, I noticed that Nomad panicked after I renewed my short-lived TLS cert and systemd reloaded Nomad.
Reproduction steps
Renew the PKI TLS cert, then HUP Nomad via systemd (a service reload).
Expected Result
No panic.
Actual Result
A cute panic.
Nomad Server logs (if appropriate)
Feb 22 05:05:39 host kernel: audit: type=1131 audit(1677042339.526:8902): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=step-issue-cert@nomad comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 22 05:05:39 host systemd[1]: Starting step-issue-cert@nomad.service - Renew smallstep TLS cert...
Feb 22 05:05:39 host step[86588]: The root certificate has been saved in /etc/nomad/tls/nomad-ca.crt.
Feb 22 05:05:39 host step[86607]: Your certificate has been saved in /etc/nomad/tls/nomad.crt.
Feb 22 05:05:39 host systemd[1]: Reloading nomad.service - Nomad...
Feb 22 05:05:39 host nomad[15974]: ==> Caught signal: hangup
Feb 22 05:05:39 host nomad[15974]: ==> Reloading configuration...
Feb 22 05:05:39 host systemd[1]: Reloaded nomad.service - Nomad.
Feb 22 05:05:39 host systemd[1]: step-issue-cert@nomad.service: Deactivated successfully.
Feb 22 05:05:39 host nomad[15974]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.838Z [INFO] nomad: reloading server connections due to configuration changes
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.838Z [INFO] nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.838Z [ERROR] nomad.rpc: failed to accept RPC conn: error="accept tcp 100.71.2.20:4647: use of closed network connection" delay=5ms
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.838Z [INFO] nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.839Z [INFO] client.fingerprint_mgr: reloading fingerprinter: fingerprinter=cni
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.839Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=100.71.2.23:4647
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.839Z [INFO] agent: reloading HTTP server with new TLS configuration
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.841Z [INFO] agent: requesting shutdown
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.841Z [INFO] client: shutting down
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.841Z [INFO] client.plugin: shutting down plugin manager: plugin-type=device
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.841Z [INFO] client.plugin: plugin manager finished: plugin-type=device
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.841Z [INFO] client.plugin: shutting down plugin manager: plugin-type=driver
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.844Z [WARN] nomad.rpc: failed TLS handshake: remote_addr=100.71.2.23:63632 error="tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2023-02-22T05:05:39Z is after 2023-02-21T23:53:46Z"
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.844Z [INFO] client.driver_mgr: plugin process exited: driver=podman path=/etc/nomad/plugins/nomad-driver-podman pid=16042
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.845Z [INFO] client.plugin: plugin manager finished: plugin-type=driver
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.845Z [INFO] client.plugin: shutting down plugin manager: plugin-type=csi
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.845Z [INFO] client.plugin: plugin manager finished: plugin-type=csi
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [INFO] nomad: shutting down server
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [WARN] nomad: serf: Shutdown without a Leave
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [INFO] nomad: cluster leadership lost
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [INFO] nomad.raft: aborting pipeline replication: peer="{Nonvoter 8fc919fc-50ad-5f74-cce4-e28e528eab50 100.71.2.23:4647}"
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [INFO] nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [INFO] nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [INFO] nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [INFO] nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [INFO] nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [INFO] nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [INFO] nomad.rpc: closing server RPC connection
Feb 22 05:05:39 host nomad[15974]: 2023-02-22T05:05:39.846Z [INFO] agent: shutdown complete
Feb 22 05:05:39 host nomad[15974]: panic: SetServer called twice. first=0xc00017f0e0 second=0xc00017ea50
Feb 22 05:05:39 host nomad[15974]: goroutine 1 [running]:
Feb 22 05:05:39 host systemd[1]: Finished step-issue-cert@nomad.service - Renew smallstep TLS cert.
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.(*builtinAPI).SetServer(0xc0014447c0, 0xc00017ea50)
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent/http.go:551 +0xd9
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.NewHTTPServers(0xc000ea0000, 0xc00134d400)
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent/http.go:215 +0xebb
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.(*Command).reloadHTTPServer(0xc0002b0000)
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent/command.go:963 +0x9b
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.(*Command).handleReload(0xc0002b0000)
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent/command.go:1047 +0x695
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.(*Command).handleSignals(0xc0002b0000)
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent/command.go:915 +0xfc
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent.(*Command).Run(0xc0002b0000, {0xc0000520a0, 0x6, 0x6})
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/command/agent/command.go:810 +0xf8a
Feb 22 05:05:39 host nomad[15974]: github.com/mitchellh/cli.(*CLI).Run(0xc0002f4f00)
Feb 22 05:05:39 host nomad[15974]: github.com/mitchellh/[email protected]/cli.go:262 +0x5f8
Feb 22 05:05:39 host nomad[15974]: main.Run({0xc000052090?, 0x7, 0x7})
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/main.go:107 +0x350
Feb 22 05:05:39 host nomad[15974]: main.main()
Feb 22 05:05:39 host nomad[15974]: github.com/hashicorp/nomad/main.go:77 +0x4e
Feb 22 05:05:39 host systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 22 05:05:39 host systemd[1]: nomad.service: Failed with result 'exit-code'.
Feb 22 05:05:39 host audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=nomad comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 22 05:05:39 host systemd[1]: nomad.service: Consumed 1h 8min 43.495s CPU time.
Feb 22 05:05:39 host kernel: audit: type=1131 audit(1677042339.875:8905): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=nomad comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 22 05:05:41 host systemd[1]: nomad.service: Scheduled restart job, restart counter is at 1.
Feb 22 05:05:41 host systemd[1]: Stopped nomad.service - Nomad.
Thanks for trying out the beta and opening this issue @jdoss! This panic looks like it's in the new Task API HTTP server introduced recently in #15864. I'm going to tag-in @schmichael on this, because I know he's still looking at a few related items for GA.
No problem @tgross! Glad I can help. @schmichael this is very reproducible. I was able to get it to happen again on another node too. The only gotcha might be that I am running two nodes, both as servers and clients.
Thanks so much for the report @jdoss! I suspect it's the reloading (SIGHUP) that's triggering it. I'll get right on it and the fix will be in 1.5-rc1 coming out 🔜
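For readers following the stack trace, here is a minimal sketch of the failure pattern it suggests: `handleReload` calls `reloadHTTPServer`, which calls `NewHTTPServers` again, and the second `SetServer` call on a long-lived object panics. This is not Nomad's code. The type `onceAPI` and the functions `newHTTPServers` and `reload` are hypothetical names made up for illustration, and the assumption that the API handle outlives a reload is inferred from the trace only.

```go
package main

import (
	"fmt"
	"net/http"
)

// onceAPI is a hypothetical stand-in for an API handle that expects to be
// bound to exactly one HTTP server for the lifetime of the agent.
type onceAPI struct {
	srv *http.Server
}

// SetServer binds the server and panics if one is already bound.
func (a *onceAPI) SetServer(srv *http.Server) {
	if a.srv != nil {
		panic(fmt.Sprintf("SetServer called twice. first=%p second=%p", a.srv, srv))
	}
	a.srv = srv
}

// newHTTPServers is a hypothetical constructor that runs at startup and,
// in this sketch, again on every reload.
func newHTTPServers(api *onceAPI) *http.Server {
	srv := &http.Server{}
	api.SetServer(srv)
	return srv
}

// reload rebuilds the HTTP server against the same long-lived API handle
// and reports whether doing so panicked.
func reload(api *onceAPI) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	newHTTPServers(api)
	return false
}

func main() {
	api := &onceAPI{}
	newHTTPServers(api)      // startup: the first bind succeeds
	fmt.Println(reload(api)) // reload: the second bind panics, prints "true"
}
```

In this sketch the set-once guard is fine at startup and only trips when a reload path reuses the constructor, which matches the order of frames in the trace above.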
Nomad version
Nomad v1.5.0-beta.1 (3d735e7)
Operating system and Environment details