
panic: failed to apply request nomad.(*nomadFSM).Apply #10943

Closed · Mongey opened this issue Jul 26, 2021 · 4 comments · Fixed by #10952

Mongey (Contributor) commented Jul 26, 2021

Nomad version

v1.0.4 / v1.1.2

Operating system and Environment details

Linux ip-<redacted> 4.9.0-16-amd64 #1 SMP Debian 4.9.272-1 (2021-06-21) x86_64 GNU/Linux

Issue

While upgrading from v1.0.4 to v1.1.2 I encountered some issues and lost cluster leadership, requiring me to perform an Outage Recovery.

We've got 3 server nodes and were replacing them via an AWS instance refresh, going from 3 -> 4, waiting 30 minutes, then back to 3. Something like this:
(screenshot: instance-refresh timeline, 2021-07-26 23:28:59)

When a v1.1.2 node was introduced, the v1.0.4 leader crashed with the following logs:

panic: failed to apply request: []byte{0x2e, 0x84, 0xa6, 0x52, 0x65, 0x67, 0x69, 0x6f, 0x6e, 0xa3, 0x73, 0x74, 0x67, 0xa9, 0x4e, 0x61, 0x6d, 0x65, 0x73, 0x70, 0x61, 0x63, 0x65, 0xa0, 0xa9, 0x41, 0x75, 0x74, 0x68, 0x54, 0x6f, 0x6b, 0x65, 0x6e, 0xda, 0x0, 0x24, 0x30, 0x38, 0x32, 0x62, 0x63, 0x33, 0x35, 0x63, 0x2d, 0x63, 0x64, 0x63, 0x33, 0x2d, 0x65, 0x35, 0x30, 0x39, 0x2d, 0x31, 0x62, 0x65, 0x61, 0x2d, 0x37, 0x34, 0x62, 0x31, 0x63, 0x64, 0x62, 0x38, 0x30, 0x65, 0x65, 0x34, 0xa9, 0x46, 0x6f, 0x72, 0x77, 0x61, 0x72, 0x64, 0x65, 0x64, 0xc2}
--
goroutine 43 [running]:
github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0xc0003c2460, 0xc000ff2190, 0x52c4d20, 0xc037da455feffd00)
	github.com/hashicorp/nomad/nomad/fsm.go:315 +0x1b17
github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc00036b480)
	github.com/hashicorp/[email protected]/fsm.go:90 +0x2c2
github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc000af5800, 0x40, 0x40)
	github.com/hashicorp/[email protected]/fsm.go:113 +0x75
github.com/hashicorp/raft.(*Raft).runFSM(0xc00105e000)
	github.com/hashicorp/[email protected]/fsm.go:219 +0x3c4
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc00105e000, 0xc0013f2310)
	github.com/hashicorp/[email protected]/state.go:146 +0x55
created by github.com/hashicorp/raft.(*raftState).goFunc
	github.com/hashicorp/[email protected]/state.go:144 +0x66

Reproduction steps

I haven't attempted to reproduce this yet.

Expected Result

No 💥 💣

Actual Result

panic.

Job file (if appropriate)

n/a

Nomad Server logs (if appropriate)

See above? (I can get more logs if required)

Nomad Client logs (if appropriate)

n/a

Mongey (Contributor, Author) commented Jul 27, 2021

Interestingly, it looks like the message was an empty structs.OneTimeToken? Decoding the payload with the snippet below gives:

Message ID: 46
OneTimeSecretID: ''
AccessorID: ''
ExpiresAt: '0001-01-01 00:00:00 +0000 UTC'
CreateIndex: '0'
ModifyIndex: '0'
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/nomad/structs"
)

func main() {
	buf := []byte{0x2e, 0x84, 0xa6, 0x52, 0x65, 0x67, 0x69, 0x6f, 0x6e, 0xa3, 0x73, 0x74, 0x67, 0xa9, 0x4e, 0x61, 0x6d, 0x65, 0x73, 0x70, 0x61, 0x63, 0x65, 0xa0, 0xa9, 0x41, 0x75, 0x74, 0x68, 0x54, 0x6f, 0x6b, 0x65, 0x6e, 0xda, 0x0, 0x24, 0x30, 0x38, 0x32, 0x62, 0x63, 0x33, 0x35, 0x63, 0x2d, 0x63, 0x64, 0x63, 0x33, 0x2d, 0x65, 0x35, 0x30, 0x39, 0x2d, 0x31, 0x62, 0x65, 0x61, 0x2d, 0x37, 0x34, 0x62, 0x31, 0x63, 0x64, 0x62, 0x38, 0x30, 0x65, 0x65, 0x34, 0xa9, 0x46, 0x6f, 0x72, 0x77, 0x61, 0x72, 0x64, 0x65, 0x64, 0xc2}

	msgType := structs.MessageType(buf[0])
	var req structs.OneTimeToken
	if err := structs.Decode(buf[1:], &req); err != nil {
		panic(fmt.Errorf("failed to decode request: %v", err))
	}
	fmt.Println("Message ID: ", msgType)
	fmt.Printf("OneTimeSecretID: '%s'\n", req.OneTimeSecretID)
	fmt.Printf("AccessorID: '%s'\n", req.AccessorID)
	fmt.Printf("ExpiresAt: '%s'\n", req.ExpiresAt)
	fmt.Printf("CreateIndex: '%d'\n", req.CreateIndex)
	fmt.Printf("ModifyIndex: '%d'\n", req.ModifyIndex)
}

🤔 I'm not sure how a OneTimeToken message would have been generated -- is it possible it was caused by someone using the UI and hitting a 1.1.2 node, or is it produced automatically?

notnoop (Contributor) commented Jul 27, 2021

Yikes @Mongey! That is terrible indeed. We'll investigate and cut a bug-fix patch shortly.

It looks like the backend for the nomad ui -authenticate feature introduced in 1.1 doesn't cope well with a mixed-version cluster. I suspect that a 1.1.x server became the leader and issued the periodic cleanup command in https://github.com/hashicorp/nomad/blob/v1.1.0/nomad/leader.go#L741-L744. That command should be guarded with a check ensuring all servers are running 1.1.0 or later, and the same goes for the endpoints in https://github.com/hashicorp/nomad/blob/v1.1.0/nomad/acl_endpoint.go#L843-L972.
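Below is a minimal sketch of the kind of version gate described above. It is not Nomad's actual implementation: the serversMeetMinimumVersion helper and the hard-coded server list are hypothetical stand-ins for the leader's real view of its peers. The idea is simply that the leader refuses to submit the new raft entry until every server can decode it.

package main

import (
	"fmt"

	version "github.com/hashicorp/go-version"
)

// Minimum version whose FSM understands the one-time token message types.
var minOneTimeTokenVersion = version.Must(version.NewVersion("1.1.0"))

// serversMeetMinimumVersion reports whether every known server meets the
// minimum version required to decode the new raft message type.
func serversMeetMinimumVersion(servers []*version.Version, min *version.Version) bool {
	for _, v := range servers {
		if v.LessThan(min) {
			return false
		}
	}
	return true
}

func main() {
	servers := []*version.Version{
		version.Must(version.NewVersion("1.1.2")),
		version.Must(version.NewVersion("1.0.4")), // an older follower still in the cluster
	}

	if !serversMeetMinimumVersion(servers, minOneTimeTokenVersion) {
		// Skip the raft apply: a pre-1.1.0 FSM cannot decode the new message
		// type and would panic, which is the crash seen in this issue.
		fmt.Println("skipping one-time token expiration: not all servers are >= 1.1.0")
		return
	}
	fmt.Println("all servers >= 1.1.0; safe to submit the expiration request")
}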

We will follow up with more updates by tomorrow.

I'm very sorry that you ran into this issue and had the outage, and thank you for reaching out. We will follow up with more info shortly.

notnoop (Contributor) commented Jul 27, 2021

Thanks again, @Mongey. We have a fix! I was able to reproduce the behavior by running a 1.1.x leader with 1.0.x followers and issuing nomad ui -authenticate, or by waiting ~10 minutes until it crashed.

Though, I couldn't reproduce the failure while a 1.0.x server remained the leader, and I'm unable to discern a code path that could trigger such a failure. If that is what occurred in your environment, I'd love to have ~15 minutes of logs from all servers if possible. You can redact them and email them to [email protected] if that works better for you than posting them here.

notnoop pushed a commit that referenced this issue Jul 27, 2021
Fix a panic in handling one-time auth tokens, used to support `nomad ui -authenticate`.

If the Nomad leader is running 1.1.x with some servers still on 1.0.x, the pre-1.1.0 servers risk crashing and the cluster may lose quorum. That can happen when the `nomad ui -authenticate` command is issued, or when the leader scans for expired tokens every 10 minutes.

Fixed #10943.
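To make the failure mode in that commit message concrete, here is a simplified sketch of why a pre-1.1.0 FSM panics on the new entry. It is not Nomad's actual fsm.go; the type names and values below are illustrative. An FSM Apply of this shape dispatches on the message type byte at the front of the raft log entry and panics in the default case on anything it does not recognize, which is what took down the 1.0.4 leader here.

package main

import "fmt"

type MessageType uint8

const (
	// Types known to this hypothetical "1.0.x" FSM.
	NodeRegisterRequestType MessageType = iota
	NodeDeregisterRequestType
	// The one-time token types added in 1.1.0 are absent here.
)

// apply mimics the shape of an FSM Apply: read the type byte, dispatch,
// and panic on anything unrecognized.
func apply(buf []byte) interface{} {
	switch MessageType(buf[0]) {
	case NodeRegisterRequestType, NodeDeregisterRequestType:
		return nil // decode buf[1:] and apply the request
	default:
		panic(fmt.Errorf("failed to apply request: %#v", buf))
	}
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("FSM panicked:", r)
		}
	}()
	// 0x2e == 46, the message type decoded earlier in this thread; this FSM
	// has no case for it.
	apply([]byte{0x2e})
}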
github-actions bot commented
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 17, 2022