
panic: failed to apply request nomad.(*nomadFSM).Apply #10943

Closed · Mongey opened this issue Jul 26, 2021 · 4 comments · Fixed by #10952

Mongey (Contributor) commented Jul 26, 2021

Nomad version

v1.0.4 / v1.1.2

Operating system and Environment details

Linux ip-<redacted> 4.9.0-16-amd64 #1 SMP Debian 4.9.272-1 (2021-06-21) x86_64 GNU/Linux

Issue

While upgrading from v1.0.4 to v1.1.2 I encountered some issues and lost cluster leadership, requiring me to perform an Outage Recovery.

We've got 3 server nodes and were replacing them via an AWS instance refresh, going from 3 -> 4, waiting 30 minutes, then back to 3. Something like this:
(screenshot: instance-refresh timeline, 2021-07-26 23:28:59)

When a v1.1.2 node was introduced, the v1.0.4 leader crashed with the following logs:

panic: failed to apply request: []byte{0x2e, 0x84, 0xa6, 0x52, 0x65, 0x67, 0x69, 0x6f, 0x6e, 0xa3, 0x73, 0x74, 0x67, 0xa9, 0x4e, 0x61, 0x6d, 0x65, 0x73, 0x70, 0x61, 0x63, 0x65, 0xa0, 0xa9, 0x41, 0x75, 0x74, 0x68, 0x54, 0x6f, 0x6b, 0x65, 0x6e, 0xda, 0x0, 0x24, 0x30, 0x38, 0x32, 0x62, 0x63, 0x33, 0x35, 0x63, 0x2d, 0x63, 0x64, 0x63, 0x33, 0x2d, 0x65, 0x35, 0x30, 0x39, 0x2d, 0x31, 0x62, 0x65, 0x61, 0x2d, 0x37, 0x34, 0x62, 0x31, 0x63, 0x64, 0x62, 0x38, 0x30, 0x65, 0x65, 0x34, 0xa9, 0x46, 0x6f, 0x72, 0x77, 0x61, 0x72, 0x64, 0x65, 0x64, 0xc2}
--
goroutine 43 [running]:
github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0xc0003c2460, 0xc000ff2190, 0x52c4d20, 0xc037da455feffd00)
	github.com/hashicorp/nomad/nomad/fsm.go:315 +0x1b17
github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc00036b480)
	github.com/hashicorp/[email protected]/fsm.go:90 +0x2c2
github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc000af5800, 0x40, 0x40)
	github.com/hashicorp/[email protected]/fsm.go:113 +0x75
github.com/hashicorp/raft.(*Raft).runFSM(0xc00105e000)
	github.com/hashicorp/[email protected]/fsm.go:219 +0x3c4
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc00105e000, 0xc0013f2310)
	github.com/hashicorp/[email protected]/state.go:146 +0x55
created by github.com/hashicorp/raft.(*raftState).goFunc
	github.com/hashicorp/[email protected]/state.go:144 +0x66

Reproduction steps

I haven't attempted to reproduce this yet.

Expected Result

No 💥 💣

Actual Result

panic.

Job file (if appropriate)

n/a

Nomad Server logs (if appropriate)

See above? (I can get more logs if required)

Nomad Client logs (if appropriate)

n/a

Mongey (Contributor, Author) commented Jul 27, 2021

Interestingly, it looks like the message was an empty structs.OneTimeToken? Decoding the payload with the snippet below gives:

Message ID: 46
OneTimeSecretID: ''
AccessorID: ''
ExpiresAt: '0001-01-01 00:00:00 +0000 UTC'
CreateIndex: '0'
ModifyIndex: '0'
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/nomad/structs"
)

func main() {
	buf := []byte{0x2e, 0x84, 0xa6, 0x52, 0x65, 0x67, 0x69, 0x6f, 0x6e, 0xa3, 0x73, 0x74, 0x67, 0xa9, 0x4e, 0x61, 0x6d, 0x65, 0x73, 0x70, 0x61, 0x63, 0x65, 0xa0, 0xa9, 0x41, 0x75, 0x74, 0x68, 0x54, 0x6f, 0x6b, 0x65, 0x6e, 0xda, 0x0, 0x24, 0x30, 0x38, 0x32, 0x62, 0x63, 0x33, 0x35, 0x63, 0x2d, 0x63, 0x64, 0x63, 0x33, 0x2d, 0x65, 0x35, 0x30, 0x39, 0x2d, 0x31, 0x62, 0x65, 0x61, 0x2d, 0x37, 0x34, 0x62, 0x31, 0x63, 0x64, 0x62, 0x38, 0x30, 0x65, 0x65, 0x34, 0xa9, 0x46, 0x6f, 0x72, 0x77, 0x61, 0x72, 0x64, 0x65, 0x64, 0xc2}

	msgType := structs.MessageType(buf[0])
	var req structs.OneTimeToken
	if err := structs.Decode(buf[1:], &req); err != nil {
		panic(fmt.Errorf("failed to decode request: %v", err))
	}
	fmt.Println("Message ID: ", msgType)
	fmt.Printf("OneTimeSecretID: '%s'\n", req.OneTimeSecretID)
	fmt.Printf("AccessorID: '%s'\n", req.AccessorID)
	fmt.Printf("ExpiresAt: '%s'\n", req.ExpiresAt)
	fmt.Printf("CreateIndex: '%d'\n", req.CreateIndex)
	fmt.Printf("ModifyIndex: '%d'\n", req.ModifyIndex)
}

🤔 I'm not sure how a OneTimeToken message would have been generated -- is it possible it was caused by someone using the UI and hitting a 1.1.2 node, or is it produced automatically?

notnoop (Contributor) commented Jul 27, 2021

Yikes @Mongey! That is terrible indeed. We'll investigate and cut a bug-fix patch shortly.

It looks like the backend for the nomad ui -authenticate feature introduced in 1.1 doesn't cope well with a mixed-version cluster. I suspect that a 1.1.x server became the leader and issued the periodic cleanup command in https://github.com/hashicorp/nomad/blob/v1.1.0/nomad/leader.go#L741-L744. That command should be guarded with a check ensuring all servers are running 1.1.0 or later, and the same goes for the endpoints in https://github.com/hashicorp/nomad/blob/v1.1.0/nomad/acl_endpoint.go#L843-L972.
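Below is a minimal sketch of the kind of version gate described above. It is not Nomad's actual implementation: the serversMeetMinimumVersion helper and the hard-coded server list are hypothetical stand-ins for the leader's real view of its peers. The idea is simply that the leader refuses to submit the new raft entry until every server can decode it.

package main

import (
	"fmt"

	version "github.com/hashicorp/go-version"
)

// Minimum version whose FSM understands the one-time token message types.
var minOneTimeTokenVersion = version.Must(version.NewVersion("1.1.0"))

// serversMeetMinimumVersion reports whether every known server meets the
// minimum version required to decode the new raft message type.
func serversMeetMinimumVersion(servers []*version.Version, min *version.Version) bool {
	for _, v := range servers {
		if v.LessThan(min) {
			return false
		}
	}
	return true
}

func main() {
	servers := []*version.Version{
		version.Must(version.NewVersion("1.1.2")),
		version.Must(version.NewVersion("1.0.4")), // an older follower still in the cluster
	}

	if !serversMeetMinimumVersion(servers, minOneTimeTokenVersion) {
		// Skip the raft apply: a pre-1.1.0 FSM cannot decode the new message
		// type and would panic, which is the crash seen in this issue.
		fmt.Println("skipping one-time token expiration: not all servers are >= 1.1.0")
		return
	}
	fmt.Println("all servers >= 1.1.0; safe to submit the expiration request")
}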

We will follow up with more updates by tomorrow.

I'm very sorry that you ran into this issue and had the outage, and thank you for reaching out. We will follow up with more info shortly.

notnoop (Contributor) commented Jul 27, 2021

Thanks again, @Mongey. We have a fix! I was able to reproduce the behavior by running a 1.1.x leader with 1.0.x followers and issuing nomad ui -authenticate, or by waiting ~10 minutes until it crashed.

Though, I couldn't reproduce the failure while a 1.0.x server remained the leader, and I'm unable to discern a code path that could trigger such a failure. If that is what occurred in your environment, I'd love to have ~15 minutes of logs from all servers if possible. You can redact them and email them to [email protected] if that works better for you than posting them here.

notnoop pushed a commit that referenced this issue Jul 27, 2021
Fix a panic in handling one-time auth tokens, used to support `nomad ui -authenticate`.

If the Nomad leader is running 1.1.x with some servers still on 1.0.x, the pre-1.1.0 servers risk crashing and the cluster may lose quorum. That can happen when the `nomad ui -authenticate` command is issued, or when the leader scans for expired tokens every 10 minutes.

Fixed #10943.
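To make the failure mode in that commit message concrete, here is a simplified sketch of why a pre-1.1.0 FSM panics on the new entry. It is not Nomad's actual fsm.go; the type names and values below are illustrative. An FSM Apply of this shape dispatches on the message type byte at the front of the raft log entry and panics in the default case on anything it does not recognize, which is what took down the 1.0.4 leader here.

package main

import "fmt"

type MessageType uint8

const (
	// Types known to this hypothetical "1.0.x" FSM.
	NodeRegisterRequestType MessageType = iota
	NodeDeregisterRequestType
	// The one-time token types added in 1.1.0 are absent here.
)

// apply mimics the shape of an FSM Apply: read the type byte, dispatch,
// and panic on anything unrecognized.
func apply(buf []byte) interface{} {
	switch MessageType(buf[0]) {
	case NodeRegisterRequestType, NodeDeregisterRequestType:
		return nil // decode buf[1:] and apply the request
	default:
		panic(fmt.Errorf("failed to apply request: %#v", buf))
	}
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("FSM panicked:", r)
		}
	}()
	// 0x2e == 46, the message type decoded earlier in this thread; this FSM
	// has no case for it.
	apply([]byte{0x2e})
}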
github-actions bot commented
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 17, 2022