Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic on nomad startup after upgrading to 1.8.1 #23385

Closed
tomqwpl opened this issue Jun 19, 2024 · 14 comments · Fixed by #23399
Closed

Panic on nomad startup after upgrading to 1.8.1 #23385

tomqwpl opened this issue Jun 19, 2024 · 14 comments · Fixed by #23399
Assignees
Milestone

Comments

@tomqwpl
Copy link

tomqwpl commented Jun 19, 2024

Nomad version

Output from nomad version
Nomad v1.8.1
BuildDate 2024-06-19T06:43:57Z
Revision 5022543

Operating system and Environment details

Windows 11

Issue

Nomad panics on startup having upgraded to 1.8.1, where it didn't before. Previously we were using 1.7.5.

Reproduction steps

Right now I haven't reduced this to a minimal configuration. Our configuration is reasonably complicated. I'm hoping that the stack trace might point things in the right direction.

Expected Result

No panic.

Actual Result

Panic:

goroutine 1287 [running]:
github.com/hashicorp/nomad/client/lib/idset.(*Set[...]).Slice(0xc001083d50?)
	github.com/hashicorp/nomad/client/lib/idset/idset.go:126 +0x1a
github.com/hashicorp/nomad/plugins/base.nomadTopologyToProto(0xc000a60120)
	github.com/hashicorp/nomad/plugins/base/base.go:171 +0x45
github.com/hashicorp/nomad/plugins/base.(*AgentConfig).toProto(0xc000e9c2e0)
	github.com/hashicorp/nomad/plugins/base/base.go:96 +0x47
github.com/hashicorp/nomad/plugins/base.(*BasePluginClient).SetConfig(0xc000e7ae60, 0xc000fe8ff0)
	github.com/hashicorp/nomad/plugins/base/client.go:63 +0x28
github.com/hashicorp/nomad/helper/pluginutils/loader.(*PluginLoader).Dispense(0xc0011083f0, {0xc001110258, 0x12}, {0x34cc4ff, 0x6}, 0xc000e9c2e0, {0x3dbc610, 0xc001060210})
	github.com/hashicorp/nomad/helper/pluginutils/loader/loader.go:186 +0x3d5
github.com/hashicorp/nomad/helper/pluginutils/singleton.(*SingletonLoader).dispense(0x0?, 0xc000cb7080, {0xc001110258?, 0x0?}, {0x34cc4ff?, 0x0?}, 0x0?, {0x3dbc610?, 0xc001060210?})
	github.com/hashicorp/nomad/helper/pluginutils/singleton/singleton.go:109 +0x53
created by github.com/hashicorp/nomad/helper/pluginutils/singleton.(*SingletonLoader).getPlugin in goroutine 272
	github.com/hashicorp/nomad/helper/pluginutils/singleton/singleton.go:85 +0x39d

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@pkazmierczak
Copy link
Contributor

Hi @tomqwpl, thanks for reporting this. Could you at least share the plugins config for your agents?

@tomqwpl
Copy link
Author

tomqwpl commented Jun 19, 2024

Nomad run as server and client combined, so like "dev" mode.

Partial config:

client {
  enabled = true
  template {
    disable_file_sandbox = true
  }
}

server {
  enabled          = true
  bootstrap_expect = 1
  }

plugin "raw_exec" {
  config {
    enabled = true
  }
}

plugin_dir = "our-custom-plugin-dir"


plugin "our-custom-plugin-name" {
    config {
       our-cusotm-plugin-settings
    }
}


Bunch of other stuff that may or may not be relevant but which I don't really want to post here.

@tomqwpl
Copy link
Author

tomqwpl commented Jun 19, 2024

Minimal config:

data_dir = "somewhere"

client {
  enabled = true
}

plugin_dir = "our-nomad-nomad-dir"

plugin "our-nomad-plugin" {
}

Running nomad with nomad agent -config xxx.hcl

Unclear whether the plugin itself matters, but that's not something I can share.

Anything changed in the requirements of a plugin in 1.8? I didn't see anything in the release notes.

@pkazmierczak
Copy link
Contributor

Anything changed in the requirements of a plugin in 1.8?

Can't think of anything but I'll look into it. Do you by any chance use the same config for any OS other than Windows? If yes, do you also have issues there, or is it limited to Windows?

@tomqwpl
Copy link
Author

tomqwpl commented Jun 19, 2024

Same config used on Linux, but I haven't verified whether the same issue occurs on Linux yet.

@tomqwpl
Copy link
Author

tomqwpl commented Jun 19, 2024

Doesn't appear to be an issue on 1.8.0. So appears to be new in 1.8.1.

I'm attempting to try out on Linux.

@tomqwpl
Copy link
Author

tomqwpl commented Jun 20, 2024

Same config appears to work OK on Linux, so this appears to be a Windows only issue

@tomqwpl
Copy link
Author

tomqwpl commented Jun 20, 2024

I can catch it in the debugger, but I don't know enough about nomad to know what here would be wrong:

func (s *Set[T]) Slice() []T 

s is nil

func (st *Topology) GetNodes() *idset.Set[hw.NodeID] {

returns nil because SupportsNUMA returns false (platform is Windows)

That appears to be the surface level error.

func (s *SingletonLoader) dispense(f *future, name, pluginType string, config *base.AgentConfig, logger log.Logger) {

the plugin name here is my custom plugin.

If I run the code at 1.8.0, then in nomadTopologyToProto:

NodeIds:                helper.ConvertSlice(top.NodeIDs.Slice(), func(id hw.NodeID) uint32 { return uint32(id) }),

here top.NodeIDs isn't nil, so this is fine.

The same line in 1.8.1 is:

	NodeIds:                helper.ConvertSlice(top.GetNodes().Slice(), func(id hw.NodeID) uint32 { return uint32(id) }),

The GetNodes method returns nil because of SupportsNuma being false. So this issue appears to have been caused by commit 2eda8d1?

@pkazmierczak
Copy link
Contributor

Wow, thanks for the investigation @tomqwpl, I'll get right on it.

@pkazmierczak pkazmierczak self-assigned this Jun 20, 2024
@pkazmierczak pkazmierczak added this to the 1.8.2 milestone Jun 20, 2024
@tgross tgross moved this to In Progress in Nomad - Community Issues Triage Jun 24, 2024
@pkazmierczak
Copy link
Contributor

hey @tomqwpl, I'm really struggling to reproduce the issue. No matter what I do I can't get my Nomad dev build (or the official Nomad 1.8.1) to panic. I think I found some buggy code path that I fixed here: #23399 but again—without knowing your plugin code it's very hard for me to test. Do you mind trying that branch on your plugin configuration? If you could share whether that affects your panic that'd be a good data point for me. Sorry for getting back to you so late.

@tomqwpl
Copy link
Author

tomqwpl commented Jun 25, 2024

Hmm. Interesting. I didn't imagine it would be plugin code specific.
It'll take me a couple of days to get to that, so you might have to bear with me. I might strip down a plugin and see if it makes a difference what plugin I use.

@tomqwpl
Copy link
Author

tomqwpl commented Jul 4, 2024

@pkazmierczak Yes, the lines of code you removed are the cause of the nil. So with that version I no longer get a panic.

@pkazmierczak
Copy link
Contributor

@tomqwpl good stuff! I'll add some tests to cover that code path on windows, and we'll try to release this in 1.8.2

Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Development

Successfully merging a pull request may close this issue.

2 participants