Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash when making script check after client restart #23875

Closed
sorenisanerd opened this issue Aug 26, 2024 · 5 comments · Fixed by #23917
Closed

crash when making script check after client restart #23875

sorenisanerd opened this issue Aug 26, 2024 · 5 comments · Fixed by #23917

Comments

@sorenisanerd
Copy link
Contributor

sorenisanerd commented Aug 26, 2024

Haven't had a chance to dig into it very much, but I got a bunch of these after upgrading from 1.7.7 to 1.8.3.

2024-08-25 20:24:02.951	panic: runtime error: invalid memory address or nil pointer dereference
2024-08-25 20:24:02.951	[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1b6a957]
2024-08-25 20:24:02.952	goroutine 2398 [running]:
2024-08-25 20:24:02.952	github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).Exec(0x0, 0x2540be400, {0xc000af5780, 0x7}, {0xc0049166e0, 0x2, 0xc004554960?})
2024-08-25 20:24:02.952	github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:70 +0xf7
2024-08-25 20:24:02.953	github.com/hashicorp/nomad/client/allocrunner/taskrunner/template.(*TaskTemplateManager).processScript(0xc0012bf2c0, 0xc0047fc2c0, 0xc001186000?)
2024-08-25 20:24:02.953	github.com/hashicorp/nomad/client/allocrunner/taskrunner/template/template.go:595 +0x99
2024-08-25 20:24:02.953	created by github.com/hashicorp/nomad/client/allocrunner/taskrunner/template.(*TaskTemplateManager).handleChangeModeScript in goroutine 2161
2024-08-25 20:24:02.953	github.com/hashicorp/nomad/client/allocrunner/taskrunner/template/template.go:563 +0x47

If I'm reading that stacktrace correctly, it happens here:

res, err := h.driver.ExecTask(h.taskID, command, timeout)

Both memory accesses in that line are relative to h, i.e. the *nomad.client.allocrunner.taskrunner.DriverHandle the method is bound to. The stack trace indicates that h is nil.

The call to .Exec is here (last line):

if tm.handle == nil {
failureMsg := fmt.Sprintf(
"Template failed to run script %v with arguments %v because task driver handle is not available",
script.Command,
script.Args,
)
tm.handleScriptError(script, failureMsg)
return
}
_, exitCode, err := tm.handle.Exec(script.Timeout, script.Command, script.Args)

I'm not sure how we end up with a tm.handle that is nil in .Exec when there's a check for it right before. Perhaps we just need to grab tm.handleLock?

@Juanadelacuesta
Copy link
Member

Hi @sorenisanerd! Thank you for reporting this issue, unfortunately we have not been able to replicate it, can you please provide some more information about it?

@tgross
Copy link
Member

tgross commented Sep 4, 2024

I suspect we're seeing this issue indirectly as a result of the work I did in #23663 which shipped in Nomad 1.8.3. It looks like the bug may have already existed but never got triggered because the handle was always nil during restarts instead of being restored.

The triggering scenario here is likely that the task handle was restored but then removed concurrently just as the change script was firing, so you end up with this weird case. And the fix looks like it should be to use the lock on the handle. I can push up a PR real quick on that, but it might take a little bit of effort to get it properly tested. @sorenisanerd if you have any notes on reproducing, that'd be helpful.

Edit: #23917

tgross added a commit that referenced this issue Sep 4, 2024
In #23663 we fixed the template hook so that `change_mode="script"` didn't lose
track of the task handle during restores. But this revealed a second bug which
is that access to the handle is not locked while in use, which can allow it to
be removed concurrently.

Fixes: #23875
@sorenisanerd
Copy link
Contributor Author

@tgross, this constituted a minor outage for me, so I didn't have the luxury of being able to debug properly. I frantically went through the jobs that appeared in the logs immediately before the crashes and changed their templates to change_mode = "noop", so they would follow a different code path. I haven't had a chance to change them back to see if the problem reappears.

FWIW, I've seen #15851 (referenced from #23663 that you mentioned) a bunch of times in the past. I guess they might be different manifestations of the same underlying issue.

Either way, I'll see if I can trigger it again. If I can do so reliably, I'll apply #23917 and see how it goes. Are there any working hypotheses that you'd like for me to test?

tgross added a commit that referenced this issue Sep 12, 2024
In #23663 we fixed the template hook so that `change_mode="script"` didn't lose
track of the task handle during restores. But this revealed a second bug which
is that access to the handle is not locked while in use, which can allow it to
be removed concurrently.

Fixes: #23875
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Sep 12, 2024
@tgross tgross changed the title 1.8.3 crash crash when making script check after client restart Sep 12, 2024
@tgross
Copy link
Member

tgross commented Sep 12, 2024

I've merged #23917 and that'll go out in the next release of Nomad (with backports for Nomad Enterprise). We still haven't figured out the code path that could trigger the NPE but the lock I've added should certainly take care of it.

Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 11, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Development

Successfully merging a pull request may close this issue.

6 participants
@sorenisanerd @tgross @Juanadelacuesta and others