crash when making script check after client restart #23875
Comments
Hi @sorenisanerd! Thank you for reporting this issue. Unfortunately, we have not been able to replicate it. Can you please provide some more information about it?
I suspect we're seeing this issue indirectly as a result of the work I did in #23663, which shipped in Nomad 1.8.3. It looks like the bug may have already existed but never got triggered, because the handle was always nil during restarts instead of being restored. The triggering scenario here is likely that the task handle was restored but then removed concurrently just as the change script was firing, so the nil check passes and the handle is nil by the time it is used. The fix looks like it should be to use the lock on the handle. I can push up a PR real quick on that, but it might take a little bit of effort to get it properly tested. @sorenisanerd if you have any notes on reproducing, that'd be helpful.

Edit: #23917
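The check-then-use race described above, and the general shape of a lock-based fix, can be sketched roughly as follows. This is not the code from #23917: the `manager` and `execHandle` types, `setHandle`, `runScript`, and `errNoHandle` are hypothetical stand-ins, and only the `handle`/`handleLock` field names are borrowed from the issue.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// execHandle is a hypothetical stand-in for the task's driver handle.
type execHandle struct{ id string }

// Exec dereferences h, so it panics if h is nil.
func (h *execHandle) Exec(cmd string) string { return h.id + " ran " + cmd }

// manager is a hypothetical stand-in for the template manager; only the
// handle and handleLock field names come from the issue.
type manager struct {
	handleLock sync.Mutex
	handle     *execHandle
}

var errNoHandle = errors.New("task handle not available")

// setHandle installs or clears the handle while holding the lock.
func (m *manager) setHandle(h *execHandle) {
	m.handleLock.Lock()
	defer m.handleLock.Unlock()
	m.handle = h
}

// runScript holds the lock across both the nil check and the call, so a
// concurrent setHandle(nil) cannot land between them.
func (m *manager) runScript(cmd string) (string, error) {
	m.handleLock.Lock()
	defer m.handleLock.Unlock()
	if m.handle == nil {
		return "", errNoHandle
	}
	return m.handle.Exec(cmd), nil
}

func main() {
	m := &manager{}
	m.setHandle(&execHandle{id: "task-1"})

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); m.setHandle(nil) }() // handle removed concurrently
	go func() {
		defer wg.Done()
		out, err := m.runScript("reload") // either succeeds or returns errNoHandle
		fmt.Println(out, err)
	}()
	wg.Wait()
}
```

Whichever goroutine wins, the script either runs against a valid handle or gets `errNoHandle`; it never calls `Exec` on a nil handle. Without the lock in `runScript`, the nil check and the call could observe different values of `m.handle`.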
@tgross, this constituted a minor outage for me, so I didn't have the luxury of being able to debug properly. I frantically went through the jobs that appeared in the logs immediately before the crashes and changed their templates to FWIW, I've seen #15851 (referenced from #23663 that you mentioned) a bunch of times in the past. I guess they might be different manifestations of the same underlying issue. Either way, I'll see if I can trigger it again. If I can do so reliably, I'll apply #23917 and see how it goes. Are there any working hypotheses that you'd like for me to test?
I've merged #23917 and that'll go out in the next release of Nomad (with backports for Nomad Enterprise). We still haven't figured out the code path that could trigger the NPE, but the lock I've added should certainly take care of it.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Haven't had a chance to dig into it very much, but I got a bunch of these after upgrading from 1.7.7 to 1.8.3.
If I'm reading that stacktrace correctly, it happens here:
nomad/client/allocrunner/taskrunner/driver_handle.go
Line 70 in bc90bd7
Both memory accesses in that line are relative to `h`, i.e. the `*nomad.client.allocrunner.taskrunner.DriverHandle` the method is bound to. The stack trace indicates that `h` is `nil`.

The call to `.Exec` is here (last line):
nomad/client/allocrunner/taskrunner/template/template.go
Lines 586 to 595 in bc90bd7

I'm not sure how we end up with a `tm.handle` that is `nil` in `.Exec` when there's a check for it right before. Perhaps we just need to grab `tm.handleLock`?
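For context on why the trace points inside the method rather than at the call site: Go allows calling a pointer-receiver method on a nil pointer, and the panic only happens once the method dereferences the receiver. A minimal sketch, where the `handle` type and `callExec` helper are hypothetical and merely stand in for `DriverHandle` and its caller:

```go
package main

import "fmt"

// handle is a hypothetical stand-in for DriverHandle.
type handle struct {
	taskID string
}

// Exec can be invoked on a nil *handle; the panic occurs here, at the
// field access, not at the call site.
func (h *handle) Exec(cmd string) string {
	return h.taskID + ": " + cmd
}

// callExec invokes Exec and converts a panic into an error so the
// behavior can be observed without crashing the program.
func callExec(h *handle) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	_ = h.Exec("true")
	return nil
}

func main() {
	var h *handle // nil, as the stack trace suggests
	fmt.Println(callExec(h))
}
```

This matches the observation that both memory accesses on the failing line are relative to `h`: the call itself succeeds, and the nil pointer dereference surfaces inside the method body.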