Manually drained node got resumed after watchdog timer was unable to terminate hung NHC process #134
Comments
I've been looking at this, and something very odd is going on here. Before I get into that, though, I would highly recommend that you upgrade NHC, at least to the 1.4.3 release. The 1.4.2 release lacks numerous improvements, enhancements, and fixes; the 1.4.3 release came out just over a year ago and has been used reliably at numerous large centers, including LANL in particular. If you're brave (or have a testbed to play with), I'd encourage you to at least try the most recent code or the primary development branch for the upcoming 1.5 release. (The former is just an unmerged branch off the latter; they're almost identical, except that the unmerged branch contains a fix -- I hope -- for #104.) I obviously can't guarantee upgrading will fix the issue you're seeing, but I can tell you it has a lot more robust code for handling Slurm and its various node states.

As for the situation above... was anything left out of the log, or is that a complete, unedited excerpt? I ask because the 7th line shows […]. In fact, I make it a point to do as little hostname manipulation as possible; it's very easy to wind up with a decidedly confused Slurm and/or NHC! 😁 I see other lines with that same mismatch, so I'm not sure what's going on. It may not be significant anyway... just something that stood out.

I also noticed that the 8th line has another anomaly: notice the extra space between "Marking" and the hostname? That message comes from node-mark-offline:82, so somehow the […]. So you see why I'm so confused! 😄

If you're able to shed any light on the above, that might help me figure out what transpired. Or, if you're willing, try one of the upgrade recommendations above. Barring that (and assuming you can easily reproduce the above behavior!), if you could change […].

Good luck; I hope this helps!
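To make the anomaly concrete: the offline-marking helper broadly does something like the sketch below. This is a simplified, hypothetical reconstruction, not the actual node-mark-offline source; the variable names (NOTE_PREFIX, OLD_NOTE, STATE) are illustrative only. It shows both how a non-NHC note is normally protected from being overwritten and how an empty variable interpolated into the log message would produce exactly the kind of doubled space described above.

```bash
#!/bin/bash
# Hypothetical sketch of an offline-marking helper (NOT the real node-mark-offline).

HOSTNAME="$1"
NOTE="$2"
NOTE_PREFIX="NHC:"    # marks drain reasons that NHC "owns" and may later clear

# Look up the node's current state and drain reason from Slurm.
read -r STATE OLD_NOTE <<< "$(sinfo -h -n "$HOSTNAME" -o '%t %E')"

case "$STATE" in
    drain*|drng*|down*)
        # Node already offline; only touch the reason if NHC set it.
        if [[ "$OLD_NOTE" != "$NOTE_PREFIX"* ]]; then
            echo "Not overwriting non-NHC note on $HOSTNAME: $OLD_NOTE"
            exit 0
        fi
        ;;
esac

# If $STATE were empty here (e.g., the lookup matched no node because of a
# hostname mismatch), the message below would render with a doubled space:
# "Marking  node0123 offline ...".
echo "Marking $STATE $HOSTNAME offline: $NOTE_PREFIX $NOTE"
scontrol update NodeName="$HOSTNAME" State=DRAIN Reason="$NOTE_PREFIX $NOTE"
```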
Hi @mej, thanks for the reply. Yes, this is indeed an unedited excerpt directly from […].

I can't explain the hostname mismatch either; I just looked through the code, and HOSTNAME_S is seemingly almost never used, and our nhc.conf isn't overriding it in any way either. The issue is that this has been a one-time occurrence so far. I can't really reproduce it, because with regular NHC errors it doesn't overwrite the note, see here:
But even here, there's the hostname mismatch issue you mention. Overall, I mostly just wanted to ask whether this is something that has happened before, and if it hasn't, whether it's actually a bug in the script itself or a weird oddity in our setup. I'm happy to take your advice about just upgrading NHC.
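A low-effort way to catch a recurrence, rather than trying to force one, is to record who owns the drain reason before and after an NHC run and compare. This is a minimal sketch, assuming standard Slurm tooling and a default NHC install path; the node name is a placeholder:

```bash
# Capture the drain reason, the user who set it, and the timestamp before an NHC run.
sinfo -R --nodes=node0123
scontrol show node node0123 | grep -i reason

# Then run NHC by hand with debug output and compare the reason afterwards.
/usr/sbin/nhc -d
sinfo -R --nodes=node0123
```

If the user on the reason ever flips from the operator's account to root (or whatever account runs NHC), that is the overwrite happening.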
Does it happen reliably when the watchdog timer times out? If so, reducing the timeout or increasing the checks' runtime should be able to trigger it. Barring that, the good news is that whatever alteration is going on, it appears to be contained within the […].

Unless we can find a scenario that reproduces the problem reliably, chances are that the only way to track this thing down is going to be adding the […].
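If someone does want to try forcing the watchdog path, a rough sketch of one way to do it follows. It assumes NHC's support for custom check functions dropped into its scripts directory and the -t command-line timeout; the file name and check name are made up purely for this test, so verify the paths against your install before trying it on anything you care about.

```bash
# 1) Hypothetical custom check that deliberately hangs -- for testing only.
#    Drop this into a file NHC sources, e.g. /etc/nhc/scripts/zzz_hang_test.nhc
function check_hang_test() {
    sleep 600      # outlive the watchdog so the timeout path has to run
}

# 2) Reference it from nhc.conf on the test node (remove afterwards):
#      * || check_hang_test
#
# 3) Run NHC by hand with a short watchdog timeout and debug output:
#      nhc -t 30 -d
```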
Yes, very strange.
I have never observed this behavior myself, nor has anyone else reported it to me, to the best of my recollection. As far as what is to blame, I'm really not sure. In theory, the […].
I think it's worth a try. For one thing, I just rewrote a major chunk of the watchdog timer code, so troubleshooting the old code would be of less value in the long term than testing out the newer, more robust (hopefully) code. I know it can be a lot to ask -- replacing stable, well-seasoned code with newer, less tested code -- so if that's not an option, I totally understand. 😃 And I'm not one of those "run the latest code or I can't help you" types of people, so don't worry about that!

Additionally, I already have an open Issue, #129, for figuring out how to improve the hostname situation. One of the most common problems I do hear about is handling mismatched hostnames -- where the canonical hostname in the kernel UTS namespace (i.e., […]).
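For sites hitting the mismatched-hostname problem in the meantime, one possible workaround is to pin NHC's idea of the hostname explicitly so it matches Slurm's short NodeName. This is only a sketch, assuming the installed NHC version honors a HOSTNAME set in its sysconfig file; verify that against your version (and against #129) before relying on it.

```bash
# /etc/sysconfig/nhc  -- hypothetical override; confirm your NHC honors this.
# Force NHC's notion of the node name to match Slurm's short NodeName so the
# drain/online helpers act on the node Slurm actually knows about.
HOSTNAME="$(hostname -s)"
```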
Hi everyone
We are running lbnl-nhc-1.4.2-1.el7 on CentOS 7 with SLURM 20.11.9 and recently had a problematic node that we manually drained. Suddenly we noticed that it was failing a bunch of jobs and was back in the alloc state again. We then found this in the log:

[…]
As you can see from these logs, the server got resumed by NHC: because the watchdog timer was unable to terminate the hung NHC process, NHC overwrote the manual, non-NHC note with its own note about the watchdog timer. On a later run it no longer found anything wrong with the server, so it removed the now-NHC-owned note and resumed the node.
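In other words, the resume decision hinges entirely on who appears to own the drain reason. The following is a hypothetical sketch of that decision, not the actual node-mark-online code (NOTE_PREFIX and the exact state handling are illustrative); it shows why overwriting the operator's note is enough to let a later clean run bring the node back.

```bash
#!/bin/bash
# Hypothetical sketch of the resume path (NOT the real node-mark-online).
# Once the watchdog rewrote the drain reason with an NHC-owned prefix, a later
# clean NHC run sees a reason it "owns" and clears the drain.

HOSTNAME="$1"
NOTE_PREFIX="NHC:"

read -r STATE REASON <<< "$(sinfo -h -n "$HOSTNAME" -o '%t %E')"

if [[ "$STATE" == drain* && "$REASON" == "$NOTE_PREFIX"* ]]; then
    # Reason belongs to NHC, so NHC considers it safe to resume the node.
    scontrol update NodeName="$HOSTNAME" State=RESUME
else
    echo "Reason on $HOSTNAME not owned by NHC; leaving node drained: $REASON"
fi
```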
I read through the code, and something like this seemingly isn't anticipated. Is this considered an anomaly that shouldn't happen?