bug: batch job reap failed: error="rpc error: Permission denied" #8658
Comments
Hi @alexdulin!
Just for clarity, are these batch jobs ones that had CSI volumes attached?
No, they are not. We have not had CSI volumes/plugins registered for a long time (since it was initially released). After trying out CSI volumes we started to receive many errors that only finally stopped after the 0.12.2 release (thank you very much for that, by the way), but those were different from what I posted above.
Ok, thanks. We'll look into the ACL errors you're seeing.
I just took a quick pass through the section of code involved. The GC job passes the leader's ACL token to the batch deregister RPC. Resolving the token can't raise a "permission denied", only a "token not found" (and even that can't be hit if we have the legit leader ACL token as expected). So it looks like the permission denied error is bubbling up from somewhere further along the call path.

Is this cluster running Nomad Enterprise by chance? Or if not, was this cluster previously running Nomad Enterprise?
No, it's running the open source version. It never was Enterprise.
Understood. The log entry says it's trying to reap 6 jobs:
Given that you said this has been going on for a while, do you have a particular set of jobs that you'd expect to have been garbage collected but aren't getting reaped? If so, would you be willing to share the results of the Job API against some of those jobs? It might help us reproduce the issue on our end.
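(For reference, a minimal sketch of pulling that Job API output with the Go client; `example-batch` is a placeholder job ID, and this assumes `NOMAD_ADDR` and `NOMAD_TOKEN` are set in the environment.)

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig picks up NOMAD_ADDR and NOMAD_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "example-batch" is a placeholder for one of the jobs that never gets reaped.
	job, _, err := client.Jobs().Info("example-batch", nil)
	if err != nil {
		log.Fatal(err)
	}

	out, _ := json.MarshalIndent(job, "", "  ")
	fmt.Println(string(out))
}
```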
At one point I tried to fix this by deleting all of our batch jobs, waiting for them to be fully GC'ed (waited 72 hours), and then seeing if the logs continued, and they did. Since then we have re-enabled the jobs, but there is no specific set that would strike me as causing this. Out of curiosity, does a forced garbage collection get considered a batch job? The warnings that are being logged by the raft leader seem to indicate that forced GCs are at least a job in terms of an evaluation:
When we were having the initial CSI errors, GCs were failing due to volumes still being associated with allocations (nodes failed to drain too, which has been documented in other issues). Is it possible these six "batch jobs" are six failed forced GCs that are stuck in the bolt database or something?
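(For reference, the forced GCs mentioned above go through the system GC endpoint; a minimal sketch with the Go client, assuming `NOMAD_ADDR` and `NOMAD_TOKEN` are set in the environment.)

```go
package main

import (
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig picks up NOMAD_ADDR and NOMAD_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Ask the servers to run a garbage collection pass immediately
	// (the same endpoint `nomad system gc` calls).
	if err := client.System().GarbageCollect(); err != nil {
		log.Fatal(err)
	}
}
```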
That's a good insight! Almost: Nomad runs GC tasks as "core" jobs, which are similar to batch jobs, but their evaluations result in internal functions being executed instead of workloads running on clients. But yes, they do result in evaluations we have to process. So from the logs, what I'm seeing is the following:
The log entry there is maybe a little misleadingly worded. The thing that failed is "reaping a batch of jobs", not "reaping a batch job". The ACL the core jobs get is a "management" token. I'll see if I can get the ACLs into a state where the management token is rejected for some reason.
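(As an aside, a token's type is easy to check from the ACL self endpoint, which reports whether a token resolves as a "management" or "client" token; a minimal sketch with the Go client, assuming `NOMAD_ADDR` and `NOMAD_TOKEN` point at the cluster and the token in question.)

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig picks up NOMAD_ADDR and NOMAD_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Self returns details for the token the request was made with,
	// including whether it is a "management" or "client" token.
	token, _, err := client.ACLTokens().Self(nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("accessor=%s type=%s policies=%v\n", token.AccessorID, token.Type, token.Policies)
}
```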
This might be unrelated to the current issue, but after upgrading to 0.12.3 we are now seeing CSI errors in the logs again:
That is from a follower server. We do not have any CSI plugins running or any CSI volumes registered, but we do still have many old CSI-related evaluations left over from when we first tried it out, like the following:

```json
{
  "ID": "ff91f2c4-374e-f2f6-81a0-9b8e69543875",
  "Namespace": "default",
  "Priority": 200,
  "Type": "_core",
  "TriggeredBy": "failed-follow-up",
  "JobID": "csi-volume-claim-gc:[REDACTED]:no",
  "Status": "pending",
  "Wait": 341929846579,
  "PreviousEval": "a708d4e5-0a22-ab2c-d494-18d70209131f",
  "CreateIndex": 6497734,
  "ModifyIndex": 6497734,
  "CreateTime": 1597254886540115403,
  "ModifyTime": 1597254886540115403
}
```

And after upgrading the rest of the cluster to 0.12.3, once a new leader was elected this started to show up in its logs:
That `failed-follow-up` eval is the key piece. Normally when an eval fails more than the delivery limit, the leader will create a new eval with the `failed-follow-up` trigger. But evaluations for core jobs carry the leader's ACL, which is not valid on another leader after an election. The result is that these core-job follow-up evals can never be processed and just stick around.

I've got #8682 open with a failing test that demonstrates the behavior, and I'm working up a fix for it now.
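(For anyone wanting to check whether their cluster has the same leftover evals, a rough sketch with the Go client that lists evaluations and filters for pending `failed-follow-up` core-job evals like the one above; it assumes `NOMAD_ADDR` and `NOMAD_TOKEN` are set in the environment.)

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig picks up NOMAD_ADDR and NOMAD_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	evals, _, err := client.Evaluations().List(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Look for the pattern described above: core-job evals re-queued as
	// failed-follow-up and stuck in a pending state.
	for _, ev := range evals {
		if ev.Type == "_core" && ev.TriggeredBy == "failed-follow-up" && ev.Status == "pending" {
			fmt.Printf("%s  job=%s  status=%s\n", ev.ID, ev.JobID, ev.Status)
		}
	}
}
```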
That is very interesting and explains why these old evaluations have been hanging around for so long. Looking forward to the fix!
The fix for this will ship in the upcoming Nomad 0.12.4.
Amazing, thank you for fixing this so quickly @tgross. Appreciate the effort. |
Nomad version
Nomad v0.12.2 (ee69b33)
Operating system and Environment details
Linux ip-172-31-63-158 4.15.0-1079-aws #83~16.04.1-Ubuntu SMP Tue Jul 14 15:31:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Issue
Nomad servers are failing to reap jobs due to RPC errors. We have been observing these errors in the logs since 0.11.0, and I was hoping that the CSI fixes in 0.12.2 might resolve them, but it does not appear that is the case. It is not clear that the warnings in the leader's logs are related to what is showing up in the follower servers' logs, but they occur at the same frequency and have not stopped in months.
The cluster has full ACLs, TLS, and RPC encryption in place.
Nomad Server logs (if appropriate)
Logs from follower servers
Logs from leader