raw_exec job started 2.5 hours after being scheduled on Windows #2563
This is how the allocation looks after the job suddenly started:
Can you give the output of:
@capone212 That should fix it! Master should be bug-free if you want to build and roll out master!
@dadgar thanks a lot for your quick fix!!!
Hi @dadgar, I have just noticed that unfortunately the fix does not help. Good news though: I can now reproduce it easily in my environment.
At that point, in roughly 30-50% of cases, we will have a task stuck in the pending state.
From logs:
From allocations:
Please find the full log and full allocations attached. @dadgar, if you need any help please let me know. As I said, I can reproduce it and am eager to test patches.
And yes, after several hours, the job finally started!
The essence of what is going on:
The allocation started only after GC of the previous allocation.
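The observation above can be modeled with a small sketch. This is not Nomad's actual code; the predicates below are hypothetical, and only illustrate the reported behavior: the replacement allocation should unblock as soon as the previous allocation reaches a terminal status, not only once the previous allocation has been garbage-collected.

```python
from dataclasses import dataclass

@dataclass
class Alloc:
    id: str
    status: str = "pending"   # pending | running | complete | failed | lost

def is_terminal(alloc: Alloc) -> bool:
    return alloc.status in ("complete", "failed", "lost")

# Buggy predicate (matching the report): the replacement alloc only
# unblocks once the previous alloc has been GC'd (i.e. is gone entirely).
def can_start_buggy(prev_alloc):
    return prev_alloc is None   # None == already garbage-collected

# Expected predicate: unblock as soon as the previous alloc is terminal.
def can_start_fixed(prev_alloc):
    return prev_alloc is None or is_terminal(prev_alloc)

prev = Alloc("2748409b", status="complete")
assert not can_start_buggy(prev)   # stuck until GC runs, hours later
assert can_start_fixed(prev)       # would start immediately
```

With the buggy check, a completed-but-not-yet-collected allocation keeps its replacement blocked until the periodic GC happens to run, which matches the multi-hour delay seen here.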
Hmm, thanks for the great repro steps! Will get this fixed by 0.6.0 and will update the issue when we have a patch for you to test!
Sorry for the delay, I've been trying to reproduce your bug to make sure it's fixed before we release 0.6. Unfortunately I've been unable to reproduce it. We've made a lot of changes to our state file handling (related to the …). Any chance you'd be able to test using these binaries? linux_amd64.zip If you still see the …
Hi @schmichael
There is no "failed to restore state for alloc" in the log file.
Plus some logging improvements that may help with #2563
Thanks for testing. I'm having a hard time reproducing this. Could you provide more details on how to reproduce it? It appears you're running (at least) a 3-node cluster with the node in question running as a client+server. While that's not a recommended configuration, it shouldn't cause bugs. I'm also not sure what exactly happened in that last log file you posted, as there's no long delay like in your initial report. I found and fixed a small race condition in somewhat related code, but I don't think it could cause the original behavior you described. I'm planning to do a larger refactor of the alloc blocking logic, but it's unlikely to make it into 0.6. Here's a build from the #2780 branch if you want to test:
@schmichael I have tried the last binary you provided, but was able to reproduce the same problem with it. I spent my weekend understanding what is going on, and I think I found the bug. Looking forward to your feedback on the proposed solution.
#2563 fixed pending state for allocations with terminal status
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
0.5.4
Operating system and Environment details
5 windows boxes = 3 servers + 2 clients
Issue
After a nomad client restart, some of the scheduled jobs on that client were in the pending state for more than two hours, without any attempt to start the jobs. This is a severe issue for us and we would like to have it fixed ASAP.
Reproduction steps
I don't have exact steps. We know that this sometimes occurs after a client restart, and we have definitely faced this problem more than once.
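When the problem does occur, the stuck state can be inspected with the standard Nomad CLI and HTTP API. A sketch, assuming a local agent on the default port and using the job and allocation IDs from this report:

```shell
# Show the job's allocations and their client status (pending/running/...)
nomad status ngprunner.Node001

# Inspect the stuck allocation in detail (task states and recent events)
nomad alloc-status 2748409b

# Trigger a server-side garbage collection; per this report, the blocked
# allocation only started once the previous allocation was GC'd
curl -X PUT http://127.0.0.1:4646/v1/system/gc
```

Forcing GC this way can help distinguish "blocked until previous alloc is collected" from other causes of a pending allocation.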
Nomad Client logs (if appropriate)
The full nomad client logs are here:
https://gist.github.com/capone212/94f47912f1f9e700195d9b45e846b7e3
One of the problem jobs is "ngprunner.Node001".
While the problem was reproducing, I made the following observations:
You can see that allocation 2748409b is in the pending state, but there was no attempt to start or restart that allocation on the host machine.
This is an extract from nomad.log about the job:
Allocation 2748409b was in the "blocked queue", and after more than two hours the allocation suddenly started.