Allocation never starting ([DEBUG] client: added alloc "<id>" to blocked queue for previous allocation "<id>") #2711
Comments
Hey, sorry you ran into this, and thanks for the great write-up! It should be fixed by 3018ae0, which will be part of 0.6.0. In the meantime, you could rebase that commit onto 0.5.6 and build it yourself!
Hello, I tested commit 3018ae0 cherry-picked on top of version 0.5.6, and it seems to fix my problem. I had an automated test case that failed previously and now passes. There may still be edge cases where long delays can happen; I don't know. Are you sure you used the correct nomad executable, with the patch included?
By the way, I did not bother restarting the Nomad servers with the patched version. Running the patched version on the clients alone is enough.
@mildred thanks for the reply.
I have deployed the Nomad version with the fix, and it seems to work better, but I still get reports of blocked allocations under the same circumstances. I need to investigate a bit more and generate debug logs.
New test: I have two allocations for the same job on the same node, and one is probably blocking the other. The logs say:
The blocking allocation is still pending destruction. Allocation status:
node-status shows:
The nomad client I am running is available at: https://s3-eu-west-1.amazonaws.com/sqsc-release/nomad_v0.5.6%2B_linux_amd64.zip (it is version 0.5.6 patched with the commit mentioned above). I verified with a digest that it is the version I compiled manually. Full client logs in debug: https://gist.github.com/mildred/e625a8ed25cae4eae4a76ab112751ed4
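One rough way to spot this situation ("two allocations for the same job on the same node, one blocking the other") from allocation data is sketched below in Go. It is a hypothetical diagnostic, not a Nomad tool: the `alloc` struct is a simplified assumption whose field names (`ClientStatus`, `DesiredStatus`, `PreviousAllocation`) are illustrative rather than a verified API schema, and `stuckPairs` is an invented helper. It flags a pending allocation whose previous allocation is on the same node, is marked to stop or evict, and is still reported by the client as pending.

```go
package main

import "fmt"

// alloc is a hypothetical, trimmed-down view of an allocation.
// Field names are illustrative assumptions, not a verified API schema.
type alloc struct {
	ID                 string
	NodeID             string
	ClientStatus       string // e.g. "pending", "running", "complete"
	DesiredStatus      string // e.g. "run", "stop", "evict"
	PreviousAllocation string // ID of the alloc this one replaces, if any
}

// stuckPairs returns {waiting, blocking} ID pairs: a pending alloc that
// replaces an alloc on the same node which is meant to be stopped but
// which the client still reports as pending.
func stuckPairs(allocs []alloc) [][2]string {
	byID := make(map[string]alloc, len(allocs))
	for _, a := range allocs {
		byID[a.ID] = a
	}
	var out [][2]string
	for _, a := range allocs {
		if a.ClientStatus != "pending" || a.PreviousAllocation == "" {
			continue
		}
		prev, ok := byID[a.PreviousAllocation]
		if !ok || prev.NodeID != a.NodeID {
			continue // predecessor unknown or on another node
		}
		wantsStop := prev.DesiredStatus == "stop" || prev.DesiredStatus == "evict"
		if wantsStop && prev.ClientStatus == "pending" {
			out = append(out, [2]string{a.ID, prev.ID})
		}
	}
	return out
}

func main() {
	// Hypothetical sample data mirroring the situation described above.
	allocs := []alloc{
		{ID: "old", NodeID: "n1", ClientStatus: "pending", DesiredStatus: "stop"},
		{ID: "new", NodeID: "n1", ClientStatus: "pending", DesiredStatus: "run", PreviousAllocation: "old"},
	}
	for _, p := range stuckPairs(allocs) {
		fmt.Printf("%s is waiting on %s\n", p[0], p[1])
	}
}
```

The check is deliberately narrow: it only looks one step back along `PreviousAllocation`, so chains of several stuck allocations would show up as separate pairs.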
@dadgar could you please look into this? It seems the fix commit does not solve the issue. If that is the case, could the issue be reopened? Thank you.
Hi @mildred, please note the latest messages in my issue #2563: @schmichael provided binaries for testing, and I am going to try to reproduce my issue with them. Our issues seem to be connected, so it would be great if you could try to reproduce your issue with those binaries too.
OK, I'll run the tests with that binary.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.5.6
Operating system and Environment details
Agent running on CoreOS, using the official binary. Not running in a container.
Issue
After updating a job specification, the generated allocation stays pending and never starts.
Task is pending:
Allocation not recognized by the client:
Client logs are saying:
The alloc is blocked by a previous allocation. Looking at that previous allocation, it also appears pending, even though it is no longer needed because of the job update:
For the blocking allocation, the client logs show:
There is a WARN, as well as the message: alloc "23b1cb99-4ae9-540a-258e-dc246ecc494b" in terminal status, waiting for destroy. This alloc never gets destroyed.
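To make this failure mode concrete, here is a minimal Go sketch of a blocked queue in which an allocation waits until its previous allocation has been destroyed. It is a hypothetical model for illustration only: `blockedQueue`, `Add`, `Destroyed`, and `Blocked` are invented names, not Nomad's client code. It shows why, if the destroy step for the previous alloc never happens, the new alloc stays pending forever.

```go
package main

import "fmt"

// blockedQueue is a hypothetical, simplified model of a client-side queue
// where an allocation waits until its previous allocation is destroyed.
// It is NOT Nomad's actual implementation.
type blockedQueue struct {
	destroyed map[string]bool     // previous alloc ID -> has been destroyed
	waiting   map[string][]string // previous alloc ID -> allocs waiting on it
}

func newBlockedQueue() *blockedQueue {
	return &blockedQueue{
		destroyed: make(map[string]bool),
		waiting:   make(map[string][]string),
	}
}

// Add returns true if alloc may start now, or false if it was queued
// behind prev because prev has not been destroyed yet.
func (q *blockedQueue) Add(alloc, prev string) bool {
	if prev == "" || q.destroyed[prev] {
		return true
	}
	q.waiting[prev] = append(q.waiting[prev], alloc)
	return false
}

// Destroyed marks prev as destroyed and returns the allocs it was blocking.
func (q *blockedQueue) Destroyed(prev string) []string {
	q.destroyed[prev] = true
	unblocked := q.waiting[prev]
	delete(q.waiting, prev)
	return unblocked
}

// Blocked returns the allocs still waiting on prev.
func (q *blockedQueue) Blocked(prev string) []string {
	return q.waiting[prev]
}

func main() {
	q := newBlockedQueue()
	fmt.Println("started immediately:", q.Add("new-alloc", "old-alloc"))
	// Until Destroyed("old-alloc") is called, new-alloc stays queued.
	fmt.Println("still blocked:", q.Blocked("old-alloc"))
	fmt.Println("unblocked after destroy:", q.Destroyed("old-alloc"))
}
```

In this model, nothing times out: a predecessor stuck in "terminal status, waiting for destroy" keeps its successor in the queue indefinitely, which matches the pending-forever symptom reported here.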
Client node status shows this allocation, as well as previous allocations that are still pending:
Detail about these allocations:
Reproduction steps
The environment is complex and not everything is open yet, but I am able to reproduce it at any time.
It seems related to bad timing during job startup, for example a job being resubmitted before the previous version of the job could complete.
Basically, here is my scenario:
Nomad Client / Server logs / Job file
Please note that client logs are in UTC and server logs in UTC+2
See gist: https://gist.github.com/mildred/8d2af00117d0379fe2fe9ce18a365c08