Hey @g0t4, thanks for bringing this up. We have some ideas to fix this and they will land in 0.4. The failed allocations contain debug information that we will be moving to the evaluation.
Nomad version
v0.3.1
Operating system and Environment details
CentOS 7, kernel 3.10.0-327.10.1.el7.x86_64
Docker version 1.10.3, build 20f81dd
Issue
When I have a batch job that has allocations that can't immediately be placed, those allocations are marked failed. Eventually new allocations are created. The queue works great in this case, but the allocation history becomes difficult to manage.
Often there are hundreds to thousands of failed allocations, each with the message "failed to find a node for placement". This makes status calls to the HTTP API slow to return, sometimes with responses as large as 8 MB. This isn't an issue for colocated queries, but for remote clients it can be a headache.
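Until the server side changes, a remote client can at least condense the response it receives. The sketch below collapses a JSON array of allocations into counts per status and per failure reason, so thousands of identical placement failures read as one line. It assumes the status and reason live under `DesiredStatus` and `DesiredDescription`; those default keys are an assumption, so check them against the JSON your Nomad version actually returns. Note that this only tidies the output: the full payload still crosses the network, which is the cost described above.

```python
import json
from collections import Counter


def summarize_allocations(payload, status_key="DesiredStatus",
                          reason_key="DesiredDescription"):
    """Collapse a JSON array of allocation objects into per-status counts.

    ASSUMPTION: each allocation object carries a status field and a
    human-readable reason field under the given keys. The defaults are a
    guess; verify them against your Nomad version's API output.
    """
    allocs = json.loads(payload)
    by_status = Counter(a.get(status_key, "unknown") for a in allocs)
    # Group failed allocations by reason, so many identical
    # "failed to find a node for placement" entries become one count.
    failure_reasons = Counter(
        a.get(reason_key, "")
        for a in allocs
        if a.get(status_key) == "failed"
    )
    return {"by_status": dict(by_status),
            "failure_reasons": dict(failure_reasons)}


if __name__ == "__main__":
    # Hypothetical sample: one placed allocation and three placement failures.
    sample = [{"ID": "a0", "DesiredStatus": "run", "DesiredDescription": ""}]
    sample += [
        {"ID": f"f{i}", "DesiredStatus": "failed",
         "DesiredDescription": "failed to find a node for placement"}
        for i in range(3)
    ]
    print(summarize_allocations(json.dumps(sample)))
```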
Would it make more sense to keep the allocations in a pending state (or perhaps a new pendingPlacement state)? Then, in the TaskStates, add an event entry that says "failed placement" along with the last time placement was attempted.
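To make the proposal concrete, here is a minimal sketch of the idea: one allocation stays pending, and repeated placement failures update a single "failed placement" event (last attempt time plus a counter) instead of producing a new failed allocation each time. `PendingAllocation`, `TaskEvent`, and `record_failed_placement` are hypothetical names for illustration, not Nomad's data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskEvent:
    kind: str
    time: float      # last time this event occurred
    count: int = 1   # how many times it has occurred


@dataclass
class PendingAllocation:
    """Hypothetical allocation that stays pending instead of failing."""
    name: str
    status: str = "pending"
    events: List[TaskEvent] = field(default_factory=list)

    def record_failed_placement(self, now: float) -> None:
        # Coalesce repeated failures into the most recent event rather than
        # creating a new failed allocation for every scheduling attempt.
        if self.events and self.events[-1].kind == "failed placement":
            self.events[-1].time = now
            self.events[-1].count += 1
        else:
            self.events.append(TaskEvent("failed placement", now))


if __name__ == "__main__":
    alloc = PendingAllocation("example.batch[0]")
    for attempt_time in (10.0, 20.0, 30.0):
        alloc.record_failed_placement(attempt_time)
    print(alloc.status, alloc.events)
```

With this shape, the history for an unplaceable allocation stays at one entry no matter how many times the scheduler retries, so status responses grow with the number of allocations rather than the number of attempts.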
FYI, I brought this up on the mailing list too: https://groups.google.com/forum/#!topic/nomad-tool/LcvMgHN_RPU
Reproduction steps
Create a batch job with enough tasks to saturate all nodes in a cluster, so that some of the remaining allocations can't be placed immediately.