Proposal

An option should be added so that “lost” Nomad clients that reconnect to the server cluster do not have their allocations restarted.
Use-cases
This helps Nomad deployments in high-latency environments with clients geographically distant from the server cluster. Nomad clients running over LTE connections (on IoT devices, for instance) might regularly lose connectivity for minutes at a time. In these cases, when the Nomad client reconnects to the server cluster, ideally everything resumes functioning as normal.
Proposal Details
Currently, if a client fails to heartbeat to the server cluster within the heartbeat_grace period and stop_after_client_disconnect is not set, the allocations on that client continue running (even though the servers mark them lost).
Under some conditions, a replacement allocation will be scheduled on a new client node. If a client node reconnects and a replacement allocation is already running elsewhere (so the total number of running allocations exceeds the expected count), ideally the allocation on the node with the lower affinity/rank score would be stopped. In the case of equal scores, I think it makes sense for the original to keep running, but random selection would also be fine.
Under some conditions, a replacement allocation is not scheduled (for example, if no other node matches the constraints). In this case, the node would ideally just reconnect and the allocation would not be restarted. Currently, it does restart.
I don’t think this would require any new configuration, but if some users want to keep the restart behavior, then a “restart_on_client_reconnect” boolean could be added to the job config.
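To make the existing knobs concrete, here is a minimal sketch using the public github.com/hashicorp/nomad/api Go package. It registers a service job whose task group sets stop_after_client_disconnect (the StopAfterClientDisconnect field); the commented-out RestartOnClientReconnect line is purely hypothetical and only marks where the proposed option might live. The job ID, driver, and command are illustrative, and heartbeat_grace is server agent configuration not shown here.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Connect to the local Nomad agent (NOMAD_ADDR etc. are honoured).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatalf("failed to create Nomad client: %v", err)
	}

	// One task group running a single collector task.
	tg := api.NewTaskGroup("collectors", 1)

	// Existing knob: stop the group's allocations on the client if it stays
	// disconnected longer than this. Leaving the field nil keeps them running
	// locally for as long as the client process is alive.
	disconnectTimeout := 30 * time.Minute
	tg.StopAfterClientDisconnect = &disconnectTimeout

	// Hypothetical knob from this proposal; NOT a real field today:
	// tg.RestartOnClientReconnect = &restartOnReconnect

	task := api.NewTask("collector", "raw_exec")
	task.SetConfig("command", "/usr/local/bin/collector") // illustrative path
	tg.AddTask(task)

	job := api.NewServiceJob("edge-collector", "edge-collector", "global", 50)
	job.AddTaskGroup(tg)

	if _, _, err := client.Jobs().Register(job, nil); err != nil {
		log.Fatalf("failed to register job: %v", err)
	}
	log.Println("job registered")
}
```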
Changing the meaning of lost
Allowing lost allocations to transition back to running is a significant change that breaks backward compatibility. To be clear, I (@schmichael) think it's worth it, but it will take a considerable amount of testing and documentation to ensure a smooth transition.
Currently, Allocation.ClientStatus=lost is a terminal state, along with complete (intentionally stopped, or a batch job that completed successfully) and failed (as determined by the restart policy). Everywhere that calls Allocation.ClientTerminalStatus will need to be audited for correctness. Any project, such as the Autoscaler, that relies on differentiating terminal from non-terminal allocation statuses also needs a migration plan.
The allocation.ClientStatus field is used within the Nomad Autoscaler in a couple of places, none of which I believe will be adversely affected by this change. These places are the Nomad APM and the scaleutils node-selector, which utilise this field in order to (a sketch of the second use follows the list):
a) filter out allocation resources from utilisation totals;
b) filter nodes which have the least number of non-terminal allocations.
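Purely as an illustration of (b), and assuming the public Go API rather than the Autoscaler's actual scaleutils code, counting non-terminal allocations per node might look like the sketch below. Making lost non-terminal changes which nodes a filter like this considers empty.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

// nonTerminalAllocs counts allocations on a node whose client status is not
// terminal. Node-selection logic of the kind described above prefers to
// drain or terminate nodes with the fewest such allocations.
func nonTerminalAllocs(client *api.Client, nodeID string) (int, error) {
	allocs, _, err := client.Nodes().Allocations(nodeID, nil)
	if err != nil {
		return 0, err
	}

	count := 0
	for _, alloc := range allocs {
		switch alloc.ClientStatus {
		case api.AllocClientStatusComplete,
			api.AllocClientStatusFailed,
			api.AllocClientStatusLost: // today, lost counts as terminal
		default:
			count++
		}
	}
	return count, nil
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	nodes, _, err := client.Nodes().List(nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range nodes {
		c, err := nonTerminalAllocs(client, n.ID)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d non-terminal allocations\n", n.Name, c)
	}
}
```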
In the first situation, this new behaviour might cause some flapping during scale evaluations, which could easily be counteracted by configuration and policy settings. The second situation has the potential for nodes to be marked as empty and eligible for termination; however, I believe the filtering of the node pool based on node status would protect against this. At a higher level, I don't believe the environments described here fit a typical autoscaling setup.
In the situation described, where a node becomes lost for some minutes and the autoscaler is enabled, it may be prudent to add additional safety checks around the stability of the job group before attempting to scale. Currently, job groups that are mid-deployment are protected from scaling; we might want to extend this to also cover job groups where allocations are being started or stopped to replace lost/re-found allocations.
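As a sketch of such a safety check (hypothetical, not an existing Autoscaler feature), a policy handler could refuse to scale while a job still has pending or lost allocations. The helper name and job ID below are illustrative.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

// jobGroupSettled reports whether every allocation of the job is in a steady
// client state, i.e. nothing is pending (starting to replace a lost alloc)
// and nothing is currently lost (and might come back). A scaling decision
// could be deferred until this returns true.
func jobGroupSettled(client *api.Client, jobID string) (bool, error) {
	allocs, _, err := client.Jobs().Allocations(jobID, false, nil)
	if err != nil {
		return false, err
	}
	for _, alloc := range allocs {
		if alloc.ClientStatus == api.AllocClientStatusPending ||
			alloc.ClientStatus == api.AllocClientStatusLost {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	settled, err := jobGroupSettled(client, "edge-collector") // job ID from the earlier sketch
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("safe to scale:", settled)
}
```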
In a more general integration sense, where allocations are consumed and processed via blocking queries, periodic list calls, or the event stream, some mitigation may be needed depending on the application's internal behaviour, as detailed above. That being said, if all updates trigger the correct API responses (an event is emitted whenever the alloc status changes), then a well-behaved consumer should be able to deal with such changes correctly.
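For integrations that poll via blocking queries, a minimal consumer sketch against the public Go API looks like this (the handling logic is illustrative); the important part is that it re-reads the current ClientStatus on every wake-up rather than latching lost as final.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var waitIndex uint64
	for {
		// Blocking query: returns when the allocation list changes or the
		// server-side wait time elapses.
		allocs, meta, err := client.Allocations().List(&api.QueryOptions{
			WaitIndex: waitIndex,
		})
		if err != nil {
			log.Printf("list failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		waitIndex = meta.LastIndex

		for _, alloc := range allocs {
			// Re-evaluate the status on every pass; do not cache "lost"
			// as terminal if this proposal lands.
			log.Printf("alloc %s: client status %s", alloc.ID, alloc.ClientStatus)
		}
	}
}
```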
Reproduction Steps
When no replacement allocation is made: ideally, the original allocation simply continues to run after the client reconnects.
With a replacement allocation: ideally, the allocation on the node with the better rank continues to run.
I think case one is more important than case two, as there could be some corner cases I'm not thinking about in case two.
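For case two, one quick way to observe the over-count after a reconnect is to compare running allocations against the group's expected count. This is a diagnostic sketch against the public Go API, reusing the illustrative "edge-collector" job ID from the earlier examples.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	jobID := "edge-collector" // illustrative job ID

	job, _, err := client.Jobs().Info(jobID, nil)
	if err != nil {
		log.Fatal(err)
	}
	expected := 0
	for _, tg := range job.TaskGroups {
		if tg.Count != nil {
			expected += *tg.Count
		}
	}

	allocs, _, err := client.Jobs().Allocations(jobID, false, nil)
	if err != nil {
		log.Fatal(err)
	}
	running := 0
	for _, alloc := range allocs {
		if alloc.ClientStatus == api.AllocClientStatusRunning {
			running++
		}
	}

	// After a client reconnects with a replacement already placed elsewhere,
	// running can briefly exceed expected; ideally the lower-ranked copy is
	// the one that gets stopped.
	fmt.Printf("expected %d, running %d\n", expected, running)
}
```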