Alloc.GetAlloc fails on rebooted client #10200
Comments
This log entry is interesting:
That comes from
Are these allocations replacements for system jobs on this client, or are they newly-scheduled jobs that are being placed but not starting successfully? Are there logs on the client that give any indication what's happening on the client, or task events for those tasks? |
@tgross The logs on the client essentially showed allocs being GC'd successfully, then ended up producing the above log every few minutes. I can provide more comprehensive logs via email if needed. We essentially have one large job running on this server with 20 or so tasks. After the reboot, it seemed like the tasks resumed without being lost; however, the "UUID must be 36 characters" line was triggering our error detection, so we decided to try restarting the job to fix that state. It was at that point that the allocations were stuck in a "pending" state with no task events showing in the UI. We also tried purging and starting the job, to no avail; the same issue occurred. To resolve this, we ended up removing the allocations manually from /var/nomad/alloc and then starting the job again. |
That's interesting... the client doesn't use the presence of those directories to determine if it needs to update the server with the status. The client has its own local state store that it syncs to disk. I was thinking that maybe that state store was somehow corrupt but that wouldn't explain why removing the allocation directories would help at all. It would definitely be useful to get the logs from both the client and the servers around the time of the reboot. That might give more clues as to what happened here. |
Thanks Tim, will try to gather the logs from the relevant time period and post them here. One thing we've been experimenting with that might be slightly different than most deployments is that we have Nomad clients with IPs that change frequently, and we're running local Consul servers on the same nodes as some of the Nomad clients due to limited hardware availability. This means that when the nodes rebooted, some of the Consul servers would've temporarily lost quorum + the IP addresses of the Nomad clients (and Consul service registration IDs) could've changed after reboot. Wondering if these factors could potentially contribute to corrupted local state? Will upload the logs here once I get a chance to gather them. |
We have had this issue a few times, most recently today. We're on Nomad 1.0.4 and had a job with a single group and a single task cause this. It has a template pulling values from Vault (1.7.1) as well. The job file submitted was identical to the previously running job file, other than a meta tag we update with the submission time. FWIW, the job was also allocated on a different host than the one that previously ran it. |
@jlambertjc wrote:
Hi @jlambertjc! This doesn't look like the same issue at all, other than that it's happening on the same RPC endpoint. It looks to me like you've got an ACL problem here. Can you please open a new issue? |
@sgtang wrote:
Ah... the server keeps a mapping of Node IDs to IP addresses. If you're changing the IP addresses of the clients, this is likely to cause some issues with client-to-server communication. We might be able to provide some guidance here if we had a better understanding of why the IPs of the clients need to change. |
@tgross Thanks for that info re: the Node ID to IP address mapping. Our Nomad clients occasionally change IP addresses because of our VPN redundancy setup: our nodes can connect to any one of 3 VPN gateways, and the client IP address might change depending on the subnet of the gateway it's connected to. We essentially have a hook that restarts Nomad every time the address on the tunnel interface changes. Wondering if you have experience with similar setups over VPN? |
The clients are remote from the servers in this case? We've seen folks do that, but I can't say that I've seen quite the setup you've got with the changing gateway subnets. Typically when I've seen that, the VPN gateways share a pool of IPs. What does the internal topology look like for reaching the servers? Or in other words, when clients connect through the VPN, do they hit a specific server depending on which gateway subnet they're on? Is there a load balancer with a floating server IP? An A-record that clients are using to reach the address? I think with enough information we can build a little test environment that exercises the behavior to make sure we have a good understanding of what's happening. In the meantime, have you tuned the `heartbeat_grace` configuration on the servers to account for the client restarts? |
The clients are remote from the servers, and we have multiple clients in various physical locations. We use Consul as a backend, so no A-record/load balancer for the servers, just discovery through service registration, and the clients can talk to all of our cloud Nomad servers in our single DC, regardless of the VPN gateway they connect to. The VPN gateways are in the same cloud subnet as the servers, each managing a different set of IPs that they assign to clients. The issue is that the Nomad clients bind to the IPs assigned by the VPNs, and the subnet for the IPs differs per gateway to avoid routing conflicts. We are using the HA/failover config described in this article from OpenVPN: https://openvpn.net/community-resources/implementing-a-load-balancing-failover-configuration/ which essentially recommends separate servers with different virtual IP pools. If it's a strict requirement for the Nomad client IPs to stay the same, we can try looking at ways to keep static client IPs across the gateways. We have tuned heartbeat_grace to account for the restarts. Thanks for the suggestion + the help |
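To make the Node ID to IP address mapping mentioned above more concrete, here is a toy Go sketch of the general idea (an illustration under assumptions, not Nomad's actual implementation): the server remembers the last address each client registered with, so after a client reboots onto a different VPN subnet, server-to-client RPCs keep going to the stale address until the client re-registers with its new one.

```go
// Toy illustration (not Nomad's code) of a server-side node-ID-to-address
// table going stale when a client's VPN-assigned IP changes.
package main

import "fmt"

type nodeTable struct {
	addrs map[string]string // node ID -> last known client RPC address
}

func (t *nodeTable) register(nodeID, addr string) { t.addrs[nodeID] = addr }

// lookup returns the address the server would dial to reach the node.
func (t *nodeTable) lookup(nodeID string) (string, bool) {
	addr, ok := t.addrs[nodeID]
	return addr, ok
}

func main() {
	t := &nodeTable{addrs: map[string]string{}}

	// Client registers while connected through VPN gateway A.
	t.register("node-123", "10.8.0.15:4647")

	// Client reboots and reconnects through gateway B with a new IP, but has
	// not re-registered yet: the server still holds the stale address.
	addr, _ := t.lookup("node-123")
	fmt.Println("server would dial:", addr) // 10.8.0.15:4647 -- unreachable

	// Once the client re-registers with its new address, the mapping is repaired.
	t.register("node-123", "10.9.0.22:4647")
	addr, _ = t.lookup("node-123")
	fmt.Println("server would dial:", addr) // 10.9.0.22:4647
}
```

In this model, anything that delays the re-registration extends the window during which the server dials an unreachable address, which is why frequent client restarts and IP changes make the problem more visible.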
Fixes #10200

**The bug**

A user reported receiving the following error when an alloc was placed that needed to preempt existing allocs:

```
[ERROR] client.alloc_watcher: error querying previous alloc: alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup failed: index error: UUID must be 36 characters"
```

The previous alloc (8e) was already complete on the client. This is possible if an alloc stops *after* the scheduling decision was made to preempt it, but *before* the node running both allocations was able to pull and start the preemptor. While that is hopefully a narrow window of time, you can expect it to occur in high-throughput, batch-scheduling-heavy systems.

However, the RPC error made no sense: `previous_alloc` in the logs was a valid 36-character UUID!

**The fix**

The fix is:

```diff
- prevAllocID: c.Alloc.PreviousAllocation,
+ prevAllocID: watchedAllocID,
```

The alloc watcher constructor used for preemption improperly referenced `Alloc.PreviousAllocation` instead of the passed-in `watchedAllocID`. When multiple allocs are preempted, a watcher is created for each, with `watchedAllocID` set properly by the caller. In this case `Alloc.PreviousAllocation` was "", which is where the `UUID must be 36 characters` error was coming from. Sadly, the log line did properly reference `watchedAllocID`, which made the error all the more baffling.

**The repro**

I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl) and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl) for ease of repro. First I started a [low-priority count=3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad), then a [high-priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad) that evicts 2 of the low-priority allocs. Everything worked as expected. However, if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147), it reproduces the bug because the watcher references `PreviousAllocation` instead of `watchedAllocID`.
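To make the one-line diff above concrete, here is a minimal, self-contained Go sketch of the bug pattern (the type and function names are illustrative, not Nomad's actual code): a constructor receives the alloc ID it should watch but reads `Alloc.PreviousAllocation` instead, which is empty for a preemption victim and therefore trips the server's 36-character UUID check.

```go
// Minimal sketch of the bug pattern; names and values are hypothetical.
package main

import "fmt"

type Alloc struct {
	ID                 string
	PreviousAllocation string // empty when the watched alloc is a preemption victim
}

type watcherConfig struct {
	Alloc *Alloc
}

type remotePrevAlloc struct {
	prevAllocID string
}

// Buggy variant: ignores watchedAllocID and uses the (empty) struct field,
// which later fails UUID validation on the server.
func newRemotePrevAllocBuggy(c watcherConfig, watchedAllocID string) *remotePrevAlloc {
	return &remotePrevAlloc{prevAllocID: c.Alloc.PreviousAllocation}
}

// Fixed variant: uses the ID the caller asked the watcher to track.
func newRemotePrevAllocFixed(c watcherConfig, watchedAllocID string) *remotePrevAlloc {
	return &remotePrevAlloc{prevAllocID: watchedAllocID}
}

func main() {
	cfg := watcherConfig{Alloc: &Alloc{ID: "28..."}}          // preemptor; PreviousAllocation is ""
	watched := "8e1c6b06-0000-0000-0000-000000000000"          // hypothetical 36-char UUID of the preempted alloc

	fmt.Printf("buggy: %q\n", newRemotePrevAllocBuggy(cfg, watched).prevAllocID) // ""
	fmt.Printf("fixed: %q\n", newRemotePrevAllocFixed(cfg, watched).prevAllocID) // the watched UUID
}
```

Running the sketch prints an empty string for the buggy variant and the watched UUID for the fixed one, mirroring why the RPC lookup was rejected even though the log line showed a valid UUID.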
I think #12779 will fix this issue. Even if it doesn't fix it entirely, it will at least fix the strange `UUID must be 36 characters` error. Please don't hesitate to reopen this issue (or open a new one) if the problem persists after 1.3.0-rc1, 1.2.7, and 1.1.13 are released. |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v1.0.1 (c9c68aa)
Operating system and Environment details
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
Issue
The Nomad client is not successfully garbage collecting old allocations after a node reboot; the client logs reference a UUID index error.
Reproduction steps
This occurred while rebooting a Nomad client without draining jobs first.
Other Nomad clients were rebooted as well; however, this is the only node that had issues, and it is the first time we've seen these logs.
Expected Result
Nomad client reboots, old allocations are gc'd, new allocations are rescheduled successfully.
Actual Result
Old allocations are not completely gc'd, new allocations on this node are stuck on pending.
Nomad Client logs (if appropriate)