Alloc.GetAlloc fails on rebooted client #10200
Comments
This log entry is interesting:
That comes from
Are these allocations replacements for system jobs on this client, or are they newly-scheduled jobs that are being placed but not starting successfully? Are there logs on the client that give any indication what's happening on the client, or task events for those tasks? |
@tgross The logs on the client essentially showed allocs being GC'd successfully, then ended up producing the above log every few minutes. I can provide more comprehensive logs via email if needed. We essentially have one large job running on this server with 20 or so tasks. After the reboot, it seemed like the tasks resumed without being lost; however, the "UUID must be 36 characters" line was triggering our error detection, so we decided to try restarting the job to fix that state. It was at that point that the allocations were stuck in a "pending" state with no task events showing in the UI. We also tried purging and starting the job, to no avail; the same issue occurred. To resolve this, we ended up removing the allocations manually from /var/nomad/alloc and then starting the job again. |
That's interesting... the client doesn't use the presence of those directories to determine if it needs to update the server with the status. The client has its own local state store that it syncs to disk. I was thinking that maybe that state store was somehow corrupt but that wouldn't explain why removing the allocation directories would help at all. It would definitely be useful to get the logs from both the client and the servers around the time of the reboot. That might give more clues as to what happened here. |
Thanks Tim, will try to gather the logs from the relevant time period and post them here. One thing we've been experimenting with that might be slightly different than most deployments is that we have Nomad clients with IPs that change frequently, and we're running local Consul servers on the same nodes as some of the Nomad clients due to limited hardware availability. This means that when the nodes rebooted, some of the Consul servers would've temporarily lost quorum + the IP addresses of the Nomad clients (and Consul service registration IDs) could've changed after reboot. Wondering if these factors could potentially contribute to corrupted local state? Will upload the logs here once I get a chance to gather them. |
We have had this issue a few times, most recently today. We're on Nomad 1.0.4 and had a job with a single group and a single task cause this. It has a template pulling values from Vault (1.7.1) as well. The job file submitted was identical to the previously running job file, other than a meta tag we update with the submission time. FWIW, the job was also allocated on a different host than the one that previously ran it. |
@jlambertjc wrote:
Hi @jlambertjc! This doesn't look like the same issue at all, other than that it's happening on the same RPC endpoint. It looks to me like you've got an ACL problem here. Can you please open a new issue? |
@sgtang wrote:
Ah... the server keeps a mapping of Node IDs to IP addresses. If you're changing the IP addresses of the clients, this is likely to cause some issues with client-to-server communication. We might be able to provide some guidance here if we had a better understanding of why the IPs of the clients need to change. |
@tgross Thanks for that info re: the Node ID to IP address mapping. Our Nomad clients occasionally change IP addresses because of our VPN redundancy setup: our nodes can connect to any one of 3 VPN gateways, and the client IP address might change depending on the subnet of the gateway it's connected to. We essentially have a hook that restarts Nomad every time the address on the tunnel interface changes. Wondering if you have experience with similar setups over VPN? |
The clients are remote from the servers in this case? We've seen folks do that, but I can't say that I've seen quite the setup you've got with the changing gateway subnets. Typically when I've seen that, the VPN gateways share a pool of IPs. What does the internal topology look like for reaching the servers? Or in other words, when clients connect through the VPN, do they hit a specific server depending on which gateway subnet they're on? Is there a load balancer with a floating server IP? An A-record that clients are using to reach the address? I think with enough information we can build a little test environment that exercises the behavior to make sure we have a good understanding of what's happening. In the meantime, have you tuned the `heartbeat_grace` configuration on the servers to account for the client restarts? |
The clients are remote from the servers, and we have multiple clients in various physical locations. We use Consul as a backend, so no A-record/load balancer for the servers, just discovery through service registration, and the clients can talk to all of our cloud Nomad servers in our single DC, regardless of the VPN gateway they connect to. The VPN gateways are in the same cloud subnet as the servers, each managing a different set of IPs that they assign to clients. The issue is that the Nomad clients bind to the IPs assigned by the VPNs, and the subnet for the IPs differs per gateway to avoid routing conflicts. We are using the HA/failover config described in this article from OpenVPN: https://openvpn.net/community-resources/implementing-a-load-balancing-failover-configuration/ which essentially recommends separate servers with different virtual IP pools. If it's a strict requirement for the Nomad client IPs to stay the same, we can try looking at ways to keep static client IPs across the gateways. We have tuned heartbeat_grace to account for the restarts. Thanks for the suggestion + the help |
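To make the Node ID to IP address mapping mentioned above more concrete, here is a toy Go sketch of the general idea (an illustration under assumptions, not Nomad's actual implementation): the server remembers the last address each client registered with, so after a client reboots onto a different VPN subnet, server-to-client RPCs keep going to the stale address until the client re-registers with its new one.

```go
// Toy illustration (not Nomad's code) of a server-side node-ID-to-address
// table going stale when a client's VPN-assigned IP changes.
package main

import "fmt"

type nodeTable struct {
	addrs map[string]string // node ID -> last known client RPC address
}

func (t *nodeTable) register(nodeID, addr string) { t.addrs[nodeID] = addr }

// lookup returns the address the server would dial to reach the node.
func (t *nodeTable) lookup(nodeID string) (string, bool) {
	addr, ok := t.addrs[nodeID]
	return addr, ok
}

func main() {
	t := &nodeTable{addrs: map[string]string{}}

	// Client registers while connected through VPN gateway A.
	t.register("node-123", "10.8.0.15:4647")

	// Client reboots and reconnects through gateway B with a new IP, but has
	// not re-registered yet: the server still holds the stale address.
	addr, _ := t.lookup("node-123")
	fmt.Println("server would dial:", addr) // 10.8.0.15:4647 -- unreachable

	// Once the client re-registers with its new address, the mapping is repaired.
	t.register("node-123", "10.9.0.22:4647")
	addr, _ = t.lookup("node-123")
	fmt.Println("server would dial:", addr) // 10.9.0.22:4647
}
```

In this model, anything that delays the re-registration extends the window during which the server dials an unreachable address, which is why frequent client restarts and IP changes make the problem more visible.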
Fixes #10200

**The bug**

A user reported receiving the following error when an alloc was placed that needed to preempt existing allocs:

```
[ERROR] client.alloc_watcher: error querying previous alloc: alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup failed: index error: UUID must be 36 characters"
```

The previous alloc (8e) was already complete on the client. This is possible if an alloc stops *after* the scheduling decision was made to preempt it, but *before* the node running both allocations was able to pull and start the preemptor. While that is hopefully a narrow window of time, you can expect it to occur in high-throughput, batch-scheduling-heavy systems.

However, the RPC error made no sense: `previous_alloc` in the logs was a valid 36-character UUID!

**The fix**

The fix is:

```diff
- prevAllocID: c.Alloc.PreviousAllocation,
+ prevAllocID: watchedAllocID,
```

The alloc watcher constructor used for preemption improperly referenced `Alloc.PreviousAllocation` instead of the passed-in `watchedAllocID`. When multiple allocs are preempted, a watcher is created for each, with `watchedAllocID` set properly by the caller. In this case `Alloc.PreviousAllocation` was "", which is where the `UUID must be 36 characters` error was coming from. Sadly, the log line did properly reference `watchedAllocID`, which made the error all the more baffling.

**The repro**

I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl) and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl) for ease of repro. First I started a [low-priority count=3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad), then a [high-priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad) that evicts 2 of the low-priority allocs. Everything worked as expected. However, if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147), it reproduces the bug because the watcher references `PreviousAllocation` instead of `watchedAllocID`.
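To make the one-line diff above concrete, here is a minimal, self-contained Go sketch of the bug pattern (the type and function names are illustrative, not Nomad's actual code): a constructor receives the alloc ID it should watch but reads `Alloc.PreviousAllocation` instead, which is empty for a preemption victim and therefore trips the server's 36-character UUID check.

```go
// Minimal sketch of the bug pattern; names and values are hypothetical.
package main

import "fmt"

type Alloc struct {
	ID                 string
	PreviousAllocation string // empty when the watched alloc is a preemption victim
}

type watcherConfig struct {
	Alloc *Alloc
}

type remotePrevAlloc struct {
	prevAllocID string
}

// Buggy variant: ignores watchedAllocID and uses the (empty) struct field,
// which later fails UUID validation on the server.
func newRemotePrevAllocBuggy(c watcherConfig, watchedAllocID string) *remotePrevAlloc {
	return &remotePrevAlloc{prevAllocID: c.Alloc.PreviousAllocation}
}

// Fixed variant: uses the ID the caller asked the watcher to track.
func newRemotePrevAllocFixed(c watcherConfig, watchedAllocID string) *remotePrevAlloc {
	return &remotePrevAlloc{prevAllocID: watchedAllocID}
}

func main() {
	cfg := watcherConfig{Alloc: &Alloc{ID: "28..."}}          // preemptor; PreviousAllocation is ""
	watched := "8e1c6b06-0000-0000-0000-000000000000"          // hypothetical 36-char UUID of the preempted alloc

	fmt.Printf("buggy: %q\n", newRemotePrevAllocBuggy(cfg, watched).prevAllocID) // ""
	fmt.Printf("fixed: %q\n", newRemotePrevAllocFixed(cfg, watched).prevAllocID) // the watched UUID
}
```

Running the sketch prints an empty string for the buggy variant and the watched UUID for the fixed one, mirroring why the RPC lookup was rejected even though the log line showed a valid UUID.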
I think #12779 will fix this issue. Even if it doesn't fix it entirely, it will at least fix the strange `UUID must be 36 characters` error. Please don't hesitate to reopen this issue (or open a new one) if the problem persists after 1.3.0-rc1, 1.2.7, and 1.1.13 are released. |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v1.0.1 (c9c68aa)
Operating system and Environment details
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
Issue
The Nomad client is not successfully garbage collecting old allocations after a node reboot; the client logs reference a UUID index error.
Reproduction steps
This occurred while rebooting a Nomad client without draining jobs first.
Other Nomad clients were rebooted as well; however, this is the only node that had issues, and it is the first time we've seen these logs.
Expected Result
Nomad client reboots, old allocations are gc'd, new allocations are rescheduled successfully.
Actual Result
Old allocations are not completely gc'd, new allocations on this node are stuck on pending.
Nomad Client logs (if appropriate)