Nomad tries to call RecoverTask on an allocation it has GC'd minutes ago; restarts tasks it shouldn't #10901
Comments
I'm afraid I haven't been able to reproduce this issue or discern the triggering condition so far. Can you try to reproduce it again and send a tarball of the data dirs both when the client/server is stopped the first time and after the alloc gets restarted? I'll continue inspecting the code for clues as well.
Here's a tar file of a repro. I don't know if this is the full repro of RecoverTask being called, but the shortest way to reproduce something that I think is broken with that tar is:
This uses the latest Nomad release (also packaged in the tar).
Thank you @benbuzbee for the details. I got to reproduce it with a Linux single server/client cluster, and I plan to have a fix soon! The problem is a variant of #5911 that disproportionately affects single server/client clusters; I'd be curious to know if it happened in production for you.

The issue is that the client pulled an outdated/stale state view while the server was applying raft logs. The client fetches its allocations with AllowStale=true, accepting the ever-present delay in propagating info and assuming that the index increases monotonically. However, if the server is just starting up and replaying raft logs, the client may see an outdated view and start tasks that should not start. That also explains why the client "forgot" that it had killed the task.

In the latest repro logs, I noticed the following relevant log lines. Note particularly the leader election event and the client's "updated allocations" log lines and their indexes.
From the logs in my original post we at least saw RecoverTask being called for a previously Destroy'd task, and one that has restarts set to 0 to boot. I'm not very certain it's the exact same root cause, but I'm happy with the plan of merging your fix for this and then re-evaluating whether I still see it :) Appreciate the quick and deep look! Excited to get these new improvements into our cluster!
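To illustrate the stale-read mechanism described above: the sketch below is not Nomad's client code, just a minimal Go example (with hypothetical stand-in types) of how a consumer doing stale reads can remember the highest index it has applied and ignore responses whose index regresses, which is the kind of backwards view a server still replaying raft logs can produce.

```go
package main

import "log"

// AllocResponse is a hypothetical stand-in for a stale-read response that
// carries the raft index it was generated at.
type AllocResponse struct {
	Index  uint64
	Allocs []string
}

// StaleReader remembers the newest index it has applied so an outdated
// response cannot roll its view of the allocations backwards.
type StaleReader struct {
	lastIndex uint64
}

// Apply returns true if the response was applied, false if it was ignored
// because its index is older than one we already processed.
func (r *StaleReader) Apply(resp AllocResponse) bool {
	if resp.Index < r.lastIndex {
		log.Printf("ignoring stale response: index %d < last seen %d", resp.Index, r.lastIndex)
		return false
	}
	r.lastIndex = resp.Index
	// ...act on resp.Allocs here...
	return true
}

func main() {
	r := &StaleReader{}
	r.Apply(AllocResponse{Index: 100, Allocs: []string{"alloc-a"}}) // applied
	r.Apply(AllocResponse{Index: 42, Allocs: []string{"alloc-a"}})  // ignored: older view
}
```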
This RecoverTask invocation is not surprising to me. The client keeps local state info for an alloc until the server has GC'd it. RecoverTask aims to load local state from the data dir and reattach to any existing running processes; and while arguably we should handle dead tasks a bit differently, we haven't seen the need for that optimization/special-casing yet. Let us know how your experience goes!
OK, so I understand from that that it's expected RecoverTask gets called and should just return an error since the task is dead. That's not necessarily a problem, but it makes it hard for our custom driver to tell the difference between a real error in RecoverTask and an inevitable error because Nomad tried to Recover something it already asked us to Destroy. I'm not sure of the best way to filter that noise out of error monitoring.
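On filtering that noise: the following is a minimal, hedged sketch (simplified types, not the real Nomad plugins/drivers interface) of one way a custom driver could log a recover request for an already-destroyed task as a benign event while still returning an error to Nomad.

```go
package main

import (
	"fmt"
	"log"
)

// taskHandle is a simplified stand-in for the handle a driver persists per
// task; the real driver plugin API passes a richer handle type.
type taskHandle struct {
	TaskID string
}

// driver is a toy custom driver keeping in-memory state for tasks it still
// knows about.
type driver struct {
	tasks map[string]*taskHandle
}

// RecoverTask reattaches if the driver still has state for the task, and
// treats an unknown (already destroyed) task as expected noise: it logs at a
// low severity and returns an error Nomad can handle, rather than something
// that should page anyone.
func (d *driver) RecoverTask(h *taskHandle) error {
	if _, ok := d.tasks[h.TaskID]; !ok {
		log.Printf("recover requested for unknown task %s; likely already destroyed", h.TaskID)
		return fmt.Errorf("task %s not found", h.TaskID)
	}
	// ...reattach to the running process here...
	return nil
}

func main() {
	d := &driver{tasks: map[string]*taskHandle{}}
	_ = d.RecoverTask(&taskHandle{TaskID: "190deedc"}) // unknown task: logged as noise, not alerted
}
```

The design choice here is simply to key the alerting decision on whether the driver still has local state for the task, so monitoring can treat the expected "not found" case differently from a genuine recovery failure.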
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
1.0.4
Issue
We have a custom task driver and noticed the following behavior from our logs. All of these logs are annotated with alloc_id=190deedc-2a2e-9687-f70c-d8fbfead9da2. Our logs are marked with a + and Nomad's own logs are marked with a ~.

As you can see, Nomad stops the task at 20:29:25 and marks it for GC, but still tries to recover it when the agent starts up at 20:45:35, and marks it for GC again.
It seems like strange behavior that it tries to recover something it has stopped and destroyed.
I also think this problem means a job with restart/reschedule set to 0 gets restarted anyway; see the repro for an example.
Reproduction steps
Run nomad job init, but set restarts = 0 and reschedules { attempts = 0, unlimited = false }.

Note from the logs:
The job is running even though it failed already and has 0/0 set as its policies:
Apparently it has forgotten about the times it was killed.
It force-GC'd the old one for some reason, then restarted it?
You can't see this from Nomad's logs alone, but I can see from our custom driver logs that before all of this the second Nomad instance called RecoverTask.
Expected Result
Actual Result
Attachments
first-client-run.log
second-client-run.log
example.nomad.txt