Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Succesfully completed batch job is re-run with new allocation. #4299

Closed
nugend opened this issue May 15, 2018 · 49 comments
Closed

Succesfully completed batch job is re-run with new allocation. #4299

nugend opened this issue May 15, 2018 · 49 comments

Comments

@nugend
Copy link

nugend commented May 15, 2018

Nomad version

Nomad v0.8.3 (c85483d)

Operating system and Environment details

3.10.0-327.36.3.el7.x86_64

Issue

A batch job executed, completed successfully and then several hours later, when the allocation was garbage collected was re-run.

Reproduction steps

Not sure. Seems to be happening frequently on our cluster though.

Nomad logs

    2018/05/14 23:30:50.541581 [DEBUG] worker: dequeued evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa
    2018/05/14 23:30:50.541765 [DEBUG] sched: <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
    2018/05/14 23:30:50.546101 [DEBUG] worker: submitted plan at index 355103 for evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa
    2018/05/14 23:30:50.546140 [DEBUG] sched: <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: setting status to complete
    2018/05/14 23:30:50.547618 [DEBUG] worker: updated evaluation <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">
    2018/05/14 23:30:50.547683 [DEBUG] worker: ack for evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa

    2018/05/14 23:30:52.074437 [DEBUG] client: starting task runners for alloc '5d0016ac-dd71-6626-f929-6398e80ef28e'
    2018/05/14 23:30:52.074769 [DEBUG] client: starting task context for 'REDACTED-task' (alloc '5d0016ac-dd71-6626-f929-6398e80ef28e')
2018-05-14T23:30:52.085-0400 [DEBUG] plugin: starting plugin: path=REDACTED/bin/nomad args="[REDACTED/nomad executor {"LogFile":"REDACTED/alloc/5d0016ac-dd71-6626-f929-6398e80ef28e/REDACTED-task/executor.out","LogLevel":"DEBUG"}]"
    2018/05/14 23:34:32.288406 [INFO] client: task "REDACTED-task" for alloc "5d0016ac-dd71-6626-f929-6398e80ef28e" completed successfully
    2018/05/14 23:34:32.288438 [INFO] client: Not restarting task: REDACTED-task for alloc: 5d0016ac-dd71-6626-f929-6398e80ef28e
    2018/05/14 23:34:32.289213 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
    2018/05/15 01:39:13.888635 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to new allocations and over max (500)
    2018/05/15 01:39:15.389175 [WARN] client: failed to broadcast update to allocation "5d0016ac-dd71-6626-f929-6398e80ef28e"
    2018/05/15 01:39:15.389401 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
    2018/05/15 01:39:15.390656 [DEBUG] client: terminating runner for alloc '5d0016ac-dd71-6626-f929-6398e80ef28e'
    2018/05/15 01:39:15.390714 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
    2018/05/15 04:25:10.541590 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to new allocations and over max (500)
    2018/05/15 04:25:10.541626 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
    2018/05/15 05:46:37.119467 [DEBUG] worker: dequeued evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
    2018/05/15 05:46:37.139904 [DEBUG] sched: <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
    2018/05/15 05:46:37.169051 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
    2018/05/15 05:46:37.169149 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to forced collection
    2018/05/15 05:46:37.169194 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
    2018/05/15 05:46:37.177470 [DEBUG] worker: submitted plan at index 373181 for evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
    2018/05/15 05:46:37.177516 [DEBUG] sched: <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: setting status to complete
    2018/05/15 05:46:37.179391 [DEBUG] worker: updated evaluation <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">
    2018/05/15 05:46:37.179783 [DEBUG] worker: ack for evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
    2018/05/15 05:46:40.218701 [DEBUG] client: starting task runners for alloc '928b0562-b7ed-a3c7-d989-89519edadee9'
    2018/05/15 05:46:40.218982 [DEBUG] client: starting task context for 'REDACTED-task' (alloc '928b0562-b7ed-a3c7-d989-89519edadee9')
2018-05-15T05:46:40.230-0400 [DEBUG] plugin: starting plugin: path=REDACTED/bin/nomad args="[REDACTED/nomad executor {"LogFile":"REDACTED/alloc/928b0562-b7ed-a3c7-d989-89519edadee9/REDACTED-task/executor.out","LogLevel":"DEBUG"}]"
    2018/05/15 11:50:17.836313 [INFO] client: task "REDACTED-task" for alloc "928b0562-b7ed-a3c7-d989-89519edadee9" completed successfully
    2018/05/15 11:50:17.836336 [INFO] client: Not restarting task: REDACTED-task for alloc: 928b0562-b7ed-a3c7-d989-89519edadee9
    2018/05/15 11:50:17.836698 [INFO] client.gc: marking allocation 928b0562-b7ed-a3c7-d989-89519edadee9 for GC

Job file (if appropriate)

{
    "Job": {
        "AllAtOnce": false,
        "Constraints": [
            {
                "LTarget": "${node.unique.id}",
                "Operand": "=",
                "RTarget": "52c7e5be-a5a0-3a34-1051-5209a91a0197"
            }
        ],
        "CreateIndex": 393646,
        "Datacenters": [
            "dc1"
        ],
        "ID": "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2",
        "JobModifyIndex": 393646,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 393673,
        "Name": "REDACTED",
        "Namespace": "default",
        "ParameterizedJob": null,
        "ParentID": "REDACTED/dispatch-1526033570-3cdd72d9",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "global",
        "Reschedule": null,
        "Stable": false,
        "Status": "dead",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1526403442162340993,
        "TaskGroups": [
            {
                "Constraints": [
                    {
                        "LTarget": "${attr.os.signals}",
                        "Operand": "set_contains",
                        "RTarget": "SIGTERM"
                    }
                ],
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": null,
                "Name": "REDACTED",
                "ReschedulePolicy": {
                    "Attempts": 1,
                    "Delay": 5000000000,
                    "DelayFunction": "constant",
                    "Interval": 86400000000000,
                    "MaxDelay": 0,
                    "Unlimited": false
                },
                "RestartPolicy": {
                    "Attempts": 1,
                    "Delay": 15000000000,
                    "Interval": 86400000000000,
                    "Mode": "fail"
                },
                "Tasks": [
                    {
                        "Artifacts": null,
                        "Config": {
                            "command": "REDACTED",
                            "args": [REDACTED]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "raw_exec",
                        "Env": {REDACTED},
                        "KillSignal": "SIGTERM",
                        "KillTimeout": 5000000000,
                        "Leader": false,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "REDACTED",
                        "Resources": {
                            "CPU": 100,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 256,
                            "Networks": null
                        },
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null
                    }
                ],
                "Update": null
            }
        ],
        "Type": "batch",
        "Update": {
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 0,
            "MinHealthyTime": 0,
            "Stagger": 0
        },
        "VaultToken": "",
        "Version": 0
    }
}

What I can tell you for sure is that the allocation ran to completion and exited successfully.

We're going to try turning off the reschedule and restart policies to see if that has any effect since we're taking care of re-running these on any sort of job failure anyway.

@Lamboona
Copy link

We have also been having this behaviour since the rechedule policies were introduced in 0.8.
We use dispatch to schedule paramaterized jobs, but we also noticed the same behaviour, at some point towards the end of gc cycle it decides that the job never ran and runs it again with a new allocation.

We currently have a partial workaround, but would really appreciate a fix.

We run 3 servers tied into an asg of clients if that helps at all.

@chelseakomlo
Copy link
Contributor

Thanks for reporting this issue- we will investigate this further on our end.

@nugend
Copy link
Author

nugend commented May 17, 2018

No effect. We will be rolling back to a previous version of Nomad.

@Lamboona Could you let me know what your workaround was?

@Lamboona
Copy link

@nugend We run a lot of ephemeral jobs with small units of work, so we just discounted the work that was done out of order. It only partially works, but the extra work it causes is negligible enough that we can live with it, for now.

The benefit we get from having the reschedule policy in the firstplace far outweighs the cost of the extra work.

@dadgar
Copy link
Contributor

dadgar commented May 21, 2018

@nugend or @Lamboona Can you show the allocation output from the API of the original and retried allocations and the evaluations attached to the job

@nugend
Copy link
Author

nugend commented May 21, 2018

@dadgar At what point? The original allocation is garbage collected when the logs say its garbage collected. Do you mean I need to guess which allocation is going to experience this issue and capture it?

Also, we're not running this version right now. I can request that we deploy this to our QA environment again if this is vital to your debugging efforts, but I'm disinclined to return to the affected version.

@preetapan
Copy link
Contributor

preetapan commented May 22, 2018

@nugend - I've been looking at this and trying to reproduce. If you have debug logging turned on and the garbage collector collected this job, you should see log lines like following if the server tried to force garbage collect the completed allocation

2018/05/21 23:09:49.869355 [DEBUG] sched.core: eval GC: scanning before index 27 (5m0s)
    2018/05/21 23:09:49.869680 [DEBUG] worker: ack for evaluation ed3bc611-108d-d651-3ee5-5f747e9ca239
    2018/05/21 23:09:49.869941 [DEBUG] worker: dequeued evaluation eb86564b-eddb-38e2-7a38-27d7344632a2
    2018/05/21 23:09:49.870068 [DEBUG] sched.core: job GC: scanning before index 27 (5m0s)
    2018/05/21 23:09:49.870170 [DEBUG] sched.core: job GC: 1 jobs, 1 evaluations, 1 allocs eligible

You haven't included server logs, would you be able to attach server logs from around the time that you think it tried to restart?

Another thing I noticed from the job file pasted is that the ID contains dispatch- so it looks like a parameterized job. Could you confirm that? I tried to create and dispatch a parameterized job with nomad job dispatch, and after it garbage collected, I ran nomad job status on it and saw output like the following. It did not create any more additional allocations, until nomad job dispatch is run again.

$nomad job status dispatch_test
ID            = dispatch_test
Name          = dispatch_test
Submit Date   = 2018-05-21T22:51:56-05:00
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = true

Parameterized Job
Payload           = required
Required Metadata = test_key
Optional Metadata = <none>

Parameterized Job Summary
Pending  Running  Dead
0        0        2

No dispatched instances of parameterized job found

And we'd still like the info @dadgar requested - you don't have to guess which allocs will get restarted. You should be able to capture the alloc status output when you first dispatch the job (before the 24 hour window that it uses to garbage collect), and the output of nomad job status -evals REDACTED/dispatch-***. If you can do that before, and after you see it restart with another new allocation, that would help narrow down what's going on. In other words, the steps/API output to collect are:

  1. nomad job status -evals redacted-**** right after you first dispatch the job
  2. nomad eval status -json of any eval IDs returned in step 1
  3. nomad alloc status -json of any allocation IDs returned in step 2
  4. steps 1-3 after you see the problem described here where new allocations were created for the same job that had already completed.

Thanks for the help in narrowing this down, I realize this is a lot to ask but details on the above are greatly appreciated.

@nugend
Copy link
Author

nugend commented May 22, 2018

@preetapan We were running three nodes in a combined server/client mode. I may have failed to configure the logging correctly to produce logs from both server and client or otherwise, simply did not include those specific log lines. I'll check to see if we still have those particular log statements in this case.

The job itself was not parameterized, but its parent was. The parent was not re-executed at any point (it runs for a very long time since those jobs act as dynamic schedulers of sub-tasks which they monitor for work conditions).

I'll see if we can automate that collection. The problem is that these one-off batch jobs are created dynamically as the children of the aforementioned parameterized job (or rather, of the many parameterized jobs we've engineered to work in this way). That is why I'd have to speculatively collect them.

At any rate, it may be a bit before I can justify working on this.

@preetapan
Copy link
Contributor

preetapan commented May 22, 2018

The job itself was not parameterized, but its parent was.

Yes that is expected - each child job has to be invoked explicitly via nomad job dispatch though. When trying to repro, I only see one allocation per nomad job dispatch of the parent, and then it eventually garbage collects those.

Can you share the parent job's spec while we wait for more info about the job/allocations?

@preetapan
Copy link
Contributor

Sharing job spec of job I used to try to repro:

 $ cat parameterized.nomad
job "dispatch_test" {
  datacenters = ["dc1"]

  type = "batch"

  parameterized {
    payload       = "required"
    meta_required = ["test_key"]
  }

  group "t1" {


    task "t1" {
      driver = "raw_exec"

      config {
        command = "echo"

        # When dispatched, the payload is written to a file that is then read by
        # the created task upon startup
        args = ["world"]
      }

      dispatch_payload {
        file = "config.json"
      }
    }
  }
}
 $cat config.json
{
"hey":"ho"
}

$ nomad job dispatch -meta test_key=foo dispatch_test config.json

@nugend
Copy link
Author

nugend commented May 22, 2018

No, you are misunderstanding: The job which is rescheduled is not a parameterized job. It is a one-off batch job that has been customized to run a particular batch load. The parent of that one-off job is a dispatched instance of a parameter job. I manually specified the dispatch instance as parent in the creation of the one-off batch job.

@endocrimes
Copy link
Contributor

FWIW - We've seen some similar issues with one shot batch jobs being rescheduled during our 0.8.3 testing, I hadn't made the connection to GC yet, but I can take a look for that in the logs tomorrow.

@preetapan
Copy link
Contributor

@dantoml thanks, that would help. Having the output of v1/allocation/<id> of one of these batch jobs that were rescheduled would be useful as well.

@nugend I understand now, thanks for clarifying.

@endocrimes
Copy link
Contributor

@preetapan Sadly, we had to put our 0.8.3 testing on hold for a couple of weeks after running into this, but I'll see if I can get a repro case or do more digging early next week

@preetapan
Copy link
Contributor

@Lamboona will you be able to provide any debug info here - I need the output of v1/allocation/<id> of both the completed and the rescheduled allocation. Unfortunately, without that, it is not possible to piece together exactly what happened when I can't reproduce this issue with any test batch jobs I've written.

@dadgar
Copy link
Contributor

dadgar commented May 23, 2018

To add more to what Preetha has asked for. What would be ideal is the following:

  • /v1/allocation/<alloc_id> for both the original and the rescheduled one (from the logs shared it looks like the GC happened on the client so the server should still have the original)
  • /v1/job/<name>/evaluations
  • Server and client debug level logs

@nugend
Copy link
Author

nugend commented May 24, 2018

I don't know about the other users, but in our configurations, nomad is running in both client and server mode, so if there's a way to get the allocation information from the server and not the client, I would love to know how.

@dadgar
Copy link
Contributor

dadgar commented May 24, 2018

@nugend The endpoint I gave would work in that mode /v1/allocation/<alloc_id>

@Lamboona
Copy link

@preetapan we run the GC more frequently, so i will have to make some environment changes to actually be able to inspect both of the allocations at once, we only discovered them because of the effect that they were having. If i can get a chance tomorrow i will spin up an environment with debug and the normal gc settings, to try and get some of these stats.

@Lamboona
Copy link

Here are some debug logs that i have managed to scrape together this is the original allocation.
I have copied them out of kibana, so they may not be perfect, these are just the logs pertaining to the allocation, as there are too many surrounding it for me to put more in

May 25 12:10:14 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:10:14 [INFO] (runner) rendered "(dynamic)" => "/mnt/nomad/alloc/d5df6b11-b221-a600-3050-ddd238616736/tools-graph-insertion/local/tools.env"
May 25 12:10:53 ip-10-160-67-191 nomad[9620]: 	  2018-05-25T12:10:53.209Z [DEBUG] plugin: starting plugin: path=/usr/local/bin/nomad args="[/usr/local/bin/nomad executor {"LogFile":"/mnt/nomad/alloc/d5df6b11-b221-a600-3050-ddd238616736/tools-graph-insertion/executor.out","LogLevel":"INFO"}]"
May 25 12:11:33 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:33.218924 [INFO] client: task "tools-graph-insertion" for alloc "d5df6b11-b221-a600-3050-ddd238616736" completed successfully
May 25 12:11:33 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:33.218950 [INFO] client: Not restarting task: tools-graph-insertion for alloc: d5df6b11-b221-a600-3050-ddd238616736
May 25 12:11:33 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:33.237517 [INFO] client.gc: marking allocation d5df6b11-b221-a600-3050-ddd238616736 for GC
May 25 12:11:33 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:33.237517 [INFO] client.gc: marking allocation d5df6b11-b221-a600-3050-ddd238616736 for GC
May 25 12:11:34 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:34.553372 [INFO] client.gc: garbage collecting allocation d5df6b11-b221-a600-3050-ddd238616736 due to number of allocations (53) is over the limit (50)
May 25 12:11:36 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:36.065482 [WARN] client: failed to broadcast update to allocation "d5df6b11-b221-a600-3050-ddd238616736"
May 25 12:11:36 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:36.089629 [INFO] client.gc: marking allocation d5df6b11-b221-a600-3050-ddd238616736 for GC
May 25 12:11:37 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:37.634305 [INFO] client.gc: garbage collecting allocation d5df6b11-b221-a600-3050-ddd238616736 due to number of allocations (51) is over the limit (50)
May 25 12:15:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:15:09.806199 [INFO] client.gc: marking allocation d5df6b11-b221-a600-3050-ddd238616736 for GC
May 25 12:15:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:15:09.811470 [INFO] client.gc: garbage collecting allocation d5df6b11-b221-a600-3050-ddd238616736 due to forced collection
May 25 12:20:24 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:20:24.224329 [INFO] client.gc: marking allocation d5df6b11-b221-a600-3050-ddd238616736 for GC
May 25 12:20:25 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:20:25.101524 [INFO] client.gc: garbage collecting allocation d5df6b11-b221-a600-3050-ddd238616736 due to number of allocations (52) is over the limit (50)

The following allocation 0350af5e-4203-7976-f90f-cfddae3b573f was running again by 12.12.11

May 25 12:10:10 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:10:10 [INFO] (runner) rendered "(dynamic)" => "/mnt/nomad/alloc/0350af5e-4203-7976-f90f-cfddae3b573f/tools-graph-insertion/local/tools.env"
May 25 12:10:53 ip-10-160-67-191 nomad[9620]:     2018-05-25T12:10:53.211Z [DEBUG] plugin: starting plugin: path=/usr/local/bin/nomad args="[/usr/local/bin/nomad executor {"LogFile":"/mnt/nomad/alloc/0350af5e-4203-7976-f90f-cfddae3b573f/tools-graph-insertion/executor.out","LogLevel":"INFO"}]"
May 25 12:12:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:12:09.341873 [INFO] client: task "tools-graph-insertion" for alloc "0350af5e-4203-7976-f90f-cfddae3b573f" completed successfully
May 25 12:12:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:12:09.341901 [INFO] client: Not restarting task: tools-graph-insertion for alloc: 0350af5e-4203-7976-f90f-cfddae3b573f
May 25 12:12:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:12:09.361617 [INFO] client.gc: marking allocation 0350af5e-4203-7976-f90f-cfddae3b573f for GC
May 25 12:15:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:15:09.811342 [INFO] client.gc: garbage collecting allocation 0350af5e-4203-7976-f90f-cfddae3b573f due to forced collection
May 25 12:15:17 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:15:17.372589 [WARN] client: failed to broadcast update to allocation "0350af5e-4203-7976-f90f-cfddae3b573f"
May 25 12:20:23 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:20:23.110146 [INFO] client.gc: marking allocation 0350af5e-4203-7976-f90f-cfddae3b573f for GC
												  2018-05-25T12:20:23.116Z [DEBUG] plugin.nomad: plugin address: timestamp=2018-05-25T12:20:23.114Z address=/tmp/plugin814132855 network=unix
May 25 12:20:24 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:20:24.758808 [INFO] client.gc: garbage collecting allocation 0350af5e-4203-7976-f90f-cfddae3b573f due to number of allocations (52) is over the limit (50)
May 25 12:51:18 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:51:18.535147 [INFO] client.gc: marking allocation 0350af5e-4203-7976-f90f-cfddae3b573f for GC
May 25 12:51:18 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:51:18.540148 [INFO] client.gc: garbage collecting allocation 0350af5e-4203-7976-f90f-cfddae3b573f due to forced collection

I haven't been able to get any of the api responses for these out yet, having debugging turned on seems to cause the servers to bottleneck and then they dont re-elect a leader. I will try forcing this again but with debugging turned off

One thing i have noticed looking at these logs is that the runner log messages seem to have occured for both of the allocations before either of the tasks actually ran, for a little more information the job itself would have been blocked for some time before it was able to be allocated. They are posted to the cluster using /v1/job/:job_id/dispatch

TBC

@Lamboona
Copy link

@preetapan I have managed to get some of the information out of the API.

It would seem that our problem doesn't actually stem from the garbage collection but actually happens far earlier in the process, as the batch scheduler seems to be scheduling a single dispatched job multiple times. I can attach the alloc/eval output if you want or should i open a seperate issue? As i feel anything else i add might detract from the issue that others are having.

@preetapan
Copy link
Contributor

@Lamboona noted about the GC logs. I am pretty sure that's a red herring. There has to be another eval for a previously completed batch job to reschedule. So the details of the evals and allocs of the dispatched job would help here.

@Lamboona
Copy link

@preetapan Thats the problem the repeated jobs we are seeing have seperate job IDs, previously with the GC being so frequent we hadn't actually seen that this was the case. Which is why i think this might be a seperate issue.

If you want me to post the output from the allocs/evals they aren't exactly small do you want them pasted directly?

@preetapan
Copy link
Contributor

Thats the problem the repeated jobs we are seeing have seperate job IDs, previously with the GC being so frequent we hadn't actually seen that this was the case.

That's interesting that they have different job ids, that usually means that something else external is calling nomad job dispatch on the parameterized job.

You can add them as attachments here rather than posting them directly as comments if they are huge. Please attach a job spec as well.

I can go ahead and create another issue as necessary but right now this ticket is fine.

@Lamboona
Copy link

allocation_output.zip

I wasn't able to attach just Json files so i have zipped them together, this shows two seperate allocations that go through seperate evaluations. The jobs themselves would have been created by a single post to the HTTP API dispatch endpoint, the TOOL_RUN_ID is the way i can link them together, as these are generated at dispatch as a uuidv4.

@preetapan
Copy link
Contributor

preetapan commented May 29, 2018

@Lamboona Having looked through all the allocs/evals I can rule out that this is anything to do with rescheduling. All the evals have a TriggeredBy reason of "job-register", so the second set of dispatched jobs and the allocations for them were not triggered by the scheduler.

One thing I noticed from the SubmitTime field of the jobs in the two allocs was that they were close to 3 seconds apart.

$grep SubmitTime alloc2.json 
        "SubmitTime": 1527261125866757635,
$grep SubmitTime alloc1.json 
        "SubmitTime": 1527261128492329485,

That's telling me that those two jobIDS were created by two different calls to the job dispatch server end point, which in turn are triggered by calls to the HTTP dispatch end point.

Do you use any http client libraries when wrapping calls to Nomad, perhaps that layer is doing a retry?

@wyattanderson
Copy link

I've been noticing this more and more recently as well, and unfortunately only usually notice it well after the allocations have been garbage collected, which happens extremely frequently in our environment due to our job volume. When I've seen it, I've seen the exact same dispatched job ID in the new allocation (REDACTED/dispatch-1534540234-a09077f0/periodic-1536645600). The parameterized part of the job is optional and we don't have any other systems that would have redispatched the job other than the periodic scheduler. The retries usually occur several hours apart, and the first allocation has always succeeded.

Here's what I think are the relevant bits of the job configuration; I can add more detail if necessary later on. We don't currently have a reschedule stanza on this job.

job "REDACTED" {
    type = "batch"

    periodic {
        cron = "0 2 * * * *"
        time_zone = "America/New_York"
    }

    group "REDACTED" {
        restart {
            attempts = 1
            mode = "fail"
        }
    }
}

I'll see what I can do to collect allocation and evaluation logs when the job completes, but I figured I'd jump on this issue in the meantime.

@wyattanderson
Copy link

A hunch: our servers (for historical reasons) have their time zone set to EST5EDT, which is UTC -4. I haven't confirmed this yet, but I've noticed that for a few duplicate batch jobs we've seen, they've come in almost exactly 4 hours after the job was originally scheduled. Is it possible that there's a faulty comparison somewhere between tz-aware and tz-naive datetimes?

Can other folks maybe chime in with their system clock settings?

@shantanugadgil
Copy link
Contributor

@wyattanderson We run our entire infrastructure on UTC time (though things could be in any timezone)

Would it be possible to switch your setups to UTC and see if the problem goes away (or occurs anyway)?

@notnoopci
Copy link
Contributor

notnoopci commented Oct 2, 2018

In a test environment, I was able to hit this issue 1-5% depending on various configurations and potentially the position of the moon. I added my notes and reproduction steps and logs in https://github.com/notnoopci/nomad-0.8.3-job-rerun-bug .

It does seem that problem is related to GC - and looks like job deregistering somehow resulting into new allocations for re-runs. This matches @camerondavison observation of the eval being job-deregister. I suspect it's a race in purging of job and allocation from state and evaluating the job-deregister eval.

We are digging into this further and we'll be happy to assist in debugging this or testing any hypothesis/patches, as this is blocking our adoption of Nomad 0.8.

notnoopci added a commit to circleci/nomad that referenced this issue Oct 3, 2018
Fixes hashicorp#4299

There is a race between job purging and job-deregister evaluations
handling that seem to cause completed jobs to re-run unexpectedly when
they are being GCed.

Here, we make `BatchDeregister` behave just like `Deregister`, where we
submit the `job-deregister` evals after `JobBatchDeregisterRequest` is
committed to raft.
@wyattanderson
Copy link

Finally able to capture this.

The redispatched allocation:

{
    "CreateIndex": 77003310, 
    "JobVersion": 0, 
    "ClientStatus": "running", 
    "DesiredTransition": {
        "Migrate": null, 
        "ForceReschedule": null, 
        "Reschedule": null
    }, 
    "DeploymentStatus": null, 
    "ModifyIndex": 77003764, 
    "DesiredDescription": "", 
    "JobID": "REDACTED/dispatch-1538582681-b197aba4", 
    "ClientDescription": "", 
    "TaskGroup": "REDACTED", 
    "CreateTime": 1538599086273802244, 
    "NodeID": "bd770e44-1610-7479-10fe-4d2b8e9c398d", 
    "ModifyTime": 1538599088473645875, 
    "DesiredStatus": "run", 
    "Name": "REDACTED/dispatch-1538582681-b197aba4.REDACTED[0]", 
    "EvalID": "1be2b1c5-0d98-41e0-4846-eb7898185bf9", 
    "FollowupEvalID": "", 
    "ID": "3a5945f4-b606-86c8-be60-e29d7b77a7e8", 
    "RescheduleTracker": null, 
    "TaskStates": {
        "task1": {
            "LastRestart": "0001-01-01T00:00:00Z", 
            "Restarts": 0, 
            "Failed": false, 
            "State": "running", 
            "FinishedAt": "0001-01-01T00:00:00Z", 
            "StartedAt": "2018-10-03T20:38:08.363318471Z", 
            "Events": [
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Task received by client", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {}, 
                    "TaskSignalReason": "", 
                    "Type": "Received", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599087847401470, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "", 
                    "ExitCode": 0
                }, 
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Building Task Directory", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {
                        "message": "Building Task Directory"
                    }, 
                    "TaskSignalReason": "", 
                    "Type": "Task Setup", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599087847468088, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "Building Task Directory", 
                    "ExitCode": 0
                }, 
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Task started by client", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {}, 
                    "TaskSignalReason": "", 
                    "Type": "Started", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599088351381536, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "", 
                    "ExitCode": 0
                }
            ]
        }, 
        "task2": {
            "LastRestart": "0001-01-01T00:00:00Z", 
            "Restarts": 0, 
            "Failed": false, 
            "State": "running", 
            "FinishedAt": "0001-01-01T00:00:00Z", 
            "StartedAt": "2018-10-03T20:38:08.316265838Z", 
            "Events": [
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Task received by client", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {}, 
                    "TaskSignalReason": "", 
                    "Type": "Received", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599087847317486, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "", 
                    "ExitCode": 0
                }, 
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Building Task Directory", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {
                        "message": "Building Task Directory"
                    }, 
                    "TaskSignalReason": "", 
                    "Type": "Task Setup", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599087847457864, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "Building Task Directory", 
                    "ExitCode": 0
                }, 
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Task started by client", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {}, 
                    "TaskSignalReason": "", 
                    "Type": "Started", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599088304795991, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "", 
                    "ExitCode": 0
                }
            ]
        }
    }
}

and the evaluation:

{
    "PreviousEval": "", 
    "Namespace": "default", 
    "ModifyIndex": 77003323, 
    "JobID": "REDACTED/dispatch-1538582681-b197aba4", 
    "Priority": 30, 
    "WaitUntil": "0001-01-01T00:00:00Z", 
    "CreateIndex": 77003266, 
    "Type": "batch", 
    "TriggeredBy": "job-deregister", 
    "Status": "complete", 
    "AnnotatePlan": false, 
    "QueuedAllocations": {
        "REDACTED": 0
    }, 
    "NextEval": "", 
    "JobModifyIndex": 0, 
    "DeploymentID": "", 
    "ID": "1be2b1c5-0d98-41e0-4846-eb7898185bf9", 
    "EscapedComputedClass": false, 
    "Wait": 0, 
    "ClassEligibility": null, 
    "FailedTGAllocs": null, 
    "StatusDescription": "", 
    "QuotaLimitReached": "", 
    "NodeID": "", 
    "SnapshotIndex": 77003266, 
    "BlockedEval": "", 
    "LeaderACL": "", 
    "NodeModifyIndex": 0
}

@notnoopci
Copy link
Contributor

notnoopci commented Oct 3, 2018

@wyattanderson I think that matches my observation too with job-deregister evals and GC being the trigger too.

One workaround is to run separate GC process that watches for nomad completed jobs and immediately calls Deregister on them directly before the core scheduler Jobs GC process them.

@nugend
Copy link
Author

nugend commented Oct 5, 2018

Has anyone seen any of this behavior cease after upgrading to 0.8.6 or anything?

@notnoopci
Copy link
Contributor

notnoopci commented Oct 5, 2018

@nugend I was able to reproduce it with latest nomad master in my tests. In simulated environment, two jobs re-run that's mentioned in https://github.com/notnoopci/nomad-0.8.3-job-rerun-bug/blob/master/run-1538770428/output.txt#L47-L48

@nugend
Copy link
Author

nugend commented Oct 6, 2018

Holy moly, I didn't notice how much work you've put into debugging this already!

Since you have a script to reproduce and detect the issue, have you tried running it through git bisect yet? I guess it'd take a while.

I know for a fact you can put the start at 0.7.1 (we're using that in Production and haven't seen the issue).

@nugend
Copy link
Author

nugend commented Oct 24, 2018

We've noticed that we have some jobs which failed and were explicitly configured to not restart or reschedule just get scheduled again on 0.7.1

I can only conclude that this has always been an issue and it's just certain garbage collection policy changes that cause it to be more or less prevalent.

@notnoop
Copy link
Contributor

notnoop commented Nov 1, 2018

@nugend interesting. You are right in that it's GC change, as I believe the relevant change is #3982 . I have a WIP fix in #4744, but we are researching other ways to mitigate the condition before committing that PR.

I'm not so sure about pre-0.8 - in my previous job, prior to upgrading Nomad to 0.8, we saw job retries rate of ~0.05%, that we attributed to noise and infrastructure related failures; but the rate jumped to 2-3% when we upgraded to 0.8. It's possible that there is another triggering condition that I missed in the noise.

When possible, can you provide some logs and API results for the re-run jobs and associated evaluations/allocations as suggested above - that would help us tremendously.

FYI - I just joined the Nomad team, and I'll be working on this issue very soon again.

notnoop pushed a commit that referenced this issue Nov 10, 2018
Fixes #4299

Upon investigating this case further, we determined the issue to be a race between applying `JobBatchDeregisterRequest` fsm operation and processing job-deregister evals.

Processing job-deregister evals should wait until the FSM log message finishes applying, by using the snapshot index.  However, with `JobBatchDeregister`, any single individual job deregistering was applied accidentally incremented the snapshot index and resulted into processing job-deregister evals.  When a Nomad server receives an eval for a job in the batch that is yet to be deleted, we accidentally re-run it depending on the state of allocation.

This change ensures that we delete deregister all of the jobs and inserts all evals in a single transactions, thus blocking processing related evals until deregistering complete.
preetapan pushed a commit that referenced this issue Nov 13, 2018
Fixes #4299

Upon investigating this case further, we determined the issue to be a race between applying `JobBatchDeregisterRequest` fsm operation and processing job-deregister evals.

Processing job-deregister evals should wait until the FSM log message finishes applying, by using the snapshot index.  However, with `JobBatchDeregister`, any single individual job deregistering was applied accidentally incremented the snapshot index and resulted into processing job-deregister evals.  When a Nomad server receives an eval for a job in the batch that is yet to be deleted, we accidentally re-run it depending on the state of allocation.

This change ensures that we delete deregister all of the jobs and inserts all evals in a single transactions, thus blocking processing related evals until deregistering complete.
@notnoop
Copy link
Contributor

notnoop commented Nov 13, 2018

@nugend Thanks for your patience! We have resolved this issue and intend to ship a point to release with the fix soon. Would you be able to vet a test build in your environment soon? I'll update the ticket as soon as we cut.a release.

@nugend
Copy link
Author

nugend commented Nov 13, 2018

I've already built a release from the master branch. We'll be testing it in a few days.

@preetapan
Copy link
Contributor

@nugend master has a lot of churn due to ongoing work with driver plugins, so would definitely not recommend running a build based on master in production.

@camerondavison
Copy link
Contributor

@notnoop would this problem also manifest itself on batch allocations that fail? I am seeing a batch job run too many times (set the restart policy to only 1) and once it failed it still restarted.

@shantanugadgil
Copy link
Contributor

I set the restart parameter to 0 (zero) for my AMI build jobs which I want to run only once.

@camerondavison
Copy link
Contributor

sorry. I meant to say that it restarted 4 times at least, still in process of restarting.

@nugend
Copy link
Author

nugend commented Feb 14, 2019

Just want to say this does seem to be fixed in 0.8.7 and I would suggest anyone experiencing it on an earlier build should try upgrading.

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests