
[question] Clarification on mode=fail in job spec #2286

Closed
BSick7 opened this issue Feb 6, 2017 · 6 comments

Comments


BSick7 commented Feb 6, 2017

A job contains a single task group with the following:

                "RestartPolicy": {
                    "Interval": 0,
                    "Attempts": 1,
                    "Delay": 0,
                    "Mode": "fail"
                },

If this job fails with exit code 1, it keeps restarting the task.
Am I misunderstanding the Nomad docs, or is this expected?
How can I get the desired behavior of allowing only one execution (success or failure)?

(tail snippet of job status)

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status   Created At
3cdd82a5  6a13d452  c809c1ad  deploy      run      running  02/06/17 10:43:41 UTC

(tail snippet of alloc status of 3cdd82a5)

Recent Events:
Time                   Type        Description
02/06/17 13:21:30 UTC  Started     Task started by client
02/06/17 13:21:29 UTC  Restarting  Task restarting in 1ns
02/06/17 13:21:29 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
02/06/17 13:21:22 UTC  Started     Task started by client
02/06/17 13:21:21 UTC  Restarting  Task restarting in 1ns
02/06/17 13:21:21 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
02/06/17 13:21:13 UTC  Started     Task started by client
02/06/17 13:21:12 UTC  Restarting  Task restarting in 1ns
02/06/17 13:21:12 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
02/06/17 13:21:06 UTC  Started     Task started by client

dadgar commented Feb 6, 2017

@BSick7 This seems like a bug. What version of Nomad are you on and can you provide a job that reproduces?


BSick7 commented Feb 6, 2017

Nomad version

nomad v0.5.2

Job spec (json)

I have scrubbed the Docker image name as it's private.
The important thing is that the command fails with exit code 1.

{
    "Job": {
        "Region": "us-east",
        "ID": "foo-migration",
        "ParentID": "",
        "Name": "foo-migration",
        "Type": "batch",
        "Priority": 50,
        "AllAtOnce": false,
        "Datacenters": [
            "us-east-1b",
            "us-east-1d",
            "us-east-1e"
        ],
        "Constraints": null,
        "TaskGroups": [
            {
                "Name": "deploy",
                "Count": 1,
                "Constraints": null,
                "Tasks": [
                    {
                        "Name": "foo-migration",
                        "Driver": "docker",
                        "User": "",
                        "Config": {
                            "args": [
                                "exec",
                                "rails",
                                "db:migrate",
                                "db:seed"
                            ],
                            "command": "bundle",
                            "image": "<scrubbed-image>"
                        },
                        "Constraints": null,
                        "Env": {},
                        "Services": null,
                        "Resources": {
                            "CPU": 250,
                            "MemoryMB": 128,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "Networks": null
                        },
                        "Meta": null,
                        "KillTimeout": 5000000000,
                        "LogConfig": {
                            "MaxFiles": 10,
                            "MaxFileSizeMB": 10
                        },
                        "Artifacts": null,
                        "Vault": null,
                        "Templates": null
                    }
                ],
                "RestartPolicy": {
                    "Interval": 0,
                    "Attempts": 1,
                    "Delay": 0,
                    "Mode": "fail"
                },
                "EphemeralDisk": {
                    "Sticky": false,
                    "Migrate": false,
                    "SizeMB": 300
                },
                "Meta": null
            }
        ],
        "Update": {
            "Stagger": 0,
            "MaxParallel": 0
        },
        "Periodic": null,
        "Meta": null,
        "VaultToken": "",
        "Status": "dead",
        "StatusDescription": "",
        "CreateIndex": 3167431,
        "ModifyIndex": 3167459,
        "JobModifyIndex": 3167431
    }
}

@dadgar dadgar added this to the v0.5.5 milestone Feb 7, 2017
dadgar added a commit that referenced this issue Feb 13, 2017
This PR ensures that the interval specified is not less than 5 seconds.

Fixes #2286

dadgar commented Feb 13, 2017

So the behavior was "correct" but very unexpected. Your policy allowed the job to restart 1 time within a zero-length interval since it started, so every failure fell into a fresh interval and was granted up to one restart again.

The linked PR validates that there is a sane minimum interval.
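The window semantics described above can be sketched as a small simulation (a hypothetical helper for illustration, not Nomad's actual implementation): failures are counted against a window of length `interval`, and with a zero interval no prior failure ever falls inside the window, so the attempt budget never runs out.

```python
from dataclasses import dataclass

@dataclass
class RestartPolicy:
    interval: float  # seconds; window over which restart attempts are counted
    attempts: int    # restarts allowed per window
    mode: str        # "fail" or "delay"

def should_restart(policy, failure_times):
    """Decide whether the latest failure gets a restart.

    failure_times: sorted timestamps (seconds) of failures so far,
    including the latest one. Sketch of the window logic discussed
    in this issue, not Nomad's real code.
    """
    now = failure_times[-1]
    # Count failures that fall strictly inside the current interval window.
    in_window = [t for t in failure_times if now - t < policy.interval]
    return len(in_window) <= policy.attempts

# Interval 0: the window is always empty, so every failure looks like
# the first in a new interval and the task restarts forever.
zero = RestartPolicy(interval=0, attempts=1, mode="fail")
print(should_restart(zero, [0, 8, 16, 24]))  # -> True: restarts again

# A real interval exhausts the attempt budget; in mode "fail" the
# task is then failed instead of restarted.
sane = RestartPolicy(interval=300, attempts=1, mode="fail")
print(should_restart(sane, [0, 8]))          # -> False: budget used up
```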


BSick7 commented Feb 14, 2017

@dadgar Does this mean that a job can restart in mode=fail?
Is it safe to assume the way to get my original intention of single failure is to set interval to an extremely high duration to prevent restarts?


dadgar commented Feb 14, 2017

@BSick7 Exactly. So if you had an interval of 5 minutes, mode "fail", and a retry count of 3, the task could restart up to 3 times within 5 minutes before we fail it. If you set the interval to something extremely large, it effectively becomes the maximum number of restarts forever.
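Following that advice, a task group that should run exactly once (success or failure) could use a policy like the one below. This is a sketch based on the discussion above, not a verified recommendation: durations are in nanoseconds as elsewhere in this spec, the 24-hour interval is an arbitrary "extremely large" value, and `"Attempts": 0` with mode `"fail"` assumes the task is failed immediately with no restarts granted.

```json
"RestartPolicy": {
    "Interval": 86400000000000,
    "Attempts": 0,
    "Delay": 15000000000,
    "Mode": "fail"
}
```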

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022