
[question] Clarification on mode=fail in job spec #2286

Closed
BSick7 opened this issue Feb 6, 2017 · 6 comments

Comments


BSick7 commented Feb 6, 2017

A job contains a single task group with the following:

                "RestartPolicy": {
                    "Interval": 0,
                    "Attempts": 1,
                    "Delay": 0,
                    "Mode": "fail"
                },

If this job fails with exit code 1, it keeps restarting the task.
Am I misunderstanding the Nomad docs, or is this expected?
How can I get the desired behavior of allowing only one execution (success or failure)?

(tail snippet of job status)

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status   Created At
3cdd82a5  6a13d452  c809c1ad  deploy      run      running  02/06/17 10:43:41 UTC

(tail snippet of alloc status of 3cdd82a5)

Recent Events:
Time                   Type        Description
02/06/17 13:21:30 UTC  Started     Task started by client
02/06/17 13:21:29 UTC  Restarting  Task restarting in 1ns
02/06/17 13:21:29 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
02/06/17 13:21:22 UTC  Started     Task started by client
02/06/17 13:21:21 UTC  Restarting  Task restarting in 1ns
02/06/17 13:21:21 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
02/06/17 13:21:13 UTC  Started     Task started by client
02/06/17 13:21:12 UTC  Restarting  Task restarting in 1ns
02/06/17 13:21:12 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
02/06/17 13:21:06 UTC  Started     Task started by client

dadgar commented Feb 6, 2017

@BSick7 This seems like a bug. What version of Nomad are you on and can you provide a job that reproduces?


BSick7 commented Feb 6, 2017

Nomad version

nomad v0.5.2

Job spec (json)

I have scrubbed the Docker image name as it's private.
The important thing is that the command fails with exit code 1.

{
    "Job": {
        "Region": "us-east",
        "ID": "foo-migration",
        "ParentID": "",
        "Name": "foo-migration",
        "Type": "batch",
        "Priority": 50,
        "AllAtOnce": false,
        "Datacenters": [
            "us-east-1b",
            "us-east-1d",
            "us-east-1e"
        ],
        "Constraints": null,
        "TaskGroups": [
            {
                "Name": "deploy",
                "Count": 1,
                "Constraints": null,
                "Tasks": [
                    {
                        "Name": "foo-migration",
                        "Driver": "docker",
                        "User": "",
                        "Config": {
                            "args": [
                                "exec",
                                "rails",
                                "db:migrate",
                                "db:seed"
                            ],
                            "command": "bundle",
                            "image": "<scrubbed-image>"
                        },
                        "Constraints": null,
                        "Env": {},
                        "Services": null,
                        "Resources": {
                            "CPU": 250,
                            "MemoryMB": 128,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "Networks": null
                        },
                        "Meta": null,
                        "KillTimeout": 5000000000,
                        "LogConfig": {
                            "MaxFiles": 10,
                            "MaxFileSizeMB": 10
                        },
                        "Artifacts": null,
                        "Vault": null,
                        "Templates": null
                    }
                ],
                "RestartPolicy": {
                    "Interval": 0,
                    "Attempts": 1,
                    "Delay": 0,
                    "Mode": "fail"
                },
                "EphemeralDisk": {
                    "Sticky": false,
                    "Migrate": false,
                    "SizeMB": 300
                },
                "Meta": null
            }
        ],
        "Update": {
            "Stagger": 0,
            "MaxParallel": 0
        },
        "Periodic": null,
        "Meta": null,
        "VaultToken": "",
        "Status": "dead",
        "StatusDescription": "",
        "CreateIndex": 3167431,
        "ModifyIndex": 3167459,
        "JobModifyIndex": 3167431
    }
}

@dadgar dadgar added this to the v0.5.5 milestone Feb 7, 2017
dadgar added a commit that referenced this issue Feb 13, 2017
This PR ensures that the interval specified is not less than 5 seconds.

Fixes #2286

dadgar commented Feb 13, 2017

So the behavior was "correct" but very unexpected. Your policy allowed the job to restart 1 time within a zero-length interval since it started, so every failure fell into a fresh interval and was granted up to one restart again.

The linked PR validates that there is a sane minimum interval.
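The window semantics described above can be sketched as a small simulation (a hypothetical helper for illustration, not Nomad's actual implementation): failures are counted against a window of length `interval`, and with a zero interval no prior failure ever falls inside the window, so the attempt budget never runs out.

```python
from dataclasses import dataclass

@dataclass
class RestartPolicy:
    interval: float  # seconds; window over which restart attempts are counted
    attempts: int    # restarts allowed per window
    mode: str        # "fail" or "delay"

def should_restart(policy, failure_times):
    """Decide whether the latest failure gets a restart.

    failure_times: sorted timestamps (seconds) of failures so far,
    including the latest one. Sketch of the window logic discussed
    in this issue, not Nomad's real code.
    """
    now = failure_times[-1]
    # Count failures that fall strictly inside the current interval window.
    in_window = [t for t in failure_times if now - t < policy.interval]
    return len(in_window) <= policy.attempts

# Interval 0: the window is always empty, so every failure looks like
# the first in a new interval and the task restarts forever.
zero = RestartPolicy(interval=0, attempts=1, mode="fail")
print(should_restart(zero, [0, 8, 16, 24]))  # -> True: restarts again

# A real interval exhausts the attempt budget; in mode "fail" the
# task is then failed instead of restarted.
sane = RestartPolicy(interval=300, attempts=1, mode="fail")
print(should_restart(sane, [0, 8]))          # -> False: budget used up
```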


BSick7 commented Feb 14, 2017

@dadgar Does this mean that a job can restart in mode=fail?
Is it safe to assume the way to get my original intention of single failure is to set interval to an extremely high duration to prevent restarts?


dadgar commented Feb 14, 2017

@BSick7 Exactly. So if you had an interval of 5 minutes, mode "fail", and a retry count of 3, the task could restart up to 3 times within 5 minutes before we fail it. If you set the interval to something extremely large, it effectively becomes the maximum number of restarts forever.
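Following that advice, a task group that should run exactly once (success or failure) could use a policy like the one below. This is a sketch based on the discussion above, not a verified recommendation: durations are in nanoseconds as elsewhere in this spec, the 24-hour interval is an arbitrary "extremely large" value, and `"Attempts": 0` with mode `"fail"` assumes the task is failed immediately with no restarts granted.

```json
"RestartPolicy": {
    "Interval": 86400000000000,
    "Attempts": 0,
    "Delay": 15000000000,
    "Mode": "fail"
}
```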

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022