
Errors after several node-drains and reboots #2328

Closed
tantra35 opened this issue Feb 17, 2017 · 16 comments

tantra35 (Contributor) commented Feb 17, 2017

Nomad version

0.5.4

We ran node-drain and rebooted several times, and after that we got the following errors:

 failed to remove the secret dir "/var/lib/nomad/alloc/1275a819-8161-b4b4-ad0b-baeaa02a2359/cassandraLocal/secrets": unmount: invalid argument

    2017/02/17 14:58:24.930392 [ERR] client: failed to destroy context for alloc 'dd74ff0c-d577-9317-fbe8-6e516133a244': 1 error(s) occurred:


* 1 error(s) occurred:


* failed to remove the secret dir "/var/lib/nomad/alloc/dd74ff0c-d577-9317-fbe8-6e516133a244/dynamoDBLocal/secrets": unmount: invalid argument

    2017/02/17 14:58:24.942293 [ERR] client: failed to destroy context for alloc '8fab41ed-5971-9f04-82a1-c75763404bf3': 1 error(s) occurred:


* 1 error(s) occurred:


* failed to remove the secret dir "/var/lib/nomad/alloc/8fab41ed-5971-9f04-82a1-c75763404bf3/webforum/secrets": unmount: invalid argument

    2017/02/17 14:58:24.967397 [ERR] client: failed to destroy context for alloc '2016432a-578d-d6a7-9789-b3e7363de885': 1 error(s) occurred:


* 1 error(s) occurred:


* failed to remove the secret dir "/var/lib/nomad/alloc/2016432a-578d-d6a7-9789-b3e7363de885/cassandraLocal/secrets": unmount: invalid argument

    2017/02/17 15:08:08.516948 [ERR] http: Request /v1/client/allocation/46509d21-5530-b324-88d9-e27b31d3a042/stats, error: unknown allocation ID "46509d21-5530-b324-88d9-e27b31d3a042"

    2017/02/17 15:08:52.388111 [ERR] http: Request /v1/client/allocation/46509d21-5530-b324-88d9-e27b31d3a042/stats, error: unknown allocation ID "46509d21-5530-b324-88d9-e27b31d3a042"

    2017/02/17 15:09:55.638789 [ERR] http: Request /v1/client/allocation/46509d21-5530-b324-88d9-e27b31d3a042/stats, error: unknown allocation ID "46509d21-5530-b324-88d9-e27b31d3a042"

==> Caught signal: terminated

After that, Nomad would no longer launch at all, with the following errors:

==> Starting Nomad agent...

==> Error starting agent: client setup failed: failed to restore state: 9 error(s) occurred:


* failed to find task dir metadata for alloc "00162589-3e6b-b1e8-fc0a-cfdb01a686e4" task "webphp"

* failed to find task dir metadata for alloc "1275a819-8161-b4b4-ad0b-baeaa02a2359" task "cassandraLocal"

* failed to find task dir metadata for alloc "17779ff1-e7b3-7b2b-c8e1-697492a1545b" task "dynamoDBLocal"

* failed to find task dir metadata for alloc "2016432a-578d-d6a7-9789-b3e7363de885" task "cassandraLocal"

* failed to find task dir metadata for alloc "5135da0b-96b7-4af3-579d-9986bc9d6964" task "msgserver"

* failed to find task dir metadata for alloc "5b47d90a-0ac1-f2fc-c3b8-1cfd1537f417" task "mysql"

* failed to find task dir metadata for alloc "630e9dcd-10fc-83b2-03e3-1c67e0fd99f0" task "gateway"

* failed to find task dir metadata for alloc "99764266-ff57-5bfd-f042-71b71b5f8f94" task "webforum"

* failed to find task dir metadata for alloc "ddab1f87-2653-32a9-2ff2-af82c152ec18" task "statsite"

    Loaded configuration from /etc/nomad/nomad.json

This error happened again and again, with no chance of a normal Nomad launch. So we stopped the Nomad agent, deleted the alloc dir and the client dir, and finally started Nomad again, after which it worked perfectly.
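For reference, the workaround described above can be sketched as a small script. The function name and the `NOMAD_DATA_DIR` variable are placeholders of mine; the paths follow from the `data_dir = "/var/lib/nomad/"` setting in the config below. Note that this discards all local allocation state, so it is only appropriate when the agent refuses to restore state, as in this issue:

```shell
# Wipe Nomad's client-side state so the agent can start fresh.
# WARNING: this discards all local allocation data; only use it when
# the agent fails to restore state on startup, as described above.
NOMAD_DATA_DIR="${NOMAD_DATA_DIR:-/var/lib/nomad}"

nomad_wipe_client_state() {
    # Stop the agent first so nothing is writing to the state dirs.
    systemctl stop nomad || true
    # Delete the alloc dir and the client dir, as in the workaround above.
    rm -rf "$NOMAD_DATA_DIR/alloc" "$NOMAD_DATA_DIR/client"
    systemctl start nomad
}
```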

tantra35 changed the title from "Errors after some drains" to "Errors after saveral node-drains, and reboots" on Feb 17, 2017
dadgar (Contributor) commented Feb 28, 2017

@tantra35 I tried reproducing this on 0.5.4 and 0.5.5-rc1. Can you provide specific steps to reproduce?

tantra35 (Contributor, Author) commented Feb 28, 2017

All jobs on that node were tied to it with a constraint:

    constraint
    {
        attribute = "${node.unique.name}"
        value = "social"
    }

We performed the following sequence of steps:

nomad node-drain -self
reboot

then we ran the same again:

nomad node-drain -self
reboot

and a third time:

nomad node-drain -self
reboot

and after that Nomad began failing with the messages above.
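The cycle above can be sketched as a loop. In practice each iteration has to be run by hand, since `reboot` kills the shell; the function (whose name is my own) just documents the sequence:

```shell
# Repro sketch for the drain/reboot cycle described above.
# Each iteration is really run manually, because `reboot` interrupts
# the script; the loop only records the sequence of commands.
drain_and_reboot_cycle() {
    count="${1:-3}"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Drain the local node so allocations are migrated off it.
        nomad node-drain -self
        # Then reboot the machine.
        reboot
        i=$((i + 1))
    done
}
```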

dadgar (Contributor) commented Feb 28, 2017

Can you give your client config and the job you were running?

tantra35 (Contributor, Author) commented Feb 28, 2017

nomad.json

{
    "datacenter": "test",
    "data_dir": "/var/lib/nomad/",

    "enable_syslog": true,
    "log_level": "WARN",

    "bind_addr": "192.168.31.220",

    "consul":
    {
        "token": "9c3eeb80-3b03-4cf6-9992-6078846e2a0b"
    },

    "server":
    {
        "enabled": true,
        "bootstrap_expect": 3
    },

    "client":
    {
        "enabled": true,
        "max_kill_timeout": "10m",

        "options":
        {
            "driver.raw_exec.enable": "1",
            "docker.privileged.enabled": true
        },

        "reserved":
        {
            "cpu": 100,
            "memory": 1024
        }
    }
}

One of our jobs:

job "webphp"
{
    # Run the job in the global region, which is the default.
    # region = "global"

    # Specify the datacenters within the region this job can run in.
    datacenters = ["test"]

    # Service type jobs optimize for long-lived services. This is
    # the default but we can change to batch for short-lived tasks.
    # type = "service"

    # Priority controls our access to resources and scheduling priority.
    # This can be 1 to 100, inclusively, and defaults to 50.
    # priority = 50
    priority = 70

    # Restrict our job to only linux.
    constraint
    {
        attribute = "${attr.kernel.name}"
        value = "linux"
    }

    constraint
    {
        attribute = "${node.unique.name}"
        value = "social"
    }

    constraint
    {
        distinct_hosts = true
    }

    update
    {
        stagger = "10s"
        max_parallel = 1
    }

    group "webphp"
    {
        count = 1

        task "webphp"
        {
            driver = "docker"

            artifact
            {
                source = "http://docker.playrix.local/playrix-php-common-t07.tar.gz"

                options
                {
                    archive=false
                }
            }

            config
            {
                image = "playrix/php-common:t07"
                load = ["playrix-php-common-t07.tar.gz"]
                network_mode = "plrx-aws"
                ipc_mode = "host"
                hostname = "webphp"
                dns_servers = ["172.17.0.1"]
                command = "/sbin/init_plrx"
                args = ["-c", "/usr/bin/runsvdir -P /etc/service/"]

                volumes = [
                    "/home/www:/home/www",
                    "/var/run/mysqld:/var/run/mysqld"
                ]

                port_map
                {
                    appport = 8080
                }
            }

            env
            {
                APPNAME = "webphp"
            }

            logs
            {
                max_files = 3
                max_file_size = 10
            }

            resources
            {
                memory = 1100
                cpu = 30

                network
                {
                    mbits = 10
                    port "appport"
                    {
                        static = "8080"
                    }
                }
            }
        }
    }
}

schmichael (Member) commented:

#2391 and #2394 should fix the first errors you saw, and I was unable to reproduce the state file corruption error you saw after that.

I'm closing for now, and 0.5.5rc2 should be out shortly. Please open a new ticket if you still run into problems!

Thanks for the bug report!

hmalphettes (Contributor) commented:

Hi,

I'm afraid I still hit this issue with 0.5.5rc2 on a single machine.
I ran and stopped the same job a number of times.
At some point, Nomad crashed and systemd failed to restart it.

nomad --version
Nomad v0.5.5-rc2 (f6c67535d120a03b7a3f6021ce20c4623f558072+CHANGES)

Here are the journald logs:

Mar 07 18:49:01 ip-172-31-23-19 bash[28305]: ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
Mar 07 18:49:01 ip-172-31-23-19 bash[28305]: Loaded configuration from /data/stoic/etc/nomad/config.hcl
Mar 07 18:49:01 ip-172-31-23-19 bash[28305]: ==> Starting Nomad agent...
Mar 07 18:49:01 ip-172-31-23-19 bash[28305]: ==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:
Mar 07 18:49:01 ip-172-31-23-19 bash[28305]: * failed to find task dir metadata for alloc "cba86622-6a4e-315d-d292-19897fc6d354" task "marcus"
Mar 07 18:49:01 ip-172-31-23-19 systemd[1]: nomad.service: main process exited, code=exited, status=1/FAILURE

dadgar (Contributor) commented Mar 7, 2017

@hmalphettes Can you paste the full logs so we can see the crash? So your reproduction steps were to just:

  1. nomad run (docker job)?
  2. nomad stop job
  3. Go back to 1 a few times
  4. Restart nomad

Also, was this a fresh install or an in-place update of the client? Was there existing data in the data dir?
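The suspected repro steps above can be sketched as a loop. The function name and the `example` job name are placeholders, not taken from the report:

```shell
# Sketch of the suspected repro: run and stop the same job repeatedly,
# then restart the agent. "example" is a placeholder job name.
churn_job() {
    cycles="${1:-3}"
    i=0
    while [ "$i" -lt "$cycles" ]; do
        nomad run example.nomad
        nomad stop example
        i=$((i + 1))
    done
    # Finally restart the agent, as in step 4 above.
    systemctl restart nomad
}
```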

schmichael (Member) commented:

What was the status of the "cba86622-6a4e-315d-d292-19897fc6d354" alloc and "marcus" task? Can you provide any logs related to them from the run prior to the error?

hmalphettes (Contributor) commented:

@dadgar, @schmichael my apologies for the fuzzy report. I certainly don't want to create uncertainty, and I am convinced 0.5.5-rc2 fixes a number of these cases.

  1. Run Nomad as a non-root user via a systemd unit file with Restart=on-failure.
  2. nomad run a raw_exec script; the [long-running] job either runs or fails.
  3. nomad stop
  4. Go back to 2 a few times.
  5. At some point Nomad was stopped (or crashed) and systemd failed to restart it.
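For context, the systemd setup in step 1 looks roughly like this. This is only a sketch: the unit name, description, and user are placeholders; `Restart=on-failure` comes from the description above, and the `ExecStart` path matches the command visible in the journald logs below:

```ini
# /etc/systemd/system/nomad.service (sketch; user name is a placeholder)
[Unit]
Description=Nomad
After=network.target

[Service]
# Runs as a non-root user, per the report above.
User=nomad
ExecStart=/data/stoic/bin/nomad agent -config=/data/stoic/etc/nomad/config.hcl
# systemd retries after a failure, which is the restart cycle in the logs.
Restart=on-failure

[Install]
WantedBy=multi-user.target
```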

According to the logs, the alloc "cba86622-6a4e-315d-d292-19897fc6d354" was for the marcus task and it failed. Then it was immediately GC-ed.

The logs say that Nomad was asked to shut down, and I cannot remember manually doing that; I was focused on making my code run and not looking much at Nomad. If the description of the scenario contains too much uncertainty, please disregard and I'll provide something more reproducible next time.

Here are the logs up to when nomad stopped and failed to restart:

Mar 07 18:36:04 ip-172-31-23-19 bash[25411]: ==> Monitoring evaluation "217d8493"
Mar 07 18:36:04 ip-172-31-23-19 bash[25411]: Evaluation triggered by job "marcus"
Mar 07 18:36:05 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:05 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:36:05 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:05 [INFO] (runner) creating watcher
Mar 07 18:36:05 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:05 [INFO] (runner) starting
Mar 07 18:36:05 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:05 [INFO] (runner) initiating run
Mar 07 18:36:05 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:05 [INFO] (runner) rendered "(dynamic)" => "/data/stoic/var/nomad/data/alloc/cba86622-6a4e-315d-d292-19897fc6d354/marcus/local/start-marcus.sh"
Mar 07 18:36:05 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:05 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:36:05 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:05 [INFO] (runner) creating watcher
Mar 07 18:36:05 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:05 [INFO] (runner) starting
Mar 07 18:36:05 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:05 [INFO] (runner) initiating run
Mar 07 18:36:05 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:05 [INFO] (runner) rendered "(dynamic)" => "/data/stoic/var/nomad/data/alloc/28a5b5ab-7a87-a324-fdd2-572668f52d8c/tank/local/start-tank.sh"
Mar 07 18:36:05 ip-172-31-23-19 bash[25411]: Evaluation status changed: "pending" -> "complete"
Mar 07 18:36:05 ip-172-31-23-19 bash[25411]: ==> Evaluation "217d8493" finished with status "complete"
(...)
Mar 07 18:36:06 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:06.806831 [INFO] client: task "marcus" for alloc "cba86622-6a4e-315d-d292-19897fc6d354" failed: Wait returned exit code 1, signal 0, and error <nil>
Mar 07 18:36:06 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:06.806864 [INFO] client: Restarting task "marcus" for alloc "cba86622-6a4e-315d-d292-19897fc6d354" in 12.021667367s
Mar 07 18:36:19 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:19.693388 [INFO] client: task "marcus" for alloc "cba86622-6a4e-315d-d292-19897fc6d354" failed: Wait returned exit code 1, signal 0, and error <nil>
Mar 07 18:36:19 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:19.693419 [INFO] client: Restarting task "marcus" for alloc "cba86622-6a4e-315d-d292-19897fc6d354" in 10.61973742s
Mar 07 18:36:31 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:31.167170 [INFO] client: task "marcus" for alloc "cba86622-6a4e-315d-d292-19897fc6d354" failed: Wait returned exit code 1, signal 0, and error <nil>
Mar 07 18:36:31 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:31.167217 [INFO] client: Restarting task "marcus" for alloc "cba86622-6a4e-315d-d292-19897fc6d354" in 11.329529263s
Mar 07 18:36:43 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:43.372507 [INFO] client: task "marcus" for alloc "cba86622-6a4e-315d-d292-19897fc6d354" failed: Wait returned exit code 1, signal 0, and error <nil>
Mar 07 18:36:43 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:43.372537 [INFO] client: Restarting task "marcus" for alloc "cba86622-6a4e-315d-d292-19897fc6d354" in 11.591167647s
Mar 07 18:36:55 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:55.832614 [INFO] client: task "marcus" for alloc "cba86622-6a4e-315d-d292-19897fc6d354" failed: Wait returned exit code 1, signal 0, and error <nil>
Mar 07 18:36:55 ip-172-31-23-19 bash[25408]: 2017/03/07 18:36:55.832647 [INFO] client: Restarting task "marcus" for alloc "cba86622-6a4e-315d-d292-19897fc6d354" in 12.087135845s
Mar 07 18:37:08 ip-172-31-23-19 bash[25408]: 2017/03/07 18:37:08 [INFO] (runner) stopping
Mar 07 18:37:08 ip-172-31-23-19 bash[25408]: 2017/03/07 18:37:08 [INFO] (runner) received finish
Mar 07 18:37:08 ip-172-31-23-19 bash[25408]: 2017/03/07 18:37:08.329036 [INFO] client: marking allocation cba86622-6a4e-315d-d292-19897fc6d354 for GC
Mar 07 18:37:08 ip-172-31-23-19 bash[25408]: 2017/03/07 18:37:08.329263 [INFO] client: marking allocation cba86622-6a4e-315d-d292-19897fc6d354 for GC
Mar 07 18:37:08 ip-172-31-23-19 bash[25408]: 2017/03/07 18:37:08 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:37:08 ip-172-31-23-19 bash[25408]: 2017/03/07 18:37:08 [INFO] (runner) creating watcher
Mar 07 18:37:08 ip-172-31-23-19 bash[25408]: 2017/03/07 18:37:08 [INFO] (runner) starting
Mar 07 18:37:08 ip-172-31-23-19 bash[25408]: 2017/03/07 18:37:08 [INFO] (runner) initiating run
Mar 07 18:37:08 ip-172-31-23-19 bash[25408]: 2017/03/07 18:37:08 [INFO] (runner) rendered "(dynamic)" => "/data/stoic/var/nomad/data/alloc/c018caf6-875a-0664-3775-21629860ad8b/marcus/local/start-marcus.sh"
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: ==> Caught signal: terminated
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.531446 [INFO] agent: requesting shutdown
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.531456 [INFO] client: shutting down
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.532387 [ERR] driver.raw_exec: failed to deregister services: connection is shut down
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.532403 [ERR] driver.raw_exec: error killing executor: connection is shut down
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.532684 [INFO] client: task "postgres" for alloc "c36621ac-eaec-46f2-b073-941cc1a35d20" failed: Wait returned exit code 0, signal 0, and error unexpected EOF
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.532708 [INFO] client: Restarting task "postgres" for alloc "c36621ac-eaec-46f2-b073-941cc1a35d20" in 11.320607898s
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.534789 [ERR] driver.raw_exec: failed to deregister services: connection is shut down
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.534806 [ERR] driver.raw_exec: error killing executor: connection is shut down
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.534906 [ERR] driver.raw_exec: failed to deregister services: connection is shut down
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.534920 [ERR] driver.raw_exec: error killing executor: connection is shut down
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.534992 [ERR] driver.raw_exec: failed to deregister services: connection is shut down
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.535004 [ERR] driver.raw_exec: error killing executor: connection is shut down
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.535789 [INFO] client: task "marcus" for alloc "c018caf6-875a-0664-3775-21629860ad8b" failed: Wait returned exit code 0, signal 0, and error unexpected EOF
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.535811 [INFO] client: Restarting task "marcus" for alloc "c018caf6-875a-0664-3775-21629860ad8b" in 11.66373612s
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.539278 [INFO] client: task "tank" for alloc "28a5b5ab-7a87-a324-fdd2-572668f52d8c" failed: Wait returned exit code 0, signal 0, and error unexpected EOF
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.539301 [INFO] client: Restarting task "tank" for alloc "28a5b5ab-7a87-a324-fdd2-572668f52d8c" in 12.021667367s
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.540069 [INFO] client: task "web" for alloc "c36621ac-eaec-46f2-b073-941cc1a35d20" failed: Wait returned exit code 0, signal 0, and error unexpected EOF
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.540092 [INFO] client: Restarting task "web" for alloc "c36621ac-eaec-46f2-b073-941cc1a35d20" in 11.320607898s
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.542830 [INFO] nomad: shutting down server
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16 [WARN] serf: Shutdown without a Leave
Mar 07 18:42:16 ip-172-31-23-19 bash[25408]: 2017/03/07 18:42:16.580372 [INFO] agent: shutdown complete
Mar 07 18:42:16 ip-172-31-23-19 systemd[1]: nomad.service: main process exited, code=exited, status=1/FAILURE
Mar 07 18:42:16 ip-172-31-23-19 systemd[1]: Unit nomad.service entered failed state.
Mar 07 18:42:16 ip-172-31-23-19 systemd[1]: nomad.service failed.
Mar 07 18:42:17 ip-172-31-23-19 systemd[1]: nomad.service holdoff time over, scheduling restart.
Mar 07 18:42:17 ip-172-31-23-19 systemd[1]: Starting Nomad...
(...)
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: + /data/stoic/bin/nomad agent -config=/data/stoic/etc/nomad/config.hcl
(...)
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: ==> Starting Nomad agent...
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: Loaded configuration from /data/stoic/etc/nomad/config.hcl
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: ==> Starting Nomad agent...
Mar 07 18:42:19 ip-172-31-23-19 bash[26290]: Error querying servers: Get http://172.31.23.19:4646/v1/agent/members: dial tcp 172.31.23.19:4646: getsockopt: connection refused
Mar 07 18:42:19 ip-172-31-23-19 bash[26290]: + '[' 1 = 0 ']'
Mar 07 18:42:19 ip-172-31-23-19 bash[26290]: + sleep 1
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: ==> Nomad agent configuration:
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: Atlas: <disabled>
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: Client: true
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: Log Level: INFO
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: Region: global (DC: dc1)
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: Server: true
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: Version: 0.5.5rc2
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: ==> Nomad agent started! Log data will stream in below:
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:172.31.23.19:4647 Address:172.31.23.19:4647}]
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] raft: Node at 172.31.23.19:4647 [Follower] entering Follower state (Leader: "")
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] serf: EventMemberJoin: ip-172-31-23-19.global 172.31.23.19
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.104006 [INFO] nomad: starting 2 scheduling worker(s) for [batch system service _core]
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.104137 [INFO] client: using state directory /data/stoic/var/nomad/data/client
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.104147 [INFO] client: using alloc directory /data/stoic/var/nomad/data/alloc
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [WARN] serf: Failed to re-join any previously known node
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.104446 [INFO] nomad: adding server ip-172-31-23-19.global (Addr: 172.31.23.19:4647) (DC: dc1)
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.107718 [INFO] fingerprint.cgroups: cgroups are available
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.108891 [INFO] fingerprint.consul: consul agent is available
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.114079 [WARN] fingerprint.network: Unable to parse Speed in output of '/usr/sbin/ethtool eth0'
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.123367 [WARN] driver.raw_exec: raw exec is enabled. Only enable if needed
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.125403 [ERR] driver.raw_exec: error connecting to plugin so destroying plugin pid and user pid
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.125415 [ERR] driver.raw_exec: error destroying plugin and userpid: 2 error(s) occurred:
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: * os: process already finished
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: * os: process already finished
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.125423 [ERR] client: failed to open handle to task "tank" for alloc "28a5b5ab-7a87-a324-fdd2-572668f52d8c": error connecting to plugin: error creating rpc client for executor p
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) creating watcher
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) starting
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) initiating run
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.129639 [ERR] driver.raw_exec: error connecting to plugin so destroying plugin pid and user pid
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.129665 [ERR] driver.raw_exec: error destroying plugin and userpid: 2 error(s) occurred:
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: * os: process already finished
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: * os: process already finished
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.129672 [ERR] client: failed to open handle to task "marcus" for alloc "c018caf6-875a-0664-3775-21629860ad8b": error connecting to plugin: error creating rpc client for executor
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.131787 [ERR] driver.raw_exec: error connecting to plugin so destroying plugin pid and user pid
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.131819 [ERR] driver.raw_exec: error destroying plugin and userpid: 2 error(s) occurred:
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: * os: process already finished
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: * os: process already finished
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.131832 [ERR] client: failed to open handle to task "postgres" for alloc "c36621ac-eaec-46f2-b073-941cc1a35d20": error connecting to plugin: error creating rpc client for execut
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) creating watcher
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) starting
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) initiating run
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.132656 [ERR] driver.raw_exec: error connecting to plugin so destroying plugin pid and user pid
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.132667 [ERR] driver.raw_exec: error destroying plugin and userpid: 2 error(s) occurred:
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: * os: process already finished
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: * os: process already finished
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.132673 [ERR] client: failed to open handle to task "web" for alloc "c36621ac-eaec-46f2-b073-941cc1a35d20": error connecting to plugin: error creating rpc client for executor pl
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.133383 [ERR] client.consul: error discovering nomad servers: no Nomad Servers advertising service "nomad" in Consul datacenters: ["dc1"]
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) creating watcher
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) starting
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) initiating run
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19.143820 [INFO] client: Node ID "3e3ba276-c900-4a1c-b237-f938673dd475"
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) creating watcher
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) starting
Mar 07 18:42:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:19 [INFO] (runner) initiating run
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [WARN] raft: Heartbeat timeout from "" reached, starting election
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] raft: Node at 172.31.23.19:4647 [Candidate] entering Candidate state in term 8
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] raft: Election won. Tally: 1
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] raft: Node at 172.31.23.19:4647 [Leader] entering Leader state
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20.274396 [INFO] nomad: cluster leadership acquired
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] (runner) creating watcher
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] (runner) starting
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] (runner) initiating run
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] (runner) creating watcher
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] (runner) starting
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] (runner) initiating run
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20 [INFO] (runner) rendered "(dynamic)" => "/data/stoic/var/nomad/data/alloc/54da4efc-81cf-a83d-7570-091f
Mar 07 18:42:20 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:20.315996 [INFO] client: marking allocation 01411eb6-e7de-5641-93f3-47efe7978088 for GC
(...)
Mar 07 18:42:21 ip-172-31-23-19 bash[26290]: + nomad run marcus.nomad
(...)
Mar 07 18:42:21 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:21.493263 [ERR] sched: <Eval '438c9b2f-581f-2887-8ba8-774b0ad7e199' JobID: 'marcus'>: could not find node "3e3ba276-c900-4a1c-b237-f938673dd475"
Mar 07 18:42:21 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:21.493420 [ERR] sched: <Eval '438c9b2f-581f-2887-8ba8-774b0ad7e199' JobID: 'marcus'>: could not find node "3e3ba276-c900-4a1c-b237-f938673dd475"
Mar 07 18:42:21 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:21.493558 [ERR] sched: <Eval '438c9b2f-581f-2887-8ba8-774b0ad7e199' JobID: 'marcus'>: could not find node "3e3ba276-c900-4a1c-b237-f938673dd475"
Mar 07 18:42:21 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:21.493626 [WARN] nomad: eval <Eval '438c9b2f-581f-2887-8ba8-774b0ad7e199' JobID: 'marcus'> reached delivery limit, marking as failed
Mar 07 18:42:21 ip-172-31-23-19 bash[26290]: ==> Monitoring evaluation "438c9b2f"
Mar 07 18:42:21 ip-172-31-23-19 bash[26290]: Evaluation triggered by job "marcus"
Mar 07 18:42:21 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:21.612886 [INFO] client: task "marcus" for alloc "973d3e4d-b00f-e566-bc15-a6118d1a6bbf" failed: Wait returned exit code 1, signal 0, and error <nil>
Mar 07 18:42:21 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:21.612917 [INFO] client: Restarting task "marcus" for alloc "973d3e4d-b00f-e566-bc15-a6118d1a6bbf" in 10.160541869s
Mar 07 18:42:22 ip-172-31-23-19 bash[26290]: Evaluation status changed: "pending" -> "failed"
Mar 07 18:42:22 ip-172-31-23-19 bash[26290]: ==> Evaluation "438c9b2f" finished with status "failed"
(...)
Mar 07 18:42:22 ip-172-31-23-19 bash[26290]: + nomad run web.nomad
Mar 07 18:42:22 ip-172-31-23-19 bash[26290]: ==> Monitoring evaluation "0178f218"
Mar 07 18:42:22 ip-172-31-23-19 bash[26290]: Evaluation triggered by job "web"
Mar 07 18:42:23 ip-172-31-23-19 bash[26290]: Evaluation status changed: "pending" -> "complete"
Mar 07 18:42:23 ip-172-31-23-19 bash[26290]: ==> Evaluation "0178f218" finished with status "complete"
Mar 07 18:42:23 ip-172-31-23-19 systemd[1]: Started Nomad.
Mar 07 18:42:23 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:23.608760 [ERR] sched: <Eval 'ec70e533-f5ed-0591-1825-cf467318ea45' JobID: 'marcus'>: could not find node "3e3ba276-c900-4a1c-b237-f938673dd475"
Mar 07 18:42:23 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:23.608917 [ERR] sched: <Eval 'ec70e533-f5ed-0591-1825-cf467318ea45' JobID: 'marcus'>: could not find node "3e3ba276-c900-4a1c-b237-f938673dd475"
Mar 07 18:42:23 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:23.609054 [ERR] sched: <Eval 'ec70e533-f5ed-0591-1825-cf467318ea45' JobID: 'marcus'>: could not find node "3e3ba276-c900-4a1c-b237-f938673dd475"
Mar 07 18:42:23 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:23.609117 [WARN] nomad: eval <Eval 'ec70e533-f5ed-0591-1825-cf467318ea45' JobID: 'marcus'> reached delivery limit, marking as failed
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) stopping
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29.846314 [INFO] client: marking allocation 973d3e4d-b00f-e566-bc15-a6118d1a6bbf for GC
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29.847171 [INFO] client: marking allocation 973d3e4d-b00f-e566-bc15-a6118d1a6bbf for GC
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29.847965 [INFO] client: marking allocation 973d3e4d-b00f-e566-bc15-a6118d1a6bbf for GC
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) received finish
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) creating watcher
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) starting
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) initiating run
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) creating watcher
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) starting
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) initiating run
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) creating new runner (dry: false, once: false)
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) creating watcher
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) starting
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) initiating run
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) rendered "(dynamic)" => "/data/stoic/var/nomad/data/alloc/41b6ae9a-662c-08d1-add8-023cfffdb082/web/local/start-web.sh"
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) rendered "(dynamic)" => "/data/stoic/var/nomad/data/alloc/4bdfef9d-b217-e4ac-1839-8cd91770ec7e/marcus/local/start-marcus.sh"
Mar 07 18:42:29 ip-172-31-23-19 bash[26289]: 2017/03/07 18:42:29 [INFO] (runner) rendered "(dynamic)" => "/data/stoic/var/nomad/data/alloc/41b6ae9a-662c-08d1-add8-023cfffdb082/postgres/local/start-postgres.sh"
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: ==> Caught signal: terminated
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.151404 [INFO] agent: requesting shutdown
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.151414 [INFO] client: shutting down
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.156188 [ERR] driver.raw_exec: failed to deregister services: connection is shut down
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.156224 [ERR] driver.raw_exec: error killing executor: connection is shut down
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.156493 [INFO] client: task "postgres" for alloc "41b6ae9a-662c-08d1-add8-023cfffdb082" failed: Wait returned exit code 0, signal 0, and error unexpected EOF
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.156517 [INFO] client: Restarting task "postgres" for alloc "41b6ae9a-662c-08d1-add8-023cfffdb082" in 12.264447032s
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.159875 [INFO] nomad: shutting down server
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19 [WARN] serf: Shutdown without a Leave
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.160019 [ERR] driver.raw_exec: failed to deregister services: connection is shut down
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.160043 [ERR] driver.raw_exec: error killing executor: connection is shut down
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.160392 [INFO] client: task "web" for alloc "41b6ae9a-662c-08d1-add8-023cfffdb082" failed: Wait returned exit code 0, signal 0, and error unexpected EOF
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.160427 [INFO] client: Restarting task "web" for alloc "41b6ae9a-662c-08d1-add8-023cfffdb082" in 12.264447032s
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.163399 [ERR] driver.raw_exec: failed to deregister services: connection is shut down
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.163425 [ERR] driver.raw_exec: error killing executor: connection is shut down
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.163708 [INFO] client: task "marcus" for alloc "4bdfef9d-b217-e4ac-1839-8cd91770ec7e" failed: Wait returned exit code 0, signal 0, and error unexpected EOF
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.163741 [INFO] client: Restarting task "marcus" for alloc "4bdfef9d-b217-e4ac-1839-8cd91770ec7e" in 12.264447032s
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.163947 [ERR] driver.raw_exec: failed to deregister services: connection is shut down
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.163977 [ERR] driver.raw_exec: error killing executor: connection is shut down
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.165429 [INFO] client: task "tank" for alloc "ac5b6b69-c386-fc23-2d16-6acb04f7dd1d" failed: Wait returned exit code 0, signal 0, and error unexpected EOF
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.165450 [INFO] client: Restarting task "tank" for alloc "ac5b6b69-c386-fc23-2d16-6acb04f7dd1d" in 10.160541869s
Mar 07 18:47:19 ip-172-31-23-19 bash[26289]: 2017/03/07 18:47:19.270472 [INFO] agent: shutdown complete
Mar 07 18:47:19 ip-172-31-23-19 systemd[1]: nomad.service: main process exited, code=exited, status=1/FAILURE
Mar 07 18:47:19 ip-172-31-23-19 systemd[1]: Unit nomad.service entered failed state.
Mar 07 18:47:19 ip-172-31-23-19 systemd[1]: nomad.service failed.
Mar 07 18:47:19 ip-172-31-23-19 systemd[1]: nomad.service holdoff time over, scheduling restart.
Mar 07 18:47:19 ip-172-31-23-19 systemd[1]: Starting Nomad...
(...)
Mar 07 18:47:21 ip-172-31-23-19 bash[26956]: + /data/stoic/bin/nomad agent -config=/data/stoic/etc/nomad/config.hcl
Mar 07 18:47:21 ip-172-31-23-19 bash[26959]: Error querying servers: Get http://172.31.23.19:4646/v1/agent/members: dial tcp 172.31.23.19:4646: getsockopt: connection refused
Mar 07 18:47:21 ip-172-31-23-19 bash[26959]: + '[' 1 = 0 ']'
Mar 07 18:47:21 ip-172-31-23-19 bash[26959]: + sleep 1
Mar 07 18:47:21 ip-172-31-23-19 bash[26956]: ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
Mar 07 18:47:21 ip-172-31-23-19 bash[26956]: Loaded configuration from /data/stoic/etc/nomad/config.hcl
Mar 07 18:47:21 ip-172-31-23-19 bash[26956]: ==> Starting Nomad agent...
Mar 07 18:47:21 ip-172-31-23-19 bash[26956]: ==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:
Mar 07 18:47:21 ip-172-31-23-19 bash[26956]: * failed to find task dir metadata for alloc "cba86622-6a4e-315d-d292-19897fc6d354" task "marcus"
Mar 07 18:47:21 ip-172-31-23-19 systemd[1]: nomad.service: main process exited, code=exited, status=1/FAILURE

@schmichael
Member

@hmalphettes Thanks for the extra details! No worries on remembering the exact steps. Would be useful in the future, but Nomad should never corrupt its state files. I'm adding some code now to try to defend against that a little better.

@schmichael schmichael reopened this Mar 8, 2017
@schmichael
Member

We're still unable to reproduce this bug, so if you can we'd appreciate as much of the following as possible:

  1. Your systemd unit file
  2. Your agent config
  3. Bumping to DEBUG log level
  4. Command history

@hmalphettes
Contributor

hmalphettes commented Mar 9, 2017

@schmichael, I figured out what was going on: I am wrapping my nomad calls in a bash function.
That bash function was very happily rewriting nomad stop into pkill nomad (embarrassment); I had meant that to be nomad shutdown.

At this point it behaved more like a chaos monkey that stresses how well nomad can shut down and then get immediately restarted by systemd. Nomad in turn immediately restarts the jobs that use the system scheduler.

The whole thing is fully automated on EC2 with some user-data.

The fact that it actually still managed to start a cluster with 2 machines and run a bunch of server jobs on them successfully is more a testament to how well nomad works than anything else.

Anyway, if we wanted to reproduce this:

  1. Run nomad via systemd; in the ExecStartPost, run a system job: nomad run myworker
  2. Instead of calling nomad stop myworker, use pkill nomad
  3. Rinse and repeat
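The steps above can be sketched as a small script. This is only an illustration, not something I actually ran: `myworker.nomad` is the placeholder job name from the steps, and it assumes the agent is under systemd with Restart=on-failure so it comes back after the kill.

```shell
#!/usr/bin/env bash
# Chaos-monkey repro sketch: hard-kill the agent instead of stopping the
# job cleanly, let systemd restart it, then resubmit the system job.
set -u

repro_iteration() {
  pkill nomad || true          # what my bash wrapper accidentally did
  sleep 5                      # give systemd time to restart the agent
  nomad run myworker.nomad     # the ExecStartPost equivalent: resubmit
}

# Rinse and repeat, e.g.:
# for i in $(seq 1 20); do repro_iteration; done
```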

Here is the systemd unit:

[Unit]
Description=Nomad
Requires=consul.service
After=consul.service

[Service]
User=stoic
EnvironmentFile=-/data/stoic/stoicenv
Environment=GOMAXPROCS=2
Restart=on-failure
ExecStart=/bin/bash -x -c "exec nomad agent -config=$DATA_STOIC/etc/nomad/config.hcl"
ExecStartPost=-/bin/bash -x -c "[ \"$_IS_MASTER\" != 'true' ] && exit 0; \
source /data/stoic/cmds.bash && cd /data/stoic/stoic-boot/jobs; \
nomad-wait-for-all-machines-ready; \
nomad run tank.nomad; nomad run marcus.nomad; nomad run web.nomad;"
ExecReload=/bin/kill -HUP $MAINPID
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target

As far as my use-case is concerned, I am fine with nomad-0.5.5.

Many thanks for your attention!

@schmichael
Member

Aha! Very helpful information, thanks!

The pkill explains some of the log lines you pasted earlier that had us pretty confused. We should still be crash/kill safe. Thanks for the report.

@rickardrosen

rickardrosen commented Mar 14, 2017

I'm unable to start the nomad agent on a node, and we are also seeing the "failed to find task dir metadata for alloc" error.

The directories are created when the agent is started. If I manually rm -rf * in the alloc dirs, nomad starts as expected, but it again creates all the old task alloc dirs, which it does not clean up on shutdown.
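For reference, the manual workaround looks roughly like this. It's a sketch, not an endorsed procedure: the data dir path is whatever your agent's data_dir is set to (here assumed via a NOMAD_DATA_DIR variable), and the agent must be stopped first or it will just recreate the dirs.

```shell
#!/usr/bin/env bash
# Clear stale allocation directories so the client can start with clean
# state. Run only with the agent stopped (e.g. systemctl stop nomad).
set -euo pipefail

clear_stale_allocs() {
  local data_dir="$1"
  # Nothing to do if there is no alloc dir.
  [ -d "$data_dir/alloc" ] || return 0
  # Remove each UUID-named allocation directory under alloc/.
  find "$data_dir/alloc" -mindepth 1 -maxdepth 1 -type d -exec rm -rf {} +
}

# Usage (path is an assumption; use your configured data_dir):
#   clear_stale_allocs /var/lib/nomad
```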

The logged stop looks like this:

systemd[1]: Stopping nomad...
nomad[11948]:     2017/03/14 15:15:58 [DEBUG] plugin: nomad: 2017/03/14 15:15:58 [DEBUG] plugin: received interrupt signal (count: 1). Ignoring.
nomad[11948]: ==> Caught signal: interrupt
nomad[11948]:     2017/03/14 15:15:58.368369 [DEBUG] http: Shutting down http server
nomad[11948]:     2017/03/14 15:15:58.368389 [INFO] agent: requesting shutdown
nomad[11948]:     2017/03/14 15:15:58 [DEBUG] plugin: nomad: 2017/03/14 15:15:58 [DEBUG] plugin: received interrupt signal (count: 1). Ignoring.
nomad[11948]:     2017/03/14 15:15:58 [DEBUG] plugin: nomad: 2017/03/14 15:15:58 [DEBUG] plugin: received interrupt signal (count: 1). Ignoring.
nomad[11948]:     2017/03/14 15:15:58.368400 [INFO] client: shutting down
nomad[11948]:     2017/03/14 15:15:58.369916 [DEBUG] client.vault: stopped
nomad[11948]:     2017/03/14 15:15:58.369970 [DEBUG] client: RPC failed to server 10.75.18.177:4647: rpc error: EOF
nomad[11948]:     2017/03/14 15:15:58.419926 [INFO] agent: shutdown complete
systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: Stopped nomad.
systemd[1]: nomad.service: Unit entered failed state.
systemd[1]: nomad.service: Failed with result 'exit-code'.

Is this normal for an agent shutdown, with the RPC error etc? Looks a bit odd...

After this apparently failed shutdown, the server is left with running docker containers, and the alloc dirs are full of old allocation directories from long-dead/completed/lost tasks.

Any ideas?

@schmichael
Member

Closing this ticket as it's quite old and the code in question is radically different in 0.9 vs 0.5/0.6. Please open a new ticket if you're running into similar issues in a recent version of Nomad.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022