Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic from nomad in client-node mode, writing to closed channel #2588

Closed
mildred opened this issue Apr 26, 2017 · 2 comments
Closed

panic from nomad in client-node mode, writing to closed channel #2588

mildred opened this issue Apr 26, 2017 · 2 comments

Comments

@mildred
Copy link
Contributor

mildred commented Apr 26, 2017

If you have a question, prepend your issue with [question] or preferably use the nomad mailing list.

If filing a bug please include the following:

Nomad version

Nomad v0.5.0

Operating system and Environment details

Running on CoreOS

Issue

Apparently a failing task was running and got restarted very frequently. At some point a nomad node started responding because it had a panic.

The task was failing due to a panic in runc. The docker image is a minimal golang image that is built FROM scratch but had an entrypoint. The doker image was built from https://github.com/mildred/hello/tree/1aa0b61eacc483eb7128bc7725d5d122cd8b8714

The job was showing the following logs repeatedly on its stderr:

goroutine 1 [running, locked to thread]:
panic(0x7eb840, 0xc8201269a0)
	/usr/lib/go1.6/src/runtime/panic.go:481 +0x3e6
github.com/urfave/cli.HandleAction.func1(0xc8200ef2e8)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/app.go:478 +0x38e
panic(0x7eb840, 0xc8201269a0)
	/usr/lib/go1.6/src/runtime/panic.go:443 +0x4e9
github.com/opencontainers/runc/libcontainer.(*LinuxFactory).StartInitialization.func1(0xc8200eebf8, 0xc82001a0c8, 0xc8200eed08)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/factory_linux.go:259 +0x136
github.com/opencontainers/runc/libcontainer.(*LinuxFactory).StartInitialization(0xc820051590, 0x7efe5b45b728, 0xc8201269a0)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/factory_linux.go:277 +0x5b1
main.glob.func8(0xc82006ea00, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/main_unix.go:26 +0x68
reflect.Value.call(0x750020, 0x901800, 0x13, 0x847d68, 0x4, 0xc8200ef268, 0x1, 0x1, 0x0, 0x0, ...)
	/usr/lib/go1.6/src/reflect/value.go:435 +0x120d
reflect.Value.Call(0x750020, 0x901800, 0x13, 0xc8200ef268, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/lib/go1.6/src/reflect/value.go:303 +0xb1
github.com/urfave/cli.HandleAction(0x750020, 0x901800, 0xc82006ea00, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/app.go:487 +0x2ee
github.com/urfave/cli.Command.Run(0x84ac18, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8e0b40, 0x51, 0x0, ...)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/command.go:191 +0xfec
github.com/urfave/cli.(*App).Run(0xc820001500, 0xc82000a100, 0x2, 0x2, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/Godeps/_workspace/src/github.com/urfave/cli/app.go:240 +0xaa4
main.main()
	/build/amd64-usr/var/tmp/portage/app-emulation/runc-1.0.0_rc2_p9/work/runc-1.0.0_rc2_p9/main.go:137 +0xe24
panic: standard_init_linux.go:178: exec user process caused "no such file or directory" [recovered]
	panic: standard_init_linux.go:178: exec user process caused "no such file or directory"

Reproduction steps

This is probably very difficult to reproduce

Nomad Client logs

Apr 26 07:47:15 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:15.347801 [ERR] driver.docker: error deregistering services: 1 error(s) occurred:
Apr 26 07:47:15 ip-10-0-0-232 nomad[1313]: * Get http://127.0.0.1:8500/v1/agent/services: dial tcp 127.0.0.1:8500: getsockopt: connection refused
Apr 26 07:47:15 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:15.477003 [INFO] client: task "61c778350859e2831c46b49862ad41c898247019-task-service" for alloc "7b359b93-e631-947a-23b2-d69f043be27d" failed:
 Wait returned exit code 2, signal 0, and error Docker container exited with non-zero exit code: 2
Apr 26 07:47:15 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:15.477028 [INFO] client: Restarting task "61c778350859e2831c46b49862ad41c898247019-task-service" for alloc "7b359b93-e631-947a-23b2-d69f043be27d" in 11.734430908s
Apr 26 07:47:28 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:28.151336 [INFO] driver.docker: created container 3cc52c5a2d65406c09ebce51ae07b35e8c497cede5d64436247b4c9ae4a2c91c
Apr 26 07:47:34 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:34.404699 [INFO] driver.docker: started container 3cc52c5a2d65406c09ebce51ae07b35e8c497cede5d64436247b4c9ae4a2c91c
Apr 26 07:47:34 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:34.597955 [WARN] client: error fetching stats of task 61c778350859e2831c46b49862ad41c898247019-task-service: stats collection hasn't started yet
Apr 26 07:47:35 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:35.816603 [WARN] client: error fetching stats of task 61c778350859e2831c46b49862ad41c898247019-task-service: stats collection hasn't started yet
Apr 26 07:47:37 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:37.110329 [WARN] client: error fetching stats of task 61c778350859e2831c46b49862ad41c898247019-task-service: stats collection hasn't started yet
Apr 26 07:47:37 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:37.484734 [ERR] driver.docker: error deregistering services: 1 error(s) occurred:
Apr 26 07:47:37 ip-10-0-0-232 nomad[1313]: * Get http://127.0.0.1:8500/v1/agent/services: dial tcp 127.0.0.1:8500: getsockopt: connection refused
Apr 26 07:47:37 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:37.613153 [INFO] client: task "61c778350859e2831c46b49862ad41c898247019-task-service" for alloc "7b359b93-e631-947a-23b2-d69f043be27d" failed: Wait returned exit code 2, signal 0, and error Docker container exited with non-zero exit code: 2
Apr 26 07:47:37 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:37.613175 [INFO] client: Restarting task "61c778350859e2831c46b49862ad41c898247019-task-service" for alloc "7b359b93-e631-947a-23b2-d69f043be27d" in 10.559060613s
Apr 26 07:47:49 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:49.118262 [INFO] driver.docker: created container 61c28837ba973245ff2cfa664678321b20fef333e730d50dc6bd9b3837f86527
Apr 26 07:47:55 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:55.146711 [INFO] driver.docker: started container 61c28837ba973245ff2cfa664678321b20fef333e730d50dc6bd9b3837f86527
Apr 26 07:47:55 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:55.430088 [WARN] client: error fetching stats of task 61c778350859e2831c46b49862ad41c898247019-task-service: stats collection hasn't started yet
Apr 26 07:47:56 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:56.603782 [ERR] driver.docker: error deregistering services: 1 error(s) occurred:
Apr 26 07:47:56 ip-10-0-0-232 nomad[1313]: * Get http://127.0.0.1:8500/v1/agent/services: dial tcp 127.0.0.1:8500: getsockopt: connection refused
Apr 26 07:47:56 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:56.724632 [WARN] client: error fetching stats of task 61c778350859e2831c46b49862ad41c898247019-task-service: stats collection hasn't started yet
Apr 26 07:47:56 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:56.804880 [INFO] client: task "61c778350859e2831c46b49862ad41c898247019-task-service" for alloc "7b359b93-e631-947a-23b2-d69f043be27d" failed: Wait returned exit code 2, signal 0, and error Docker container exited with non-zero exit code: 2
Apr 26 07:47:56 ip-10-0-0-232 nomad[1313]:     2017/04/26 07:47:56.805270 [INFO] client: Restarting task "61c778350859e2831c46b49862ad41c898247019-task-service" for alloc "7b359b93-e631-947a-23b2-d69f043be27d" in 11.577785279s
Apr 26 07:48:01 ip-10-0-0-232 nomad[1313]: panic: send on closed channel
Apr 26 07:48:01 ip-10-0-0-232 nomad[1313]: goroutine 1794115 [running]:
Apr 26 07:48:01 ip-10-0-0-232 nomad[1313]: panic(0xfbb8a0, 0xc420270450)
Apr 26 07:48:01 ip-10-0-0-232 nomad[1313]:         /usr/local/go/src/runtime/panic.go:500 +0x1a1
Apr 26 07:48:01 ip-10-0-0-232 nomad[1313]: github.com/hashicorp/nomad/command/agent.(*StreamFramer).run.func2(0xc4205f8000)
Apr 26 07:48:01 ip-10-0-0-232 nomad[1313]:         /home/schmichael/go/src/github.com/hashicorp/nomad/command/agent/fs_endpoint.go:353 +0x4c3
Apr 26 07:48:01 ip-10-0-0-232 nomad[1313]: created by github.com/hashicorp/nomad/command/agent.(*StreamFramer).run
Apr 26 07:48:01 ip-10-0-0-232 nomad[1313]:         /home/schmichael/go/src/github.com/hashicorp/nomad/command/agent/fs_endpoint.go:359 +0xa9
Apr 26 07:48:01 ip-10-0-0-232 systemd[1]: nomad-client.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

Job file (if appropriate)

{
    "Job": {
        "AllAtOnce": false,
        "Constraints": [
            {
                "LTarget": "${meta.sqsc.environment}",
                "Operand": "=",
                "RTarget": "staging"
            },
            {
                "LTarget": "${meta.sqsc.project}",
                "Operand": "=",
                "RTarget": "shanti-dev1"
            },
            {
                "LTarget": "${attr.kernel.name}",
                "Operand": "=",
                "RTarget": "linux"
            }
        ],
        "CreateIndex": 115195,
        "Datacenters": [
            "staging-sqsc"
        ],
        "ID": "61c778350859e2831c46b49862ad41c898247019-service",
        "JobModifyIndex": 124235,
        "Meta": null,
        "ModifyIndex": 124238,
        "Name": "61c778350859e2831c46b49862ad41c898247019-service",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "europe",
        "Status": "running",
        "StatusDescription": "",
        "TaskGroups": [
            {
                "Constraints": null,
                "Count": 3,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Name": "sqsc-job",
                "RestartPolicy": {
                    "Attempts": 3,
                    "Delay": 10000000000,
                    "Interval": 45000000000,
                    "Mode": "delay"
                },
                "Tasks": [
                    {
                        "Artifacts": [],
                        "Config": {
                            "port_map": [
                                {
                                    "web": 80.0
                                }
                            ],
                            "force_pull": true,
                            "image": "095348363195.dkr.ecr.eu-west-1.amazonaws.com/client-shanti-dev1-staging-mildred-hello:v7"
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": {
                            "DB_NAME": "dbstaging",
                            "DB_PASSWORD": "xxx",
                            "DB_PORT": "5432",
                            "DB_USERNAME": "dbadmin",
                            "DB_ENGINE": "postgres",
                            "DB_HOST": "db-shanti-dev1-staging.cvnzekqpf90m.eu-west-1.rds.amazonaws.com"
                        },
                        "KillTimeout": 5000000000,
                        "Leader": false,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "61c778350859e2831c46b49862ad41c898247019-task-service",
                        "Resources": {
                            "CPU": 500,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 256,
                            "Networks": [
                                {
                                    "CIDR": "",
                                    "DynamicPorts": null,
                                    "IP": "",
                                    "MBits": 10,
                                    "Public": false,
                                    "ReservedPorts": [
                                        {
                                            "Label": "web",
                                            "Value": 80
                                        }
                                    ]
                                }
                            ]
                        },
                        "Services": [
                            {
                                "Checks": [
                                    {
                                        "Args": null,
                                        "Command": "",
                                        "Id": "",
                                        "InitialStatus": "",
                                        "Interval": 10000000000,
                                        "Name": "alive",
                                        "Path": "",
                                        "PortLabel": "",
                                        "Protocol": "",
                                        "Timeout": 2000000000,
                                        "Type": "tcp"
                                    }
                                ],
                                "Id": "",
                                "Name": "61c778350859e2831c46b49862ad41c898247019-task-service",
                                "PortLabel": "web",
                                "Tags": [
                                    "global"
                                ]
                            }
                        ],
                        "Templates": [
                            {
                                "ChangeMode": "restart",
                                "ChangeSignal": "",
                                "DestPath": "restart",
                                "EmbeddedTmpl": "7",
                                "LeftDelim": "{{",
                                "Perms": "0644",
                                "RightDelim": "}}",
                                "SourcePath": "",
                                "Splay": 0
                            }
                        ],
                        "User": "",
                        "Vault": null
                    }
                ]
            }
        ],
        "Type": "service",
        "Update": {
            "MaxParallel": 1,
            "Stagger": 10000000000
        },
        "VaultToken": ""
    }
}
@schmichael
Copy link
Member

This race condition was fixed in PR #2007 and released in Nomad 0.5.1. I would recommend upgrading to Nomad 0.5.6 if possible as it is the most stable release of the 0.5 series.

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants