
Service jobs fail to restart with 'text file busy' error #1697

Closed
parasyte opened this issue Sep 9, 2016 · 18 comments · Fixed by #1742

Comments

parasyte (Contributor) commented Sep 9, 2016

Nomad version

Nomad v0.4.1

This bug doesn't happen with 0.4.0.

Operating system and Environment details

$ uname -a
Linux hostname 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"

Issue

When a service job terminates abnormally, Nomad sometimes fails to restart it. It looks like Nomad is trying to recreate the chroot, even though it already exists. Here are some example logs from nomad alloc-status:

Time                   Type            Description
09/09/16 13:35:13 PDT  Not Restarting  Error was unrecoverable
09/09/16 13:35:13 PDT  Driver Failure  failed to start task 'crash' for alloc 'cdc3dbd3-ed99-4430-369d-d04ec846f1e0': Couldn't create destination file /opt/nomad/data/alloc/cdc3dbd3-ed99-4430-369d-d04ec846f1e0/crash/usr/bin/node: open /opt/nomad/data/alloc/cdc3dbd3-ed99-4430-369d-d04ec846f1e0/crash/usr/bin/node: text file busy
09/09/16 13:34:57 PDT  Restarting      Task restarting in 15.897315851s
09/09/16 13:34:57 PDT  Terminated      Exit Code: 3
09/09/16 13:34:57 PDT  Started         Task started by client
09/09/16 13:34:53 PDT  Received        Task received by client
Time                   Type            Description
09/09/16 13:22:02 PDT  Not Restarting  Error was unrecoverable
09/09/16 13:22:02 PDT  Driver Failure  failed to start task 'crash' for alloc 'dc2d245d-114d-c48e-e611-6ad9bf3527d3': Couldn't create destination file /opt/nomad/data/alloc/dc2d245d-114d-c48e-e611-6ad9bf3527d3/crash/usr/bin/curl: open /opt/nomad/data/alloc/dc2d245d-114d-c48e-e611-6ad9bf3527d3/crash/usr/bin/curl: text file busy
09/09/16 13:21:46 PDT  Restarting      Task restarting in 15.812029815s
09/09/16 13:21:46 PDT  Terminated      Exit Code: 3
09/09/16 13:21:46 PDT  Started         Task started by client
09/09/16 13:21:33 PDT  Restarting      Exceeded allowed attempts, applying a delay - Task restarting in 12.178439186s
09/09/16 13:21:33 PDT  Terminated      Exit Code: 3
09/09/16 13:21:33 PDT  Started         Task started by client
09/09/16 13:21:18 PDT  Restarting      Task restarting in 15.234249696s
09/09/16 13:21:18 PDT  Terminated      Exit Code: 3

Reproduction steps

Start a service job whose task exits with a code > 0. The job restart will eventually fail, usually within 5 minutes.
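
For context, the failure mode itself is easy to demonstrate outside Nomad. Below is a minimal standalone Go sketch (hypothetical, not Nomad code; the paths and file names are arbitrary): on Linux, opening a file for writing while any process is still executing it fails with ETXTBSY, which is exactly the "text file busy" error above. If Nomad rebuilds the chroot while an old task process still has the copied binary mapped, the copy fails the same way.

package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
)

func main() {
	// Copy a host binary to a scratch path, the way the exec driver
	// copies binaries such as /usr/bin/node into the task's chroot.
	src, err := os.Open("/bin/sleep")
	if err != nil {
		panic(err)
	}
	dst, err := os.OpenFile("/tmp/sleep-copy", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0755)
	if err != nil {
		panic(err)
	}
	if _, err := io.Copy(dst, src); err != nil {
		panic(err)
	}
	src.Close()
	dst.Close()

	// Run the copy so the kernel holds it open for execution.
	cmd := exec.Command("/tmp/sleep-copy", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	defer cmd.Process.Kill()

	// Re-opening the file for writing while it is still executing, as a
	// chroot rebuild would, fails with ETXTBSY.
	_, err = os.OpenFile("/tmp/sleep-copy", os.O_WRONLY|os.O_TRUNC, 0755)
	fmt.Println(err) // open /tmp/sleep-copy: text file busy
}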

Nomad Server logs (if appropriate)

2016/09/09 13:22:02 [DEBUG] plugin: starting plugin: /opt/nomad/nomad []string{"/opt/nomad/nomad", "executor", "/opt/nomad/data/alloc/dc2d245d-114d-c48e-e611-6ad9bf3527d3/crash/crash-executor.out"}
2016/09/09 13:22:02 [DEBUG] plugin: waiting for RPC address for: /opt/nomad/nomad
2016/09/09 13:22:02 [DEBUG] plugin: nomad: 2016/09/09 13:22:02 [DEBUG] plugin: plugin address: unix /tmp/plugin518908642
2016/09/09 13:22:02 [DEBUG] plugin: /opt/nomad/nomad: plugin process exited
2016/09/09 13:22:02.247887 [INFO] client: Not restarting task: crash for alloc: dc2d245d-114d-c48e-e611-6ad9bf3527d3
2016/09/09 13:22:02.499250 [DEBUG] client: updated allocations at index 127 (pulled 0) (filtered 3)
2016/09/09 13:22:02.500063 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 3)

Nomad Client logs (if appropriate)

Job file (if appropriate)

{
    "Job": {
        "Region": "global",
        "ID": "crash",
        "Name": "crash",
        "Type": "service",
        "Priority": 50,
        "AllAtOnce": false,
        "Datacenters": [
            "vagrant"
        ],
        "Constraints": null,
        "TaskGroups": [
            {
                "Name": "crash",
                "Count": 1,
                "Constraints": null,
                "Tasks": [
                    {
                        "Name": "crash",
                        "Driver": "exec",
                        "User": "",
                        "Config": {
                            "command": "/usr/bin/node",
                            "args": [
                                "-e",
                                "process.exit(3)"
                            ]
                        },
                        "Constraints": null,
                        "Env": null,
                        "Services": null,
                        "Resources": {
                            "CPU": 30,
                            "MemoryMB": 100,
                            "DiskMB": 100,
                            "IOPS": 0,
                            "Networks": [
                                {
                                    "Public": false,
                                    "CIDR": "",
                                    "ReservedPorts": null,
                                    "DynamicPorts": null,
                                    "IP": "",
                                    "MBits": 10
                                }
                            ]
                        },
                        "Meta": null,
                        "KillTimeout": 5000000000,
                        "LogConfig": {
                            "MaxFiles": 5,
                            "MaxFileSizeMB": 10
                        },
                        "Artifacts": null
                    }
                ],
                "RestartPolicy": {
                    "Interval": 60000000000,
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Mode": "delay"
                },
                "Meta": null
            }
        ],
        "Update": {
            "Stagger": 0,
            "MaxParallel": 0
        },
        "Periodic": null,
        "Meta": null
    }
}
dadgar added this to the v0.5.0 milestone Sep 12, 2016
dadgar (Contributor) commented Sep 12, 2016

@parasyte Even when you run the same repro case on 0.4.0 it doesn't happen, but it is reliably happening on 0.4.1?

parasyte (Contributor, Author) commented

@dadgar Yeah, that has been my experience so far. It happens often enough in 0.4.1 to be considered "reliably" reproducible, though it's not a 100% repro.

I first noticed the "text file busy" issue while working on #1518; it would show up from time to time when my client crashed from bugs. It didn't happen at all with the same patch against 0.4.0, FWIW.

dadgar (Contributor) commented Sep 12, 2016

Okay, thanks for the info! Not sure what has changed in that part of the code that would cause this, but I will look into it.

dennybaa commented

@dadgar Same issue :(

09/15/16 22:24:38 UTC  Driver Failure  failed to start task 'backup' for alloc 'a6bcf389-8484-fe88-780e-9e750ae41e54': fork/exec bin/ls: text file busy

but when the command is just set to command = "ls", it succeeds without issues.

Additional info: the Nomad runtime is an Alpine Docker container.

diptanu (Contributor) commented Sep 15, 2016

@dennybaa @parasyte Can you use lsof and tell us which process is still holding onto those files?

dennybaa commented Sep 15, 2016

@diptanu Do you have any idea how?

/ # docker exec -it nomad lsof -n
1   /usr/local/bin/nomad    pipe:[17210]
1   /usr/local/bin/nomad    pipe:[17211]
1   /usr/local/bin/nomad    pipe:[17212]
1   /usr/local/bin/nomad    socket:[17414]
1   /usr/local/bin/nomad    anon_inode:[eventpoll]
1   /usr/local/bin/nomad    socket:[17416]
1   /usr/local/bin/nomad    socket:[32184]
76  /bin/busybox    /dev/pts/3
76  /bin/busybox    /dev/pts/3
76  /bin/busybox    /dev/pts/3
76  /bin/busybox    /dev/tty
117 /usr/bin/docker /dev/pts/3
117 /usr/bin/docker /dev/pts/3
117 /usr/bin/docker /dev/pts/3
117 /usr/bin/docker socket:[32901]
117 /usr/bin/docker anon_inode:[eventpoll]
117 /usr/bin/docker socket:[32904]

Because under the exec driver, the following config

    config {
      command = "lsof"
      args    = ["-n"]
    }

produces NO OUTPUT ((, though it finishes with a 0 return code.

diptanu (Contributor) commented Sep 15, 2016

@dennybaa On the node where the allocation was running just do lsof /path/to/file

dennybaa commented Sep 15, 2016

@diptanu You can see the list above ^^^ I was just saying that I cannot provide the output from inside the fork/exec task, since it generates no logs.

diptanu (Contributor) commented Sep 15, 2016

@dennybaa You are running that lsof command inside a container. I wanted to see the output from the node where the Nomad client that failed to start the allocation is running.

dennybaa commented Sep 15, 2016

@diptanu Yes, the node where the client is running is a container; my Nomad clients are containers. Or do you want to see the output from the host where the Nomad client container is running?

dennybaa commented

So it might be a Docker-related issue. @diptanu, as you can see, I'm not running Nomad on a bare host; Nomad is inside a container.

docker exec -it nomad ps -A uxw
PID   USER     TIME   COMMAND
    1 root       0:19 /usr/local/bin/nomad agent -config /config
  137 root       0:00 ps -A uxw

jshaw86 commented Sep 21, 2016

I sent this to @diptanu directly, but just for reference, here's a minimal example.

chroot_issue.zip

jshaw86 commented Oct 4, 2016

chroot_issue.zip
Final example used to repro, for reference.

dadgar (Contributor) commented Oct 13, 2016

@jshaw86 This has been fixed, right?

jshaw86 commented Oct 13, 2016

@dadgar Yeah, we aren't seeing the issue. I think @diptanu put a note in the code saying you wanted to revisit it in 0.6.0 or so, maybe because the conditional copy was thought of as a hack?

dadgar (Contributor) commented Oct 13, 2016

Yeah, I just wanted to make sure it was fixed! We shouldn't try to rebuild the chroot across restarts; it's a waste of time and IO.
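
For reference, a guard like the "conditional copy" mentioned above might look roughly like the sketch below. This is hypothetical, not Nomad's actual implementation (the package name, helper name, and fixed 0755 mode are illustrative): when rebuilding the task directory for a restart, files already present in the chroot are left alone instead of being truncated, which is the operation that fails with ETXTBSY while an old process still has the binary mapped.

package chrootutil // hypothetical package, for illustration only

import (
	"io"
	"os"
)

// copyFileIfMissing copies src to dst only if dst does not exist yet,
// so a restart reuses the chroot built on the first start instead of
// truncating binaries that may still be executing.
func copyFileIfMissing(dst, src string) error {
	if _, err := os.Stat(dst); err == nil {
		return nil // already copied on a previous start; skip
	} else if !os.IsNotExist(err) {
		return err
	}
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0755)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}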

parasyte (Contributor, Author) commented

We back-ported the patches to 0.4.1 and have not hit the problem, even after rigorously stress-testing the code path.

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 19, 2022