
Service jobs fail to restart with 'text file busy' error #1697

Closed
parasyte opened this issue Sep 9, 2016 · 18 comments · Fixed by #1742

Comments

parasyte (Contributor) commented Sep 9, 2016

Nomad version

Nomad v0.4.1

This bug doesn't happen with 0.4.0.

Operating system and Environment details

$ uname -a
Linux hostname 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"

Issue

When a service job terminates abnormally, Nomad sometimes fails to restart it. It looks like Nomad is trying to recreate the chroot, even though it already exists. Here are some example logs from nomad alloc-status:

Time                   Type            Description
09/09/16 13:35:13 PDT  Not Restarting  Error was unrecoverable
09/09/16 13:35:13 PDT  Driver Failure  failed to start task 'crash' for alloc 'cdc3dbd3-ed99-4430-369d-d04ec846f1e0': Couldn't create destination file /opt/nomad/data/alloc/cdc3dbd3-ed99-4430-369d-d04ec846f1e0/crash/usr/bin/node: open /opt/nomad/data/alloc/cdc3dbd3-ed99-4430-369d-d04ec846f1e0/crash/usr/bin/node: text file busy
09/09/16 13:34:57 PDT  Restarting      Task restarting in 15.897315851s
09/09/16 13:34:57 PDT  Terminated      Exit Code: 3
09/09/16 13:34:57 PDT  Started         Task started by client
09/09/16 13:34:53 PDT  Received        Task received by client
Time                   Type            Description
09/09/16 13:22:02 PDT  Not Restarting  Error was unrecoverable
09/09/16 13:22:02 PDT  Driver Failure  failed to start task 'crash' for alloc 'dc2d245d-114d-c48e-e611-6ad9bf3527d3': Couldn't create destination file /opt/nomad/data/alloc/dc2d245d-114d-c48e-e611-6ad9bf3527d3/crash/usr/bin/curl: open /opt/nomad/data/alloc/dc2d245d-114d-c48e-e611-6ad9bf3527d3/crash/usr/bin/curl: text file busy
09/09/16 13:21:46 PDT  Restarting      Task restarting in 15.812029815s
09/09/16 13:21:46 PDT  Terminated      Exit Code: 3
09/09/16 13:21:46 PDT  Started         Task started by client
09/09/16 13:21:33 PDT  Restarting      Exceeded allowed attempts, applying a delay - Task restarting in 12.178439186s
09/09/16 13:21:33 PDT  Terminated      Exit Code: 3
09/09/16 13:21:33 PDT  Started         Task started by client
09/09/16 13:21:18 PDT  Restarting      Task restarting in 15.234249696s
09/09/16 13:21:18 PDT  Terminated      Exit Code: 3

Reproduction steps

Start a service job whose task exits with a code > 0. The job restart will eventually fail, usually within 5 minutes.
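
For context, the failure mode itself is easy to demonstrate outside Nomad. Below is a minimal standalone Go sketch (hypothetical, not Nomad code; the paths and file names are arbitrary): on Linux, opening a file for writing while any process is still executing it fails with ETXTBSY, which is exactly the "text file busy" error above. If Nomad rebuilds the chroot while an old task process still has the copied binary mapped, the copy fails the same way.

package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
)

func main() {
	// Copy a host binary to a scratch path, the way the exec driver
	// copies binaries such as /usr/bin/node into the task's chroot.
	src, err := os.Open("/bin/sleep")
	if err != nil {
		panic(err)
	}
	dst, err := os.OpenFile("/tmp/sleep-copy", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0755)
	if err != nil {
		panic(err)
	}
	if _, err := io.Copy(dst, src); err != nil {
		panic(err)
	}
	src.Close()
	dst.Close()

	// Run the copy so the kernel holds it open for execution.
	cmd := exec.Command("/tmp/sleep-copy", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	defer cmd.Process.Kill()

	// Re-opening the file for writing while it is still executing, as a
	// chroot rebuild would, fails with ETXTBSY.
	_, err = os.OpenFile("/tmp/sleep-copy", os.O_WRONLY|os.O_TRUNC, 0755)
	fmt.Println(err) // open /tmp/sleep-copy: text file busy
}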

Nomad Server logs (if appropriate)

2016/09/09 13:22:02 [DEBUG] plugin: starting plugin: /opt/nomad/nomad []string{"/opt/nomad/nomad", "executor", "/opt/nomad/data/alloc/dc2d245d-114d-c48e-e611-6ad9bf3527d3/crash/crash-executor.out"}
2016/09/09 13:22:02 [DEBUG] plugin: waiting for RPC address for: /opt/nomad/nomad
2016/09/09 13:22:02 [DEBUG] plugin: nomad: 2016/09/09 13:22:02 [DEBUG] plugin: plugin address: unix /tmp/plugin518908642
2016/09/09 13:22:02 [DEBUG] plugin: /opt/nomad/nomad: plugin process exited
2016/09/09 13:22:02.247887 [INFO] client: Not restarting task: crash for alloc: dc2d245d-114d-c48e-e611-6ad9bf3527d3
2016/09/09 13:22:02.499250 [DEBUG] client: updated allocations at index 127 (pulled 0) (filtered 3)
2016/09/09 13:22:02.500063 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 3)

Nomad Client logs (if appropriate)

Job file (if appropriate)

{
    "Job": {
        "Region": "global",
        "ID": "crash",
        "Name": "crash",
        "Type": "service",
        "Priority": 50,
        "AllAtOnce": false,
        "Datacenters": [
            "vagrant"
        ],
        "Constraints": null,
        "TaskGroups": [
            {
                "Name": "crash",
                "Count": 1,
                "Constraints": null,
                "Tasks": [
                    {
                        "Name": "crash",
                        "Driver": "exec",
                        "User": "",
                        "Config": {
                            "command": "/usr/bin/node",
                            "args": [
                                "-e",
                                "process.exit(3)"
                            ]
                        },
                        "Constraints": null,
                        "Env": null,
                        "Services": null,
                        "Resources": {
                            "CPU": 30,
                            "MemoryMB": 100,
                            "DiskMB": 100,
                            "IOPS": 0,
                            "Networks": [
                                {
                                    "Public": false,
                                    "CIDR": "",
                                    "ReservedPorts": null,
                                    "DynamicPorts": null,
                                    "IP": "",
                                    "MBits": 10
                                }
                            ]
                        },
                        "Meta": null,
                        "KillTimeout": 5000000000,
                        "LogConfig": {
                            "MaxFiles": 5,
                            "MaxFileSizeMB": 10
                        },
                        "Artifacts": null
                    }
                ],
                "RestartPolicy": {
                    "Interval": 60000000000,
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Mode": "delay"
                },
                "Meta": null
            }
        ],
        "Update": {
            "Stagger": 0,
            "MaxParallel": 0
        },
        "Periodic": null,
        "Meta": null
    }
}
dadgar added this to the v0.5.0 milestone Sep 12, 2016
dadgar (Contributor) commented Sep 12, 2016

@parasyte Even when you run the same repro case on 0.4.0 it doesn't happen, but it is reliably happening on 0.4.1?

parasyte (Contributor, Author) commented

@dadgar Yeah, that has been my experience so far. It happens often enough in 0.4.1 to be considered "reliably" reproducible, though it's not a 100% repro.

I first noticed the "text file busy" issue while working on #1518; it would show up from time to time when my client crashed from bugs. It didn't happen at all with the same patch against 0.4.0, FWIW.

dadgar (Contributor) commented Sep 12, 2016

Okay, thanks for the info! Not sure what has changed in that part of the code that would cause this, but I will look into it.

dennybaa commented

@dadgar Same issue :(

09/15/16 22:24:38 UTC  Driver Failure  failed to start task 'backup' for alloc 'a6bcf389-8484-fe88-780e-9e750ae41e54': fork/exec bin/ls: text file busy

but when the command is just set to command = "ls", it succeeds without issues.

Additional info: the Nomad runtime is an Alpine Docker container.

diptanu (Contributor) commented Sep 15, 2016

@dennybaa @parasyte Can you use lsof and tell us which process is still holding onto those files?

dennybaa commented Sep 15, 2016

@diptanu Do you have any idea how?

/ # docker exec -it nomad lsof -n
1   /usr/local/bin/nomad    pipe:[17210]
1   /usr/local/bin/nomad    pipe:[17211]
1   /usr/local/bin/nomad    pipe:[17212]
1   /usr/local/bin/nomad    socket:[17414]
1   /usr/local/bin/nomad    anon_inode:[eventpoll]
1   /usr/local/bin/nomad    socket:[17416]
1   /usr/local/bin/nomad    socket:[32184]
76  /bin/busybox    /dev/pts/3
76  /bin/busybox    /dev/pts/3
76  /bin/busybox    /dev/pts/3
76  /bin/busybox    /dev/tty
117 /usr/bin/docker /dev/pts/3
117 /usr/bin/docker /dev/pts/3
117 /usr/bin/docker /dev/pts/3
117 /usr/bin/docker socket:[32901]
117 /usr/bin/docker anon_inode:[eventpoll]
117 /usr/bin/docker socket:[32904]

Because under the exec driver, the following config

    config {
      command = "lsof"
      args    = ["-n"]
    }

produces NO OUTPUT ((, though it finishes with a 0 return code.

diptanu (Contributor) commented Sep 15, 2016

@dennybaa On the node where the allocation was running just do lsof /path/to/file

dennybaa commented Sep 15, 2016

@diptanu You can see the list above ^^^ I was just saying that I cannot provide the output from inside the fork/exec task, since it generates no logs.

diptanu (Contributor) commented Sep 15, 2016

@dennybaa You are running that lsof command inside a container. I wanted to see the output from the node where the Nomad client that failed to start the allocation is running.

dennybaa commented Sep 15, 2016

@diptanu Yes, the node where the client is running is a container; my Nomad clients are containers. Or do you want to see the output from the host where the Nomad client container is running?

dennybaa commented

So it might be a Docker-related issue. @diptanu, as you can see, I'm not running Nomad on a bare host; Nomad is inside a container.

docker exec -it nomad ps -A uxw
PID   USER     TIME   COMMAND
    1 root       0:19 /usr/local/bin/nomad agent -config /config
  137 root       0:00 ps -A uxw

jshaw86 commented Sep 21, 2016

I sent this to @diptanu directly, but just for reference, here's a minimal example.

chroot_issue.zip

jshaw86 commented Oct 4, 2016

chroot_issue.zip
Final example used to repro, for reference.

dadgar (Contributor) commented Oct 13, 2016

@jshaw86 This has been fixed, right?

jshaw86 commented Oct 13, 2016

@dadgar Yeah, we aren't seeing the issue. I think @diptanu put a note in the code saying you wanted to revisit it in 0.6.0 or so, maybe because the conditional copy was thought of as a hack?

dadgar (Contributor) commented Oct 13, 2016

Yeah, I just wanted to make sure it was fixed! We shouldn't try to rebuild the chroot across restarts; it's a waste of time and IO.
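
For reference, a guard like the "conditional copy" mentioned above might look roughly like the sketch below. This is hypothetical, not Nomad's actual implementation (the package name, helper name, and fixed 0755 mode are illustrative): when rebuilding the task directory for a restart, files already present in the chroot are left alone instead of being truncated, which is the operation that fails with ETXTBSY while an old process still has the binary mapped.

package chrootutil // hypothetical package, for illustration only

import (
	"io"
	"os"
)

// copyFileIfMissing copies src to dst only if dst does not exist yet,
// so a restart reuses the chroot built on the first start instead of
// truncating binaries that may still be executing.
func copyFileIfMissing(dst, src string) error {
	if _, err := os.Stat(dst); err == nil {
		return nil // already copied on a previous start; skip
	} else if !os.IsNotExist(err) {
		return err
	}
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0755)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}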

parasyte (Contributor, Author) commented

We back-ported the patches to 0.4.1 and have not hit the problem, even after rigorously stress-testing the code path.

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 19, 2022