Service jobs fail to restart with 'text file busy' error #1697
Comments
@parasyte So the same repro case never triggers this on 0.4.0, but it happens reliably on 0.4.1?
@dadgar Yeah, that has been my experience so far. It happens often enough in 0.4.1 to be considered "reliably" reproducible. It's not a 100% repro, though. I first noticed the "text file busy" issue while working on #1518; it would show up from time to time when my client crashed from bugs. It didn't happen at all with the same patch against 0.4.0... FWIW
Okay, thanks for the info! I'm not sure what has changed in that part of the code that would cause this, but I will look into it.
@dadgar Same issue here :(
but when command is just set to
Additional info: the Nomad runtime is an Alpine Docker container.
@diptanu Do you have any idea how?
Because under exec with the following config
gives NO OUTPUT :(, though it finishes with a 0 return code
@dennybaa On the node where the allocation was running just do
@diptanu You can see this list above ^^^ I was just saying that I cannot provide the output from inside the
@dennybaa You are running that
@diptanu Yes, the node where the client is running is a container; my Nomad clients are containers. Or do you want to see the output from the host where the Nomad client container is running?
So it might be a Docker-related issue. @diptanu As you can see, I'm not running Nomad on a bare host; Nomad is inside a container.
I sent this to @diptanu directly, but just for reference, here's a minimal example.
chroot_issue.zip
@jshaw86 This has been fixed, right?
Yeah, I just wanted to make sure it was fixed! And yeah, we shouldn't try to rebuild the chroot across restarts; it's a waste of time and IO.
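For readers following along, here is a minimal sketch of the "don't rebuild the chroot on restart" idea. It is not Nomad's code (Nomad is written in Go); populate_chroot and the file names are hypothetical, and the only assumption about the failure is that reopening a still-running executable for writing on Linux typically fails with ETXTBSY ("text file busy").

```python
import os
import shutil
import tempfile


def populate_chroot(chroot_dir, sources):
    """Copy each file in `sources` into `chroot_dir`, skipping existing files.

    Hypothetical helper for illustration only; this is not Nomad's code.
    Returns the destination paths that were actually copied.
    """
    os.makedirs(chroot_dir, exist_ok=True)
    copied = []
    for src in sources:
        dest = os.path.join(chroot_dir, os.path.basename(src))
        if os.path.exists(dest):
            # Already in place from an earlier start. Leaving it alone avoids
            # reopening a possibly still-running executable for writing, which
            # on Linux typically fails with ETXTBSY ("text file busy").
            continue
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a binary that would be embedded in the chroot.
        src = os.path.join(tmp, "tool")
        with open(src, "w") as f:
            f.write("#!/bin/sh\nexit 1\n")
        chroot = os.path.join(tmp, "chroot")
        first = populate_chroot(chroot, [src])   # first start: file is copied
        second = populate_chroot(chroot, [src])  # restart: nothing is rebuilt
        print(len(first), len(second))
```

The existence check is the whole point: a restart reuses whatever the first start already placed in the chroot, so no file that may still be executing is touched and no IO is repeated.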
We back-ported the patches to 0.4.1, and have not hit the problem, even after very rigorously stress-testing the code path.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.4.1
This bug doesn't happen with 0.4.0.
Operating system and Environment details
Issue
When a service job terminates abnormally, Nomad sometimes fails to restart the job. It looks like Nomad is trying to recreate the chroot, even though it already exists. Here are some example logs from nomad alloc-status:
Reproduction steps
Start a service job that returns exit code >0. Job restart will eventually fail, usually within 5 minutes.
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
Job file (if appropriate)