job submitted with tcp docker endpoint fails after 60 seconds #1184
Comments
I was investigating this as well, but on Windows, which now supports Docker containers. Images there range in size from hundreds of MB to several GB, and I noticed in the code that the default timeout is hard-coded at 1 minute. On a slow network or in a virtualized environment I can see this being problematic. It would be nice if we could define a timeout option in the task block, like below
When doing this on a Linux box, the images were much smaller and didn't seem to have any issues. The error message you received is different from the one I received:
This will also then fail at 2 minutes. I changed the hard-coded value to 2m, then 10m, and it failed at each point. The actual problem is that a timeout is being set on a command that might never end. The wait command should only return when the container terminates, and that could essentially be never. It's the same thing as fork() followed by wait(). Docker remote API definition:
I should add that there is an easy workaround for this: set the endpoint to the unix socket instead of TCP and it's fine. The unix socket must use a different mechanism without a timeout.
@lfarnell actually yours is a different but related issue. I also saw this when testing large images pulled over a slow network. I agree with your solution that there should be a timeout for this; however, I would set it at the image level.
If you have a question, prepend your issue with [question] or preferably use the nomad mailing list.
If filing a bug please include the following:
Nomad version
Output from nomad version
Nomad v0.3.2 - also tried 4.0 dev
Operating system and Environment details
Centos 7 current (yum updated)
Issue
Setting the client docker.endpoint option to a TCP address causes the container to fail and be restarted after 60 seconds.
Reproduction steps
Run a single-node Nomad server/client, not in dev mode, with the client option set to
"docker.endpoint" = "tcp://0.0.0.0:2375"
or
"docker.endpoint" = "tcp://127.0.0.1:2375"
Run any simple job.
Wait 60 seconds.
Nomad Server logs (if appropriate)
2016/05/18 15:31:55 [DEBUG] sched: <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>: allocs: (place 0) (update 1) (migrate 0) (stop 0) (ignore 0)
2016/05/18 15:31:55 [DEBUG] sched: <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>: 1 in-place updates of 1
2016/05/18 15:31:55 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03 (294.158µs)
2016/05/18 15:31:55 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03/allocations (326.546µs)
2016/05/18 15:31:55 [DEBUG] worker: submitted plan for evaluation 88aa1256-d141-8eb5-869b-128136e3ec03
2016/05/18 15:31:55 [DEBUG] sched: <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>: setting status to complete
2016/05/18 15:31:55 [DEBUG] client: updated allocations at index 84 (pulled 1) (filtered 0)
2016/05/18 15:31:55 [DEBUG] client: allocs: (added 0) (removed 0) (updated 1) (ignore 0)
2016/05/18 15:31:55 [DEBUG] worker: updated evaluation <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>
2016/05/18 15:31:55 [DEBUG] worker: ack for evaluation 88aa1256-d141-8eb5-869b-128136e3ec03
2016/05/18 15:31:56 [DEBUG] client: state changed, updating node.
2016/05/18 15:31:56 [DEBUG] client: node registration complete
2016/05/18 15:31:56 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03 (146.105µs)
2016/05/18 15:31:56 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03/allocations (149.335µs)
2016/05/18 15:32:52 [ERR] driver.docker: failed to wait for 42f1211121489678df7578b91946064ad835c8040c64eb1945a1d954effbdf64; container already terminated
2016/05/18 15:32:52 [INFO] client: task "alpine" for alloc "209570af-5eeb-db1b-58cf-5181524c7d0a" failed: Wait returned exit code 0, signal 0, and error Post http://127.0.0.1:2375/containers/42f1211121489678df7578b91946064ad835c8040c64eb1945a1d954effbdf64/wait: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2016/05/18 15:32:52 [INFO] client: Restarting task "alpine" for alloc "209570af-5eeb-db1b-58cf-5181524c7d0a" in 17.084959638s
2016/05/18 15:32:52 [DEBUG] plugin: /tmp/nomad/nomad: plugin process exited
2016/05/18 15:32:52 [DEBUG] client: updated allocations at index 88 (pulled 0) (filtered 1)
2016/05/18 15:32:52 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
Nomad Client logs (if appropriate)
Job file (if appropriate)
job "test" {
  region = "global"
  datacenters = ["dc1"]
  type = "service"
  priority = 50
  group "alpine" {
    count = 1
  }