Allocation never starting ([DEBUG] client: added alloc "<id>" to blocked queue for previous allocation "<id>") #2711

Closed
mildred opened this issue Jun 16, 2017 · 11 comments

@mildred
Contributor

mildred commented Jun 16, 2017

If filing a bug, please include the following:

Nomad version

Nomad v0.5.6

Operating system and Environment details

Agent running on CoreOS, using the official binary. Not running in a container.

Issue

After updating a job specification, the generated allocation stays pending and never starts.

Task is pending:

$ nomad status 7ffa74671955883712c5efb47258f3dfd5000712-precommand
ID            = 7ffa74671955883712c5efb47258f3dfd5000712-precommand
Name          = 7ffa74671955883712c5efb47258f3dfd5000712-precommand
Type          = batch
Priority      = 75
Datacenters   = staging-sqsc
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
sqsc-job    0       1         0        0       0         0

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status   Created At
82e0f96a  76e4bc60  ec23c0db  sqsc-job    run      pending  06/16/17 10:11:19 CEST

The allocation is not recognized by the client:

$ NOMAD_ADDR=http://10.0.0.112:4646 bin/nomad alloc-status 82e0f96a
ID                  = 82e0f96a
Eval ID             = 76e4bc60
Name                = 7ffa74671955883712c5efb47258f3dfd5000712-precommand.sqsc-job[0]
Node ID             = ec23c0db
Job ID              = 7ffa74671955883712c5efb47258f3dfd5000712-precommand
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 06/16/17 08:11:19 UTC

Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Unexpected response code: 500 (unknown allocation ID "82e0f96a-5082-1e69-b7e5-ad5b796b4a2b")

The client logs say:

Jun 16 08:11:27 ip-10-0-0-111 nomad[1560]:     2017/06/16 08:11:27.497516 [DEBUG] client: added alloc "82e0f96a-5082-1e69-b7e5-ad5b796b4a2b" to blocked queue for previous allocation "23b1cb99-4ae9-540a-258e-dc246ecc494b"

The alloc is blocked by a previous allocation. Looking at that previous allocation, it also appears pending, even though it is no longer needed due to the job update:

$ NOMAD_ADDR=http://10.0.0.112:4646 bin/nomad alloc-status 23b1cb99-4ae9-540a-258e-dc246ecc494b
ID                  = 23b1cb99
Eval ID             = b44ac335
Name                = 7ffa74671955883712c5efb47258f3dfd5000712-precommand.sqsc-job[0]
Node ID             = ec23c0db
Job ID              = 7ffa74671955883712c5efb47258f3dfd5000712-precommand
Client Status       = pending
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created At          = 06/16/17 08:11:19 UTC

For the blocking allocation, the client logs show:

Jun 16 09:04:01 ip-10-0-0-111 nomad[1560]:     2017/06/16 09:04:01.063105 [INFO] driver.docker: created container 06423f100e449e15456339f1bd4be77b1ad83d573e64641d315ae61d4563c72c
Jun 16 09:04:01 ip-10-0-0-111 nomad[1560]:     2017/06/16 09:04:01.330807 [INFO] driver.docker: started container 06423f100e449e15456339f1bd4be77b1ad83d573e64641d315ae61d4563c72c
Jun 16 09:04:01 ip-10-0-0-111 nomad[1560]:     2017/06/16 09:04:01.343976 [WARN] client: error fetching stats of task 7ffa74671955883712c5efb47258f3dfd5000712-task-service: stats collection hasn't started yet
Jun 16 09:04:02 ip-10-0-0-111 nomad[1560]:     2017/06/16 09:04:02.283352 [DEBUG] client: updated allocations at index 27705 (total 4) (pulled 1) (filtered 3)
Jun 16 09:04:02 ip-10-0-0-111 nomad[1560]:     2017/06/16 09:04:02.283855 [DEBUG] client: allocs: (added 1) (removed 0) (updated 0) (ignore 3)
Jun 16 09:04:02 ip-10-0-0-111 nomad[1560]:     2017/06/16 09:04:02.945577 [DEBUG] driver.docker: error collecting stats from container 06423f100e449e15456339f1bd4be77b1ad83d573e64641d315ae61d4563c72c: io: read/write on closed pipe

Jun 16 08:11:24 ip-10-0-0-111 nomad[1560]:     2017/06/16 08:11:24.445436 [DEBUG] client: alloc "23b1cb99-4ae9-540a-258e-dc246ecc494b" in terminal status, waiting for destroy
Jun 16 08:11:27 ip-10-0-0-111 nomad[1560]:     2017/06/16 08:11:27.497339 [DEBUG] client: updated allocations at index 25572 (total 4) (pulled 1) (filtered 3)
Jun 16 08:11:27 ip-10-0-0-111 nomad[1560]:     2017/06/16 08:11:27.497427 [DEBUG] client: allocs: (added 1) (removed 0) (updated 0) (ignore 3)
Jun 16 08:11:27 ip-10-0-0-111 nomad[1560]:     2017/06/16 08:11:27.497516 [DEBUG] client: added alloc "82e0f96a-5082-1e69-b7e5-ad5b796b4a2b" to blocked queue for previous allocation "23b1cb99-4ae9-540a-2

We have a WARN, and also alloc "23b1cb99-4ae9-540a-258e-dc246ecc494b" in terminal status, waiting for destroy. This alloc is never actually destroyed.
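
As a cross-check, the link between the two allocations can also be inspected over the HTTP API (a rough sketch; it assumes the allocation endpoint exposes the PreviousAllocation field in this 0.5.6 build, and that jq is installed):

$ # show which allocation the blocked alloc is waiting on
$ curl -s http://10.0.0.112:4646/v1/allocation/82e0f96a-5082-1e69-b7e5-ad5b796b4a2b | jq .PreviousAllocation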

The client node status shows this allocation, and also previous allocations still pending:

$ NOMAD_ADDR=http://10.0.0.112:4646 bin/nomad node-status ec23c0db                             
ID      = ec23c0db
Name    = qa-dev-20170613174232-staging-1
Class   = <none>
DC      = staging-sqsc
Drain   = false
Status  = ready
Drivers = docker,exec,raw_exec,rkt
Uptime  = 1h27m27s

Allocated Resources
CPU         Memory       Disk        IOPS
0/2400 MHz  0 B/995 MiB  0 B/35 GiB  0/0

Allocation Resource Utilization
CPU         Memory
0/2400 MHz  0 B/995 MiB

Host Resource Utilization
CPU          Memory           Disk
96/2400 MHz  370 MiB/995 MiB  484 MiB/36 GiB

Allocations
ID        Eval ID   Job ID                                               Task Group  Desired Status  Client Status
82e0f96a  76e4bc60  7ffa74671955883712c5efb47258f3dfd5000712-precommand  sqsc-job    run             pending
23b1cb99  b44ac335  7ffa74671955883712c5efb47258f3dfd5000712-precommand  sqsc-job    stop            pending
cb97cab1  523d9da4  7ffa74671955883712c5efb47258f3dfd5000712-precommand  sqsc-job    stop            pending
903fc513  bfa62b13  7ffa74671955883712c5efb47258f3dfd5000712-service     sqsc-job    run             pending

Details about these allocations:

core@ip-10-0-0-111 ~ $ NOMAD_ADDR=http://10.0.0.112:4646 bin/nomad alloc-status 903fc513
ID                  = 903fc513
Eval ID             = bfa62b13
Name                = 7ffa74671955883712c5efb47258f3dfd5000712-service.sqsc-job[1]
Node ID             = ec23c0db
Job ID              = 7ffa74671955883712c5efb47258f3dfd5000712-service
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 06/16/17 08:11:17 UTC

Task "7ffa74671955883712c5efb47258f3dfd5000712-task-service" is "pending"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
500 MHz  256 MiB  300 MiB  0     web: 10.0.0.111:80

Recent Events:
Time                   Type        Description
06/16/17 09:35:19 UTC  Restarting  Task restarting in 10.131638503s
06/16/17 09:35:19 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
06/16/17 09:35:18 UTC  Started     Task started by client
06/16/17 09:35:17 UTC  Driver      Downloading image 095348363195.dkr.ecr.eu-west-1.amazonaws.com/client-qa-dev-20170613174232-staging-mildred-sqsc-demo:v1
06/16/17 09:35:06 UTC  Restarting  Task restarting in 11.141561596s
06/16/17 09:35:06 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
06/16/17 09:35:04 UTC  Started     Task started by client
06/16/17 09:35:03 UTC  Driver      Downloading image 095348363195.dkr.ecr.eu-west-1.amazonaws.com/client-qa-dev-20170613174232-staging-mildred-sqsc-demo:v1
06/16/17 09:34:52 UTC  Restarting  Task restarting in 11.290574304s
06/16/17 09:34:52 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
core@ip-10-0-0-111 ~ $ NOMAD_ADDR=http://10.0.0.112:4646 bin/nomad alloc-status cb97cab1
ID                  = cb97cab1
Eval ID             = 523d9da4
Name                = 7ffa74671955883712c5efb47258f3dfd5000712-precommand.sqsc-job[0]
Node ID             = ec23c0db
Job ID              = 7ffa74671955883712c5efb47258f3dfd5000712-precommand
Client Status       = pending
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created At          = 06/16/17 08:11:19 UTC
core@ip-10-0-0-111 ~ $ NOMAD_ADDR=http://10.0.0.112:4646 bin/nomad alloc-status 23b1cb99
ID                  = 23b1cb99
Eval ID             = b44ac335
Name                = 7ffa74671955883712c5efb47258f3dfd5000712-precommand.sqsc-job[0]
Node ID             = ec23c0db
Job ID              = 7ffa74671955883712c5efb47258f3dfd5000712-precommand
Client Status       = pending
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created At          = 06/16/17 08:11:19 UTC
core@ip-10-0-0-111 ~ $ NOMAD_ADDR=http://10.0.0.112:4646 bin/nomad alloc-status 82e0f96a
ID                  = 82e0f96a
Eval ID             = 76e4bc60
Name                = 7ffa74671955883712c5efb47258f3dfd5000712-precommand.sqsc-job[0]
Node ID             = ec23c0db
Job ID              = 7ffa74671955883712c5efb47258f3dfd5000712-precommand
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 06/16/17 08:11:19 UTC

Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Unexpected response code: 500 (unknown allocation ID "82e0f96a-5082-1e69-b7e5-ad5b796b4a2b")

Reproduction steps

The environment is complex and not everything is open yet, but I am able to reproduce it at any time.

It seems related to bad timing when jobs start up, such as a job being reposted before the previous version of the job could complete.

Basically, here is my scenario (a rough command-level sketch follows the list):

  • create a cluster of 3 client nodes and associate with the already running nomad server
  • create a job corresponding to a service with count=3
  • create a db migration job for that service (count=1)
  • scale up the number of nomad agents (clients) to 5
  • run the db migration job, but the job is blocking
  • this is supposed to happen next: update the main service job to count=5 (it never actually happened)
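
Roughly, in terms of commands (the job file names here are hypothetical; the real specifications are generated by our tooling):

$ nomad run service.nomad      # hypothetical service job, count = 3
$ nomad run migration.nomad    # hypothetical batch pre-command / db migration job, count = 1
$ # scale the client nodes from 3 to 5, then re-post the migration job
$ nomad run migration.nomad    # this allocation stays pending / blocked
$ # next step would have been: raise count to 5 in service.nomad and run it again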

Nomad Client / Server logs / Job file

Please note that the client logs are in UTC and the server logs are in UTC+2.

See gist: https://gist.github.com/mildred/8d2af00117d0379fe2fe9ce18a365c08

@dadgar
Contributor

dadgar commented Jun 20, 2017

Hey, sorry you ran into this, and thanks for the great write-up! It should be fixed by this commit: 3018ae0

That fix will be part of 0.6.0. In the meantime, you could cherry-pick the commit onto 0.5.6 and build it yourself!
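
For example, something along these lines (a rough sketch, assuming a working Go build environment for the Nomad repository; the exact make target may differ):

$ git clone https://github.com/hashicorp/nomad && cd nomad
$ git checkout v0.5.6
$ git cherry-pick 3018ae0    # the fix mentioned above
$ make dev                   # or a plain go build, depending on your setup; produces a patched nomad binary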

@dadgar dadgar closed this as completed Jun 20, 2017
@capone212
Contributor

Hi @mildred, I have the same issue: #2563
I noticed today that my problem is not fixed; I still occasionally experience the same long-lasting pending state.
Can you please check the provided fix?

@mildred
Contributor Author

mildred commented Jun 27, 2017

Hello,

I tested commit 3018ae0 cherry-picked on top of version 0.5.6, and it seems to fix my problem. I had an automated test case that failed previously and now passes.

Perhaps there are still edge cases where a long delay can happen, I don't know. Are you sure that you used the correct Nomad executable with the patch included?

@mildred
Contributor Author

mildred commented Jun 27, 2017

By the way, I did not bother restarting the Nomad servers with the patched version. Having only the clients run the patched version is enough.

@capone212
Contributor

@mildred thanks for the reply.
Hm... I am going to double-check that, but it seems that my case is different and is not fixed yet.

@mildred
Contributor Author

mildred commented Jun 29, 2017

I have deployed the Nomad version with the fix, and it seems to work better, but I still get reports of blocked allocations under the same circumstances. I have to investigate a bit more and generate debug logs.

@mildred
Contributor Author

mildred commented Jun 29, 2017

New test: I have two allocations for the same job on the same node, and one is probably blocking the other.

The logs say: Jun 29 16:29:11 ip-10-0-0-104 nomad[1499]: 2017/06/29 16:29:11.199986 [DEBUG] client: added alloc "1fc1d473-a513-ae2d-f0e3-609a96f9838e" to blocked queue for previous allocation "efeada5c-e403-3fc8-fa1d-b7cad384e6ed"

The blocking allocation is still pending destruction: Jun 29 16:29:10 ip-10-0-0-104 nomad[1499]: 2017/06/29 16:29:10.999021 [DEBUG] client: alloc "efeada5c-e403-3fc8-fa1d-b7cad384e6ed" in terminal status, waiting for destroy

Allocation status:

core@ip-10-0-0-104 ~ $ NOMAD_ADDR=http://local-dev-gateway.staging.sqsc.squarely.io:35755 bin/nomad alloc-status 1fc1d473
ID                  = 1fc1d473
Eval ID             = 0e58a799
Name                = b84b752c001ac80dc42a9312be8e0d5c373b3731-precommand.sqsc-job[0]
Node ID             = ec2c9067
Job ID              = b84b752c001ac80dc42a9312be8e0d5c373b3731-precommand
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 06/29/17 16:29:09 UTC

Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Unexpected response code: 500 (unknown allocation ID "1fc1d473-a513-ae2d-f0e3-609a96f9838e")
core@ip-10-0-0-104 ~ $ NOMAD_ADDR=http://local-dev-gateway.staging.sqsc.squarely.io:35755 bin/nomad alloc-status efeada5c
ID                  = efeada5c
Eval ID             = af894370
Name                = b84b752c001ac80dc42a9312be8e0d5c373b3731-precommand.sqsc-job[0]
Node ID             = ec2c9067
Job ID              = b84b752c001ac80dc42a9312be8e0d5c373b3731-precommand
Client Status       = pending
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created At          = 06/29/17 16:29:08 UTC

node-status shows:

core@ip-10-0-0-104 ~ $ NOMAD_ADDR=http://local-dev-gateway.staging.sqsc.squarely.io:35755 bin/nomad node-status ec2c9067 
ID      = ec2c9067
Name    = qa-noenv-20170629181501-staging-2
Class   = <none>
DC      = staging-sqsc
Drain   = false
Status  = ready
Drivers = docker,exec,raw_exec,rkt
Uptime  = 24m40s

Allocated Resources
CPU         Memory       Disk        IOPS
0/2400 MHz  0 B/995 MiB  0 B/35 GiB  0/0

Allocation Resource Utilization
CPU         Memory
0/2400 MHz  0 B/995 MiB

Host Resource Utilization
CPU           Memory           Disk
119/2400 MHz  352 MiB/995 MiB  464 MiB/36 GiB

Allocations
ID        Eval ID   Job ID                                               Task Group  Desired Status  Client Status
1fc1d473  0e58a799  b84b752c001ac80dc42a9312be8e0d5c373b3731-precommand  sqsc-job    run             pending
efeada5c  af894370  b84b752c001ac80dc42a9312be8e0d5c373b3731-precommand  sqsc-job    stop            pending
497d181e  bacf217e  b84b752c001ac80dc42a9312be8e0d5c373b3731-service     sqsc-job    run             pending

The Nomad client binary being run is available at https://s3-eu-west-1.amazonaws.com/sqsc-release/nomad_v0.5.6%2B_linux_amd64.zip (it is version 0.5.6 patched with the commit mentioned above). I checked with a digest that it is the version I compiled manually.
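
(The digest check was essentially a checksum comparison, along these lines; the paths are hypothetical:)

$ sha256sum bin/nomad /opt/bin/nomad    # locally built binary vs. the one deployed on the node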

Full client logs in debug: https://gist.github.com/mildred/e625a8ed25cae4eae4a76ab112751ed4

@mildred
Contributor Author

mildred commented Jul 5, 2017

@dadgar could you please look into this? It seems the fix commit does not solve the issue. If that is indeed the case, could the issue be reopened?

Thank you.

@capone212
Contributor

capone212 commented Jul 5, 2017

Hi @mildred, please note the last messages in my issue #2563.

@schmichael provided binaries for testing, and I am going to try to reproduce my issue with them. It seems that our issues are connected, so it would be great if you could try to reproduce your issue with those binaries too.

@mildred
Contributor Author

mildred commented Jul 5, 2017

OK, I'll run the tests with the binary.
