
Freeze During nomad status <alloc> In 0.9.0-beta3 #5367

Closed
Miserlou opened this issue Feb 27, 2019 · 14 comments

Comments

@Miserlou

Upgrading from 0.8.* to 0.9.0-beta3 causes hanging during nomad status <alloc>, e.g. (from a client node):

$ nomad status 908bbc3b
ID                  = 908bbc3b
Eval ID             = c3b00d38
Name                = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93.jobs[0]
Node ID             = a663bfa6
Job ID              = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2m27s ago
Modified            = 2m27s ago

^C
$ nomad status 908bbc3b
ID                  = 908bbc3b
Eval ID             = c3b00d38
Name                = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93.jobs[0]
Node ID             = a663bfa6
Job ID              = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 5m42s ago
Modified            = 5m42s ago

This seems to be new behavior.

It also seems like the overall state of our system has gone from "mostly working" (0.8.3) to "mostly not working" (0.9.0-beta3).

@Miserlou
Author

It generally seems like the status is stuck in pending when this happens:

ubuntu@ip-10-0-0-191:~$ nomad status SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93
ID            = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93
Name          = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93
Submit Date   = 2019-02-27T19:10:21Z
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
jobs        0       1         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
908bbc3b  a663bfa6  jobs        0        run      pending  12m50s ago  12m50s ago

I'll also note that I've now tried three virgin 0.9.0-beta3 clusters, and all have shown different failure patterns which weren't present in 0.8.3. This error is the third of the three.

@cgbaker
Contributor

cgbaker commented Feb 27, 2019

@Miserlou , you mention in the first comment that you saw this when upgrading the cluster from 0.8.x to 0.9.0-beta3, and in the second comment that this happens on virgin 0.9.0-beta3 clusters as well. I want to make sure that both of these are the case as I investigate.

@cgbaker
Contributor

cgbaker commented Feb 27, 2019

@Miserlou , is it possible to get client logs?

@Miserlou
Author

Both scenarios were tested and both failed. I don't have the results of the upgrade scenario, but it failed for a different reason (Docker driver related; I commented on another ticket about that). We have a staging stack that is essentially identical to our prod stack running 0.8.3 (which runs okay); the only difference is that the staging stack tried 0.9.0-beta3 (which didn't run okay).

It actually looked like the logging system may have been affected by the freezing as well, since the logs that I would normally expect Docker to send to our CloudWatch Logs streams weren't there.

@Miserlou
Author

The other one I saw: #4934

@Miserlou
Author

Actually, the one other difference is that the 0.9.0-beta3 deployment uses the new tmpfs features (which is the reason we need to upgrade at all). I don't know whether this is related, but since this seems to be Docker driver related, it could be a possibility.

@Miserlou
Author

mounts = [
  {
    type     = "tmpfs"
    target   = "/home/user/data_store_tmpfs"
    readonly = false
    tmpfs_options {
      size = 17179869184 # size in bytes (17GB)
    }
  }
]

That said, I primarily saw this behavior on jobs which didn't use the new tmpfs features, so I don't know whether that makes sense as a source of the error.
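
For context, here's roughly how that mounts block sits inside the docker task config. The job, group, and task names and the image below are placeholders, not our actual spec:

job "tmpfs-example" { # placeholder job name
  datacenters = ["dc1"]
  type        = "batch"

  group "jobs" {
    task "worker" { # placeholder task name
      driver = "docker"

      config {
        image = "debian:stretch" # placeholder image

        mounts = [
          {
            type     = "tmpfs"
            target   = "/home/user/data_store_tmpfs"
            readonly = false
            tmpfs_options {
              size = 17179869184 # size in bytes (17GB)
            }
          }
        ]
      }

      resources {
        cpu    = 500
        memory = 1024
      }
    }
  }
}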

@cgbaker
Contributor

cgbaker commented Feb 27, 2019

Thank you, @Miserlou.

@cgbaker
Contributor

cgbaker commented Mar 5, 2019

Hey @Miserlou , I didn't forget about this. I spent some time trying to see whether this was possibly related to the docker image pull issue. I will look into this a bit more and get back to you if there's any info that I need. Thanks for the report.

@Miserlou
Author

Miserlou commented Mar 5, 2019

Great, thanks. I'm not sure if it is or not.

Our project is open source; you could try spinning up our stack to try to repro.

@endocrimes
Contributor

@Miserlou Hey, I'm trying to repro this and am also coming up fairly short, but there are a couple of things that would be useful to know:

  1. What's the relationship of the node you're running nomad status on to the client that is running the allocation? (e.g. the same client, a different client, or a totally unrelated node)
  2. Roughly how many allocations are you running on the same node?
  3. What is the output of nomad node-status $NODEID for a node that experiences this?

I've tried running several hundred jobs similar to your SURVEYOR_DISPATCHER (e.g. the same, but running a plain old debian image with a sleep) on a single node, and although I got a tiny bit of a slowdown in some cases, it was nothing close to what you're experiencing here.
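
For reference, the repro jobs looked roughly like this (a sketch, not the exact spec; names and resource sizes are made up):

job "surveyor-repro" { # made-up job name
  datacenters = ["dc1"]
  type        = "batch"

  group "jobs" {
    task "sleep" {
      driver = "docker"

      config {
        image   = "debian:stretch"
        command = "sleep"
        args    = ["300"]
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}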

The missing logs you mention are also interesting here, and make me wonder whether there's a Docker daemon or networking issue that's causing something to run unexpectedly slowly. That's a case we should handle, but those issues are hard to find. Although if you see this consistently across different hosts, it may be a red herring.

@Miserlou
Author

That test had three dedicated servers and 10 nodes, each with 976 GiB RAM, all running the same Nomad build, and I'm 99% sure it's the same Docker version that comes with Ubuntu. Allocations per node vary but can be in the hundreds. We have downgraded back to 0.8.3, so I can't give you node-status anymore.

Docker logging is set to go to AWS CloudWatch Logs, e.g.:

logging {
  type = "awslogs"
  config {
    awslogs-region = "us-east-1"
    awslogs-group  = "my-log-group-dev"
    awslogs-stream = "my-log-stream-dev"
  }
}
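
That logging block lives inside the docker driver config for each task, roughly like this (the task name and image here are placeholders):

task "worker" { # placeholder task name
  driver = "docker"

  config {
    image = "example/app:latest" # placeholder image

    logging {
      type = "awslogs"
      config {
        awslogs-region = "us-east-1"
        awslogs-group  = "my-log-group-dev"
        awslogs-stream = "my-log-stream-dev"
      }
    }
  }
}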

@schmichael
Member

Thanks for all of the testing and information, @Miserlou! We think #5420 fixed it. An RC should be coming out shortly, but I created a one-off amd64 Linux build if you want to test it!

nomad-b3bedeb33-linux_amd64.gz

Going to close, but please reopen if the issue persists.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 25, 2022