
Freeze During nomad status <alloc> In 0.9.0-beta3 #5367

Closed
Miserlou opened this issue Feb 27, 2019 · 14 comments

Comments

@Miserlou

Upgrading from 0.8.* to 0.9.0-beta3 causes hanging during nomad status <alloc>, e.g. (from a client node):

$ nomad status 908bbc3b
ID                  = 908bbc3b
Eval ID             = c3b00d38
Name                = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93.jobs[0]
Node ID             = a663bfa6
Job ID              = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2m27s ago
Modified            = 2m27s ago

^C
$ nomad status 908bbc3b
ID                  = 908bbc3b
Eval ID             = c3b00d38
Name                = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93.jobs[0]
Node ID             = a663bfa6
Job ID              = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 5m42s ago
Modified            = 5m42s ago

This seems to be new behavior.

It also seems like the overall state of our system has gone from "mostly working" (0.8.3) to "mostly not working" (0.9.0-beta3).

@Miserlou
Author

It generally seems like the status is stuck in pending when this happens:

ubuntu@ip-10-0-0-191:~$ nomad status SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93
ID            = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93
Name          = SURVEYOR_DISPATCHER/dispatch-1551294621-50f1df93
Submit Date   = 2019-02-27T19:10:21Z
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
jobs        0       1         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
908bbc3b  a663bfa6  jobs        0        run      pending  12m50s ago  12m50s ago

I'll also note that I've now tried three virgin 0.9.0-beta3 clusters, and all have shown different failure patterns which weren't present in 0.8.3. This error is the third of the three.

@cgbaker
Contributor

cgbaker commented Feb 27, 2019

@Miserlou , you mention in the first comment that you saw this when upgrading the cluster from 0.8.x to 0.9.0-beta3, and in the second comment that this happens on virgin 0.9.0-beta3 clusters as well. I want to make sure that both of these are the case as I investigate.

@cgbaker
Contributor

cgbaker commented Feb 27, 2019

@Miserlou , is it possible to get client logs?

@Miserlou
Author

Both scenarios were tested and both failed. I don't have the results of the upgrade scenario, but it failed for a different reason (Docker driver related; I commented on another ticket about that). We have a staging stack that is essentially identical to our prod stack running 0.8.3 (which runs okay); the only difference is that the staging stack tried 0.9.0-beta3 (which didn't run okay).

It actually looked like the logging system may have been affected by the freezing as well, since the logs that I would normally expect Docker to send to our CloudWatch Logs streams weren't there.

@Miserlou
Author

The other one I saw: #4934

@Miserlou
Author

Actually, the one other difference is that the 0.9.0-beta3 deployment uses the new tmpfs features (which is the reason we need to upgrade at all). I don't know whether this is related, but since this seems to be Docker driver related, it could be a possibility.

@Miserlou
Author

mounts = [
  {
    type     = "tmpfs"
    target   = "/home/user/data_store_tmpfs"
    readonly = false
    tmpfs_options {
      size = 17179869184 # size in bytes (17GB)
    }
  }
]

That said, I primarily saw this behavior on jobs which didn't use the new tmpfs features, so I don't know whether that makes sense as a source of the error.
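
For context, here's roughly how that mounts block sits inside the docker task config. The job, group, and task names and the image below are placeholders, not our actual spec:

job "tmpfs-example" { # placeholder job name
  datacenters = ["dc1"]
  type        = "batch"

  group "jobs" {
    task "worker" { # placeholder task name
      driver = "docker"

      config {
        image = "debian:stretch" # placeholder image

        mounts = [
          {
            type     = "tmpfs"
            target   = "/home/user/data_store_tmpfs"
            readonly = false
            tmpfs_options {
              size = 17179869184 # size in bytes (17GB)
            }
          }
        ]
      }

      resources {
        cpu    = 500
        memory = 1024
      }
    }
  }
}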

@cgbaker
Contributor

cgbaker commented Feb 27, 2019

Thank you, @Miserlou.

@cgbaker
Contributor

cgbaker commented Mar 5, 2019

Hey @Miserlou , I didn't forget about this. I spent some time trying to see whether this was possibly related to the docker image pull issue. I will look into this a bit more and get back to you if there's any info that I need. Thanks for the report.

@Miserlou
Author

Miserlou commented Mar 5, 2019

Great, thanks. I'm not sure if it is or not.

Our project is open source; you could try spinning up our stack to try to repro.

@endocrimes
Contributor

@Miserlou Hey, I'm trying to repro this and am also coming up fairly short, but there are a couple of things that would be useful to know:

  1. What's the relationship of the node you're running nomad status on to the client that is running the allocation? (e.g. the same client, a different client, or a totally unrelated node)
  2. Roughly how many allocations are you running on the same node?
  3. What is the output of nomad node-status $NODEID for a node that experiences this?

I've tried running several hundred jobs similar to your SURVEYOR_DISPATCHER (e.g. the same, but running a plain old debian image with a sleep) on a single node, and although I got a tiny bit of a slowdown in some cases, it was nothing close to what you're experiencing here.
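
For reference, the repro jobs looked roughly like this (a sketch, not the exact spec; names and resource sizes are made up):

job "surveyor-repro" { # made-up job name
  datacenters = ["dc1"]
  type        = "batch"

  group "jobs" {
    task "sleep" {
      driver = "docker"

      config {
        image   = "debian:stretch"
        command = "sleep"
        args    = ["300"]
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}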

The missing logs you mention are also interesting here, and make me wonder whether there's a Docker daemon or networking issue that's causing something to run unexpectedly slowly. That's a case we should handle, but those issues are hard to find. Although if you see this consistently across different hosts, it may be a red herring.

@Miserlou
Author

That test had three dedicated servers and 10 nodes, each with 976 GiB RAM, all running the same Nomad build, and I'm 99% sure it's the same Docker version that comes with Ubuntu. Allocations per node vary but can be in the hundreds. We have downgraded back to 0.8.3, so I can't give you node-status anymore.

Docker logging is set to go to AWS CloudWatch Logs, e.g.:

logging {
  type = "awslogs"
  config {
    awslogs-region = "us-east-1"
    awslogs-group  = "my-log-group-dev"
    awslogs-stream = "my-log-stream-dev"
  }
}
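
That logging block lives inside the docker driver config for each task, roughly like this (the task name and image here are placeholders):

task "worker" { # placeholder task name
  driver = "docker"

  config {
    image = "example/app:latest" # placeholder image

    logging {
      type = "awslogs"
      config {
        awslogs-region = "us-east-1"
        awslogs-group  = "my-log-group-dev"
        awslogs-stream = "my-log-stream-dev"
      }
    }
  }
}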

@schmichael
Member

Thanks for all of the testing and information, @Miserlou! We think #5420 fixed it. An RC should be coming out shortly, but I created a one-off amd64 Linux build if you want to test it!

nomad-b3bedeb33-linux_amd64.gz

Going to close, but please reopen if the issue persists.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 25, 2022