Freeze during nomad status <alloc> in 0.9.0-beta3 (#5367)
It generally seems like the status is stuck in pending when this happens: […]
I'll also note that I've now tried three virgin 0.9.0-beta3 clusters and all have shown different failure patterns which weren't present in 0.8.3. This error is the third of the three.
@Miserlou , you mention in the first comment that you saw this when upgrading the cluster from 0.8.x to 0.9.0-beta3, and in the second comment that this happens on virgin 0.9.0-beta3 clusters as well. I want to make sure that both of these are the case as I investigate.
@Miserlou , is it possible to get client logs?
Both scenarios were tested and failed. I don't have the results of the upgrade scenario, but it failed for a different reason (docker driver related; I commented on another ticket about that). We have a staging stack that is essentially identical to our prod stack running 0.8.3 (which runs okay), only the staging stack tried 0.9.0-beta3 (which didn't run okay). It actually looked like the logging system may have been affected by the freezing as well, since the logs which I would normally expect to be sent to our CloudWatch Logs streams from Docker weren't there.
The other one I saw: #4934
Actually the one other difference is that the 0.9.0-beta3 deployment uses the new […]
Although I primarily saw this behavior on jobs which didn't use the new […]
Thank you, @Miserlou.
Hey @Miserlou , I didn't forget about this. I spent some time trying to see whether this was possibly related to the docker image pull issue. I will look into this a bit more and get back to you if there's any info that I need. Thanks for the report.
Great, thanks. I'm not sure if it is or not. Our project is open source, you could try spinning up our stack to try to repro.
@Miserlou Hey, I'm trying to repro this and am also coming up fairly short, but there are a couple of things that would be useful to know: […]
I've tried running several hundred jobs similar to your SURVEYOR_DISPATCHER (e.g. the same, but running a plain old debian image with a sleep) on a single node, and although I got a tiny bit of a slowdown in some cases, it was nothing close to what you're experiencing here. The missing logs you mention are also interesting, and make me wonder if there's a docker daemon or networking issue that's causing something to run unexpectedly slowly, which we should handle, but those cases are hard to find. Although if you see this consistently across different hosts then it may be a red herring.
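A load test like the one described above (many small batch jobs running a plain debian image with a sleep) can be sketched as a Nomad job file. This is a generic sketch, not the actual SURVEYOR_DISPATCHER job; the job name, count, image tag, and resource figures are all illustrative:

```hcl
# Illustrative load-test job: many small docker tasks that just sleep.
# Names and sizes are placeholders, not taken from the thread.
job "sleep-load" {
  datacenters = ["dc1"]
  type        = "batch"

  group "sleepers" {
    count = 200

    task "sleep" {
      driver = "docker"

      config {
        image   = "debian:stretch"
        command = "sleep"
        args    = ["3600"]
      }

      resources {
        cpu    = 50  # MHz
        memory = 64  # MB
      }
    }
  }
}
```

Running something like this on a single node approximates the "hundreds of allocations per node" situation the reporter describes.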
That test had three dedicated servers and 10 nodes, each with 976GiB RAM, all the same Nomad software, and I'm 99% sure it's the same Docker version that comes with Ubuntu. Allocations per node vary but can be in the hundreds. We have downgraded back to 0.8.3 so I can't give you node-status anymore. Docker logging is set to go to AWS CloudWatch Logs, ex: […]
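The reporter's actual CloudWatch logging configuration was lost from the thread. For illustration only, Docker's awslogs driver is typically enabled in /etc/docker/daemon.json along these lines (the region and log group names are placeholders, not from the report):

```json
{
  "log-driver": "awslogs",
  "log-opts": {
    "awslogs-region": "us-east-1",
    "awslogs-group": "nomad-staging",
    "awslogs-create-group": "true"
  }
}
```

With this daemon-level default, every container's stdout/stderr is shipped to CloudWatch Logs, which is why missing streams there were taken as a symptom of the freeze.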
Thanks for all of the testing and information @Miserlou ! We think #5420 fixed it. RC should be coming out shortly, but I created a one-off amd64 Linux build if you want to test it! nomad-b3bedeb33-linux_amd64.gz Going to close, but please reopen if the issue persists.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Upgrading from 0.8.* to 0.9.0-beta3 causes hanging during nomad status <alloc>, ex (from a client node): […] This seems to be new behavior.
It also seems like the overall state of our system has gone from "mostly working" (0.8.3) to "mostly not working" (0.9.0-beta3).
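One way to confirm a hang like the one reported is to wrap the CLI call in a time budget and classify the outcome. This is a minimal sketch, not part of the issue; the `probe` helper, the alloc ID, and the 10-second budget are all assumptions for illustration:

```python
import subprocess

def probe(cmd, timeout_s=10.0):
    """Run a CLI command with a time budget; classify completion vs. hang.

    Returns "ok" on a zero exit, "error" on a non-zero exit, and "hung"
    if the command does not finish within timeout_s seconds.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
        return "ok" if result.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        return "hung"

# Hypothetical usage against a real cluster (alloc ID is a placeholder):
# probe(["nomad", "status", "c0ffee11"])
```

A result of "hung" across several allocations and client nodes would reproduce the behavior described in this report mechanically rather than by eyeballing a stuck terminal.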