Attempt to reconnect long-running CLI commands in case of network timeout #17320
Not sure if this is related, but found them in the logs:
... which feels untrue, because every time I look at the 'servers' web UI or ...
Hi @josh-m-sharpe 👋 From what I can tell there hasn't been any significant change in this part of the code between 1.3.x and 1.5.6, so if you're seeing an increase in this class of error I suspect that something else may be happening. Unfortunately the CLI was omitting the actual error received, so I opened #17348 to output more information. A retry mechanism for deployment monitoring would definitely be handy, and that is covered in #12062, so I'm going to close this as a duplicate. I recommend 👍 that issue to help us with roadmapping and following it for further updates. Feel free to open a new issue if you detect any further problems regarding unstable leadership. Thank you for the report!
Hey @lgfa29, thanks for the response. I have a bit more to add here, but it's a bit anecdotal. I opened this issue when I encountered issues with ... (To be clear, this is NOT the same thing I reported in #17329, even if I opened all these things at about the same time. I've been doing a lot of Nomad hacking 😄)

At no point did I see any evidence of that 504 error in my Nomad server logs, which makes sense since it was a gateway timeout error. This pointed me to the AWS Application Load Balancer I had deployed in front of my Nomad servers. The ALB had a (default) timeout of 60 seconds. I replaced it with an AWS Network Load Balancer, which has a default timeout of 350 seconds. After I made this change, the issue appears to have gone away.

This does mean my issue is largely resolved. However, it does signal to me that something between 1.3.x and 1.5.6 started taking longer than 60 seconds to respond, which is a heck of a lot of time. Anyways, sorry I don't have any more hard evidence, I just wanted to convey what I figured out. Cheers!
Now that I think about it more, I wish I knew if the ...
Hum... that's interesting. I can't think of any change in this regard, and the Nomad API should be using a keep-alive timeout of 30s to keep the connection open. daa9824 switched the ...
Yes, the CLI reuses the same connection; there's a bit more info here: lines 489 to 496 at commit 087ac3a.
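For anyone less familiar with Go's HTTP stack, the sketch below shows the kind of keep-alive and connection-reuse settings being discussed. It is illustrative only, using a plain `net/http` client; the specific values and fields are assumptions, not a copy of Nomad's actual transport configuration.

```go
// Illustrative only: what a 30s keep-alive and a single reused connection
// look like on a Go HTTP client. Not Nomad's actual transport configuration.
package main

import (
	"net"
	"net/http"
	"time"
)

func newKeepAliveClient() *http.Client {
	transport := &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second, // TCP keep-alive probes on the idle connection
		}).DialContext,
		MaxIdleConnsPerHost: 1,                // successive requests reuse the same connection
		IdleConnTimeout:     90 * time.Second, // drop the pooled connection if unused this long
	}
	return &http.Client{Transport: transport}
}
```

Whether such keep-alives actually reset an intermediate load balancer's idle timer depends on the load balancer, which is why the proxy sitting in front of the servers still matters here.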
Maybe we could try to create a new connection in case of a network timeout? I think I will reword the title for this issue and keep it open for us to further investigate this possibility. Thanks for the extra info!
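To make that idea concrete, here is a rough sketch of a reconnect-on-timeout loop around deployment polling, using the public `github.com/hashicorp/nomad/api` client. This is not the CLI's actual monitoring code; the "rebuild the client on timeout" recovery path and the simplified terminal-state check are assumptions for illustration.

```go
// Sketch: retry a blocking deployment query with a fresh connection after a
// network timeout, instead of aborting the whole monitor.
package main

import (
	"errors"
	"fmt"
	"net"
	"time"

	"github.com/hashicorp/nomad/api"
)

func watchDeployment(deployID string) error {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		return err
	}

	var index uint64
	for {
		// Blocking query: the server holds the request open until the
		// deployment changes or WaitTime elapses. A proxy with a shorter
		// idle timeout can kill this request and surface a 504.
		q := &api.QueryOptions{WaitIndex: index, WaitTime: 2 * time.Minute}
		dep, meta, err := client.Deployments().Info(deployID, q)
		if err != nil {
			// Hypothetical recovery path: on a network timeout, build a new
			// client (and therefore a new connection) and retry the query.
			var nerr net.Error
			if errors.As(err, &nerr) && nerr.Timeout() {
				if client, err = api.NewClient(api.DefaultConfig()); err != nil {
					return err
				}
				continue
			}
			return fmt.Errorf("error fetching deployment: %w", err)
		}

		fmt.Printf("deployment %s: %s (%s)\n", dep.ID, dep.Status, dep.StatusDescription)
		if dep.Status != "running" {
			// Simplified: real deployments also have paused, blocked, etc.
			return nil
		}
		index = meta.LastIndex
	}
}
```

The real CLI monitor does quite a bit more than this; the sketch only shows the reconnect idea itself.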
I forgot to mention in the previous message, but I'm also curious about this 😄
We've recently upgraded from an old 1.0.18 deployment to a much newer (and upgraded) 1.7.2 cluster, and we're also seeing 504 issues now. We're behind two load balancers: an AWS ELB and an HAProxy running within the Nomad cluster that reaches the Nomad APIs. This is what the CLI spews out in our CI/CD pipeline:
I think ELBs have a timeout of 60 seconds, while our HAProxy has a default of 50 seconds. Maybe we should use the HTTP API directly instead of the Nomad CLI in these cases?
I'm also running into this as I was exploring using the Nomad CLI to perform some application deployments via CI. Previously I was monitoring deployment status with Ansible, polling the HTTP API every x seconds, but the Nomad CLI is much more useful in terms of deployment status and visibility.
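As a rough illustration of the "use the HTTP API directly" approach, the sketch below polls the documented `/v1/deployment/:deployment_id` endpoint with short, non-blocking requests, so a 50-60 second load-balancer idle timeout is never reached. The address handling, poll interval, and terminal-state check are placeholder assumptions rather than recommended values.

```go
// Sketch: poll the Nomad HTTP API for deployment status with short requests
// instead of a single long-lived blocking query.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func pollDeployment(nomadAddr, deployID string) error {
	// Each request finishes well under typical ELB/HAProxy idle timeouts.
	client := &http.Client{Timeout: 10 * time.Second}

	for {
		resp, err := client.Get(fmt.Sprintf("%s/v1/deployment/%s", nomadAddr, deployID))
		if err != nil {
			return err
		}

		var dep struct {
			Status            string
			StatusDescription string
		}
		err = json.NewDecoder(resp.Body).Decode(&dep)
		resp.Body.Close()
		if err != nil {
			return err
		}

		fmt.Printf("deployment %s: %s (%s)\n", deployID, dep.Status, dep.StatusDescription)
		if dep.Status != "running" {
			// Simplified terminal check; inspect the full status in practice.
			return nil
		}
		time.Sleep(5 * time.Second) // poll "every x seconds" rather than holding a connection open
	}
}
```

This trades the efficiency of the CLI's blocking queries for robustness against proxy idle timeouts.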
This Feature Request makes "Error fetching deployment" seem like a minor nuisance.
After updating our cluster from 1.3.x to 1.5.6 I see this error every single time I run
nomad run job ...
It seems like a regression at this point.