Attempt to reconnect long-running CLI commands in case of network timeout #17320

Open
josh-m-sharpe opened this issue May 25, 2023 · 8 comments
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/cli, type/enhancement

Comments

@josh-m-sharpe

This Feature Request makes "Error fetching deployment" seem like a minor nuisance.

After updating our cluster from 1.3.x to 1.5.6, I see this error every single time I run nomad run job ..., which seems like a regression at this point.

@josh-m-sharpe
Author

josh-m-sharpe commented May 25, 2023

Not sure if this is related, but I found these in the logs:

May 25 14:22:26 ip-10-20-21-164.us-west-2.compute.internal nomad[2608]: 2023-05-25T14:22:26.701Z [ERROR] worker: failed to dequeue evaluation: worker_id=1fe7b569-f86f-8524-f4cd-ee111b9e87b6 error="rpc error: No cluster leader"
May 25 14:22:26 ip-10-20-21-164.us-west-2.compute.internal nomad[2608]: 2023-05-25T14:22:26.701Z [ERROR] worker: failed to dequeue evaluation: worker_id=66802c36-9b61-59dc-19ec-cfd3e1c49a01 error="rpc error: No cluster leader"

which feels untrue, because every time I look at the 'servers' web UI or nomad server members I see that a leader has been selected.

@lgfa29
Contributor

lgfa29 commented May 29, 2023

Hi @josh-m-sharpe 👋

From what I can tell there hasn't been any significant change in this part of the code between 1.3.x and 1.5.6, so if you're seeing an increase in this class of error I suspect that something else may be happening.

Unfortunately the CLI was omitting the actual error received, so I opened #17348 to output more information.

The No cluster leader you reported may be related to flappy leadership, and if you have a deployment being monitored while leadership changes the Error fetching deployment is expected to happen. This page details some metrics you may want to look into to determine any leadership problems.

A retry mechanism for deployment monitoring would definitely be handy, and that is covered in #12062, so I'm going to close this as a duplicate. I recommend giving that issue a 👍 to help us with roadmapping, and following it for further updates.

Feel free to open a new issue if you detect any further problem regarding unstable leadership.

Thank you for the report!

@josh-m-sharpe
Author

Hey @lgfa29 thanks for the response. Have a bit more to add here, but it's a bit anecdotal.

I opened this issue when I encountered issues with nomad run job, but I was also fiddling with restart -reschedule for other use cases, and occasionally I was seeing those executions fail with a 504 Gateway Timeout error. I don't have a screenshot or output. I'd run restart -reschedule, it would run for a bit, then die and output 5-6 lines of error messages showing that response code.

( To be clear this is NOT the same thing I reported in #17329 - even if I opened all these things near about the same time. I've been doin a lot of nomad hacking 😄 )

At no point did I see any evidence of that 504 error in my nomad server logs - which makes sense as it was a gateway timeout error. This pointed me to the AWS Application Load Balancer I had deployed in front of my nomad servers. The ALB had a (default) timeout of 60 seconds.

I replaced that with an AWS Network load balancer which has a default timeout of 350 seconds. After I made this change this issue appears to have gone away.

This does mean my issue is largely resolved. However, it does signal to me that something between 1.3.x and 1.5.6 started taking longer than 60 seconds to respond - which is a heck of a lot of time.

Anyways, sorry I don't have any more hard evidence, just wanted to convey what I figured out. Cheers!

@josh-m-sharpe
Author

Now that I think about it more, I wish I knew if the restart -reschedule died right around 60 seconds. I want to say it didn't take that long but maybe it did. Is it possible the CLI is/was opening a connection and holding it open while it polls?

@lgfa29
Contributor

lgfa29 commented May 29, 2023

> This does mean my issue is largely resolved. However, it does signal to me that something between 1.3.x and 1.5.6 started taking longer than 60 seconds to respond - which is a heck of a lot of time.

Hum...that's interesting, I can't think of any change in this regard, and the Nomad API should be using a keep-alive timeout of 30s to keep the connection open.

daa9824 switched the api client (which the Nomad CLI uses) to use pooled connections, but I think this was also the case in 1.3.x.

> Is it possible the CLI is/was opening a connection and holding it open while it polls?

Yes, the CLI reuses the same connection, there's a bit more info here:

nomad/api/api.go, lines 489 to 496 in 087ac3a:

// Close closes the client's idle keep-alived connections. The default
// client configuration uses keep-alive to maintain connections and
// you should instantiate a single Client and reuse it for all
// requests from the same host. Connections will be closed
// automatically once the client is garbage collected. If you are
// creating multiple clients on the same host (for example, for
// testing), it may be useful to call Close() to avoid hitting
// connection limits.

Maybe we could try to create a new connection in case of a network timeout?

I think I will reword the title for this issue and keep it open for us to further investigate this possibility, thanks for the extra info!

lgfa29 reopened this on May 29, 2023
lgfa29 changed the title from '"Error fetching deployment" no longer a minor issue' to 'Attempt to reconnect long-running CLI commands in case of network timeout' on May 29, 2023
@lgfa29
Contributor

lgfa29 commented May 29, 2023

> I've been doin a lot of nomad hacking

I forgot to mention in the previous message, but I'm also curious about this 😄

lgfa29 added the stage/accepted label (Confirmed, and intend to work on. No timeline commitment though.) and removed the stage/duplicate label on May 30, 2023
@kaspergrubbe

We've recently upgraded from an old 1.0.18 deployment to a much newer (and upgraded) 1.7.2 cluster, and we're now seeing 504 issues too.

We're behind two load balancers: an AWS ELB and an HAProxy instance running within the Nomad cluster that reaches the Nomad APIs.

This is what the CLI spews out in our CI/CD pipeline:

2024-01-03T13:14:08+01:00
ID          = ******
Job ID      = ourapp-production-web
Job Version = ****
Status      = running
Description = Deployment is running pending automatic promotion
Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
puma        false     10       4         4       0        0          2024-01-03T12:29:08Z
==> 2024-01-03T13:14:58+01:00: Error fetching deployment: Unexpected response code: 504 (<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>)

I think ELBs have a timeout of 60 seconds, while our HAProxy has a default of 50 seconds. Maybe we should use the HTTP API directly instead of the Nomad CLI in these cases?

@Blefish

Blefish commented Jan 3, 2024

I'm also running into this as I was exploring using the Nomad CLI to perform some application deployments via CI.

Previously I was monitoring deployment status using Ansible and accessing HTTP API every x seconds, but Nomad CLI is much more useful in terms of deployment status and visibility.

Projects
Status: Needs Roadmapping
Development

No branches or pull requests

4 participants