This repository has been archived by the owner on Feb 26, 2023. It is now read-only.

EOF/connection reset errors in UI w/ Nomad 0.10.4 #557

Open
thisisjaid opened this issue Mar 23, 2020 · 17 comments

Comments

@thisisjaid

We've just upgraded our staging environment to Nomad 0.10.4, and since the upgrade we have been seeing seemingly random errors in HashiUI on the Nomad side of the app. The full error shown in the UI is:

Get http://172.17.0.1:4646/v1/job/some-job-name?namespace=default&region=global&stale=&wait=120000ms: EOF

Network communication across the board seems OK. We've tested manual requests to the API from both inside and outside the HashiUI container, and we've also tried bypassing the nginx load balancer that upstreams to the HashiUI app, with the same result.

Nomad 0.10.4
Consul 1.7.2
docker 17.04.0-ce
hashiui image - jippi/hashi-ui:pr-556

Anyone else seeing anything similar with 0.10.4?

@MattPumphrey

We are seeing this with Nomad 0.10.3 and 0.10.5

@bizonek27

Same problem on 0.11.0 and 0.11.1

@thisisjaid
Author

We've reverted to using the built-in UIs because of this issue. Despite our best efforts we've been unable to work out where the problem lies, and we couldn't justify investing more time.

@jippi
Owner

jippi commented Apr 29, 2020

It's a bug in the Nomad API SDK - I believe they might have fixed it recently, let me check

@jippi
Owner

jippi commented Apr 29, 2020

I believe this is the fix hashicorp/nomad#5970

@thisisjaid
Author

@jippi that seems to be an older bug though and the calls we are seeing returning EOF don't seem to have anything to do with the GC/GcAlloc endpoint, unless there's some underlying logic I am missing?

@jippi
Owner

jippi commented Apr 29, 2020

I've pushed hashi-ui:pr-566 up if someone wants to take it for a spin - I can't reproduce it locally or in our prod environments

@thisisjaid
Author

@jippi I've just deployed pr-566 but can see no difference. My initial suspicion was something network-related terminating connections prematurely, but running Hashi-UI both as a Docker container and as a binary directly on the system, and taking any load balancing out of the equation, did not solve the problem.

There is another clue, at least in our case: this seems to start happening after a short while of scrolling up and down on the Services screen, for example. It then errors for a while, starts working again for a bit, and so on. Some form of rate limiting was my initial instinct, but I couldn't find any mechanism that could be causing that behavior. So I'm still in the dark as to what the actual cause may be.

@melkorm

melkorm commented May 11, 2020

Hey 👋
After some investigation and testing with @bizonek27, we think we have found the cause of the EOFs.

This issue affects all Nomad versions from 0.10.3 onward (https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#0103-january-29-2020), where hashicorp/nomad#7002 was introduced.

Findings

We started by looking at what has changed in recent Nomad releases. The earliest version reported with this issue in this thread is 0.10.3 (thanks @MattPumphrey), and its changelog contains this entry:
agent: Added unauthenticated connection timeouts and limits to prevent resource exhaustion. CVE-2020-7218 [GH-7002]
We then checked the pull request with this change, https://github.com/hashicorp/nomad/pull/7022/files, and as you can see, connection limits and timeouts are tested against io.EOF.

Unfortunately this is a breaking change on Nomad's side, as the default limits are set too low:

limits {
  https_handshake_timeout   = "5s"
  http_max_conns_per_client = 100
  rpc_handshake_timeout     = "5s"
  rpc_max_conns_per_client  = 100
}

We think only some people see EOF errors because it depends on the number of jobs they run. We run quite a lot of them, which is why the errors appeared quickly for us. We also tested with a smaller number of jobs: with the default limits there were no errors, but lowering the limits to 50 triggered EOF errors.
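To make the mechanism concrete, here is a minimal simulation of a per-client connection cap. It is not Nomad's implementation: `serve` and `countRejected` are hypothetical names, and the behavior of closing excess connections immediately is an assumption based on the findings above. It only shows why a client that opens more connections than the cap allows sees EOF on the extra ones.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"time"
)

// serve accepts TCP connections and allows at most maxPerClient connections
// per remote IP. Connections over the cap are closed right away.
// Simplification: accepted connections are held until the listener closes,
// so the per-client count never goes down.
func serve(ln net.Listener, maxPerClient int) {
	open := map[string]int{} // only touched by this goroutine
	var held []net.Conn
	for {
		conn, err := ln.Accept()
		if err != nil {
			// Listener closed: release everything we were holding.
			for _, c := range held {
				c.Close()
			}
			return
		}
		host, _, _ := net.SplitHostPort(conn.RemoteAddr().String())
		if open[host] >= maxPerClient {
			conn.Close() // the client's next read returns io.EOF
			continue
		}
		open[host]++
		held = append(held, conn)
		conn.Write([]byte("+")) // one byte so accepted clients can tell they are in
	}
}

// countRejected opens `attempts` connections from one client against a
// server capped at maxPerClient and reports how many of them hit EOF.
func countRejected(maxPerClient, attempts int) int {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go serve(ln, maxPerClient)

	var conns []net.Conn
	for i := 0; i < attempts; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			panic(err)
		}
		conns = append(conns, c)
	}

	rejected := 0
	buf := make([]byte, 1)
	for _, c := range conns {
		c.SetReadDeadline(time.Now().Add(2 * time.Second))
		if _, err := c.Read(buf); errors.Is(err, io.EOF) {
			rejected++
		}
	}
	for _, c := range conns {
		c.Close()
	}
	return rejected
}

func main() {
	fmt.Println("connections rejected with EOF:", countRejected(5, 8))
}
```

In this sketch the number of rejected connections is simply the number of attempts above the cap, which matches the observation that clusters with more jobs (and therefore more concurrent watches from the UI) hit the errors sooner.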

Fix

The easiest fix for now is to raise the Nomad agent's limits, or set them to 0 (no limits). Be advised that this could expose the agent to DoS, as reported in hashicorp/nomad#7002.

For hashi-ui itself, we could make connection limits configurable, or reuse connections better so they are not exhausted so quickly.
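As a rough idea of what a client-side cap could look like in Go, the sketch below builds an `http.Client` whose transport never opens more than a fixed number of connections per host and reuses idle ones. This is not hashi-ui's code, and whether hashi-ui or the Nomad API SDK can be wired to such a client is not confirmed in this thread. `newCappedClient` and `peakConnections` are hypothetical helpers; the transport fields are standard `net/http` options.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// newCappedClient returns an HTTP client that keeps at most maxConns
// connections per host. Extra requests wait for a free connection
// instead of dialing a new one.
func newCappedClient(maxConns int) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			MaxConnsPerHost:     maxConns, // hard cap: dialing + active + idle
			MaxIdleConnsPerHost: maxConns, // keep them around for reuse
			IdleConnTimeout:     90 * time.Second,
		},
	}
}

// peakConnections fires `requests` concurrent GETs through client at a local
// test server and reports the highest number of connections the server had
// open at the same time.
func peakConnections(client *http.Client, requests int) int {
	var mu sync.Mutex
	open, peak := 0, 0

	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(20 * time.Millisecond) // stand-in for a slow or blocking query
		fmt.Fprint(w, "ok")
	}))
	srv.Config.ConnState = func(c net.Conn, state http.ConnState) {
		mu.Lock()
		defer mu.Unlock()
		switch state {
		case http.StateNew:
			open++
			if open > peak {
				peak = open
			}
		case http.StateClosed, http.StateHijacked:
			open--
		}
	}
	srv.Start()
	defer srv.Close()

	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := client.Get(srv.URL)
			if err != nil {
				return
			}
			io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
			resp.Body.Close()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak connections with cap 5:", peakConnections(newCappedClient(5), 50))
}
```

The trade-off is that requests beyond the cap queue on the client side. With long-lived blocking queries such as the `wait=120000ms` call in the original report, each in-flight query holds a connection for the duration of the wait, so a cap that is too small would delay UI updates rather than produce EOF errors.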

@thisisjaid
Author

Hah that's some nice sleuthing @melkorm. Thank you! We've got a pretty large number of jobs so I'm guessing that's why we saw it straight off.

Am I being incredibly dumb, or is this completely lacking from the Nomad documentation? We went over it with a fine-tooth comb to make sure we weren't missing any limit settings, and I've just done it again now and still can't find any reference to these configuration parameters.

@melkorm

melkorm commented May 11, 2020

@thisisjaid https://www.nomadproject.io/docs/configuration/#limits it's in agent's configuration docs

Unfortunately Nomad absolutely fails at versioning their docs, and finding out which features landed when ends with going through the changelog :/

Once I crashed a whole cluster by adding configuration for JSON-format logs 🙈, as we were running a version older than the one that introduced this option.

@thisisjaid
Author

Bah, the one damn page I didn't go through, as I assumed it was just generic information based on the misleading header ("Overview"). Agreed on doc versioning; I've had trouble with that as well, with the auto_revert job spec option if I remember correctly. I've also recently pointed out some other issues with the telemetry config documentation. Nomad docs could generally use some work.

In any case, thanks a bunch for tracking this down! This can probably be closed now.

@alievrouw

Has anyone experimented with increasing the HTTP and RPC limits without setting them to 0? I suspect it's very environment dependent (number of jobs running, potentially the resources allocated to hashi-ui, etc.). I have been experiencing this in our prod environment since upgrading to 10.4, and today I deployed new limits of 500 each for HTTP and RPC. So far the errors have stopped. I'm curious whether this will be a fix, or just a band-aid that keeps the errors from popping up as quickly. I'm hoping it will smooth things over until the number of jobs we have running increases significantly. Would appreciate anyone's thoughts! And great find @melkorm !!

@melkorm

melkorm commented Jul 25, 2020

We still keep this at 0 as we are in control of what talks to our nomad agents. I think that the best solution to this problem is, as you said @alievrouw , to set those limits according to your environment and monitor the situation.

I think we can close this issue. Perhaps we could add something to the README about it, in case anyone runs into it in the future.

@mnuic

mnuic commented Oct 12, 2020

We have raised the Nomad limits, but with too many services this still happens. The biggest problem comes during a production outage: Nomad is rescheduling many services to new hosts while our team is troubleshooting, and the result is so many open connections to the Nomad servers that Nomad itself becomes unstable.

Is there a way to reuse hashi-ui connections to Nomad, so that there aren't so many? We could switch to the official Nomad UI, but it lacks a lot of features.

@ekbfh

ekbfh commented Nov 27, 2020

I have the same problem, but I don't run Nomad at all, only Consul.

I increased http_max_conns_per_client in the Consul config to 1000 and the EOF errors seem to have gone away.

@rlanyi

rlanyi commented Mar 27, 2021

I still see this issue with nomad 1.0.3 and consul 1.9.3


9 participants