EOF/connection reset errors in UI w/ Nomad 0.10.4 #557

thisisjaid · 2020-03-23T17:39:25Z

We've just upgraded our staging environment to Nomad 0.10.4 and post-upgrade we have started seeing seemingly random errors in HashiUI on the Nomad UI side of the app. The full error is received in the UI is:

Get http://172.17.0.1:4646/v1/job/some-job-name?namespace=default&region=global&stale=&wait=120000ms: EOF

Network communication across the board seems ok, we've tested manual requests to the API from both within the HashiUI container and outside, we've also tested bypassing the nginx load-balancer that upstreams to the HashiUI app and we get the same result.

Nomad 0.10.4
Consul 1.7.2
docker 17.04.0-ce
hashiui image - jippi/hashi-ui:pr-556

Anyone else seeing anything similar with 0.10.4?

The text was updated successfully, but these errors were encountered:

MattPumphrey · 2020-04-03T15:39:28Z

We are seeing this with Nomad 0.10.3 and 0.10.5

bizonek27 · 2020-04-29T07:09:31Z

Same problem on 0.11.0 and 0.11.1

thisisjaid · 2020-04-29T13:05:07Z

We've reverted to using the built-in UIs because of this issue, despite our best efforts we've been unable to debug where the problem lies and couldn't justify investing more time.

jippi · 2020-04-29T13:22:16Z

It's a bug in the Nomad API SDK - I believe they might have fixed it recently, let me check

jippi · 2020-04-29T13:24:02Z

I believe this is the fix hashicorp/nomad#5970

thisisjaid · 2020-04-29T13:39:44Z

@jippi that seems to be an older bug though and the calls we are seeing returning EOF don't seem to have anything to do with the GC/GcAlloc endpoint, unless there's some underlying logic I am missing?

jippi · 2020-04-29T13:40:16Z

I've pushed hashi-ui:pr-566 up if someone wants to take it for a spin - I can't reproduce it locally or in our prod environments

thisisjaid · 2020-04-29T13:50:01Z

@jippi I've just deployed pr-566 but can see no difference. My initial suspicion was something network-related that may terminate connections prematurely but running Hashi-UI both as a docker container and as a binary on the system directly as well as taking any load-balancing out of the equation did not solve the problem.

There is another clue at least in our case that this seems start happening after a short while scrolling up and down on the Services screen for example. It then errors for a while and then start working again for a bit and so on. Some form of rate-limiting was my initial instinct but I couldn't find any sort of mechanism that could be causing that behavior. So I'm still in the dark as to what the actual cause may be.

melkorm · 2020-05-11T14:26:17Z

Hey 👋
After some investigation and testing with @bizonek27 we think we found the cause of EOFs.

This issue affects all nomad versions from 0.10.3 https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#0103-january-29-2020 where hashicorp/nomad#7002 was introduced

Findings

We started analysing what have changed in nomad releases lately, last reported version with this issue in this thread is 0.10.3 (thanks @MattPumphrey), so we found this:
agent: Added unauthenticated connection timeouts and limits to prevent resource exhaustion. CVE-2020-7218 [GH-7002] so we checked pull requested with this change https://github.com/hashicorp/nomad/pull/7022/files and as you can see connection limits or timeouts are tested agains io.EOF.

Unfortunately this is breaking change from Nomad side as they put default limits too low:

limits {
  https_handshake_timeout   = "5s"
  http_max_conns_per_client = 100
  rpc_handshake_timeout     = "5s"
  rpc_max_conns_per_client  = 100
}

We think why only some people could face EOF errors is due limit of jobs they are running, in our situation we run quite a lot of them that's why it appeared quickly for us, we also tested the limit with lower amount of jobs with default limits which resulted with no errors, but changing the limits to 50 triggered EOF errors.

Fix

Easiest fix for now is to change nomad'a agent limits to higher or 0 (no limits) - be advised that it could result in DoS as reported in hashicorp/nomad#7002

For hashi-ui itself we could add connection limits as configuration or re-use them better to not exhaust them so quickly.

thisisjaid · 2020-05-11T14:33:54Z

Hah that's some nice sleuthing @melkorm. Thank you! We've got a pretty large number of jobs so I'm guessing that's why we saw it straight off.

Am I being incredibly dumb or is this completely lacking from Nomad documentation? We went over it with a fine tooth comb to make sure we weren't missing any limit settings and I've just done it again now and I still can't find any reference to these configuration parameters.

melkorm · 2020-05-11T14:38:01Z

@thisisjaid https://www.nomadproject.io/docs/configuration/#limits it's in agent's configuration docs

Unfortunately Nomad absolutely fails at versioning their docs and finding out which features landed when ends with going trough changelog :/

Once I crashed whole cluster adding configuration for logs in json format 🙈 as we had lower version than the version when this option was introduced.

thisisjaid · 2020-05-11T14:57:38Z

Bah the one damn page I didn't go through as I assumed it was just generic information based on the misleading header ("Overview"). Agreed on doc versioning I've had trouble with that as well with the auto_revert job spec option if I remember correctly. I've recently pointed out some other issues with telemetry config documentation. Nomad docs could generally use some work.

Any case, thanks a bunch for tracking this down! This can probably be closed now.

alievrouw · 2020-06-22T21:01:16Z

Has anyone experimented with increasing HTTP and RPC limits without setting them to 0? I suspect it's very environment dependent (number of jobs running, potentially resources allocated to hashi-ui, etc). I have been experiencing this in our prod environment since upgrading to 10.4, and deployed new limits of 500 for each HTTP and RPC today. So far the errors have stopped. I'm curious if this will be a fix, or just a band-aid that will keep the errors from popping up as quickly. I'm hoping it will smooth things over until the number of jobs we have running significantly increases. Would appreciate anyone's thoughts! And great find @melkorm !!

melkorm · 2020-07-25T17:26:21Z

We still keep this at 0 as we are in control of what talks to our nomad agents. I think that the best solution to this problem is, as you said @alievrouw , to set those limits according to your environment and monitor the situation.

I think we can close this issue, perhaps we could add something to readme about it if anyone runs into it in the future.

mnuic · 2020-10-12T09:05:04Z

We have raised Nomad limits, but with too many services, this still happens. The biggest problem is, when there is an outage in production, Nomad is rescheduling many services to new hosts and at the same time our team is tshooting the outage, result is too many connections open to Nomad servers, that the Nomad itself becomes unstable.

Is there a way to reuse hashi-ui connections to Nomad, so that there isnt't so many? I mean we can switch to Nomad offical UI, but it lacks a lot of stuff.

ekbfh · 2020-11-27T14:42:11Z

I have the same problem, but i haven't any Nomad, only consul.

I increased http_max_conns_per_client in consul config to 1000 and it seems EOF errors has gone away

rlanyi · 2021-03-27T21:59:25Z

I still see this issue with nomad 1.0.3 and consul 1.9.3

mnuic mentioned this issue Oct 12, 2020

GUI bug and unknown allocation id #439

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EOF/connection reset errors in UI w/ Nomad 0.10.4 #557

EOF/connection reset errors in UI w/ Nomad 0.10.4 #557

thisisjaid commented Mar 23, 2020

MattPumphrey commented Apr 3, 2020

bizonek27 commented Apr 29, 2020

thisisjaid commented Apr 29, 2020

jippi commented Apr 29, 2020

jippi commented Apr 29, 2020

thisisjaid commented Apr 29, 2020

jippi commented Apr 29, 2020

thisisjaid commented Apr 29, 2020

melkorm commented May 11, 2020 •

edited

Loading

thisisjaid commented May 11, 2020

melkorm commented May 11, 2020 •

edited

Loading

thisisjaid commented May 11, 2020

alievrouw commented Jun 22, 2020

melkorm commented Jul 25, 2020

mnuic commented Oct 12, 2020

ekbfh commented Nov 27, 2020

rlanyi commented Mar 27, 2021

EOF/connection reset errors in UI w/ Nomad 0.10.4 #557

EOF/connection reset errors in UI w/ Nomad 0.10.4 #557

Comments

thisisjaid commented Mar 23, 2020

MattPumphrey commented Apr 3, 2020

bizonek27 commented Apr 29, 2020

thisisjaid commented Apr 29, 2020

jippi commented Apr 29, 2020

jippi commented Apr 29, 2020

thisisjaid commented Apr 29, 2020

jippi commented Apr 29, 2020

thisisjaid commented Apr 29, 2020

melkorm commented May 11, 2020 • edited Loading

Findings

Fix

thisisjaid commented May 11, 2020

melkorm commented May 11, 2020 • edited Loading

thisisjaid commented May 11, 2020

alievrouw commented Jun 22, 2020

melkorm commented Jul 25, 2020

mnuic commented Oct 12, 2020

ekbfh commented Nov 27, 2020

rlanyi commented Mar 27, 2021

melkorm commented May 11, 2020 •

edited

Loading

melkorm commented May 11, 2020 •

edited

Loading