connection issues with connect native tasks in bridge networking #10933

Closed
shoenig opened this issue Jul 23, 2021 · 3 comments · Fixed by #10951
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/consul/connect (Consul Connect integration), type/bug


@shoenig
Member

shoenig commented Jul 23, 2021

As pointed out by @apollo13 in #10804 (comment)

Jul 23 18:31:58 sfo3 nomad[1297]:     2021-07-23T18:31:58.342Z [WARN]  client.alloc_runner.runner_hook: error proxying from Consul: alloc_id=e39dffd5-b719-f4f2-09ee-891e9d96520a error="write unix /opt/nomad/data/alloc/e39dffd5-b719-f4f2-09ee-891e9d96520a/alloc/tmp/consul_http.sock->@: write: broken pipe" dest=127.0.0.1:8501 src_local=/opt/nomad/data/alloc/e39dffd5-b719-f4f2-09ee-891e9d96520a/alloc/tmp/consul_http.sock src_remote=@ bytes=2466

I believe those errors are coming from the consul_grpc_sock hook, which I don't think even needs to be present for non-sidecar using connect tasks? Although it's also suspicious there are connection errors...

JK it's the consul_http_sock hook using shared code from the other one.

@shoenig shoenig self-assigned this Jul 26, 2021
@shoenig shoenig added stage/accepted Confirmed, and intend to work on. No timeline commitment though. and removed stage/needs-investigation labels Jul 26, 2021
@shoenig shoenig added this to the 1.1.3 milestone Jul 26, 2021
@apollo13
Contributor

Hi @shoenig -- I wonder if this is Traefik's fault (or can you reproduce this with arbitrary connect native jobs?). If the connect native job were to tear down the connection to the socket mid-response, that could probably also cause issues like this. In that case it should probably not be logged at WARN.

If Traefik really is at fault, I wonder why/how, because it simply uses the Consul SDK :D

@shoenig
Member Author

shoenig commented Jul 27, 2021

I'm fairly sure this is a result of re-using some connection handling bits intended for long-lived gRPC connections between the nomad agent and consul agent, and experiencing disconnects when being used with the Consul HTTP listener instead.

In a world where everything is easy we'd just proxy the HTTP requests, but that doesn't work when either Consul or the app is expecting to use mTLS. We could just get rid of the log statement since the TCP proxy gets re-created on the next HTTP request anyway, but I want to poke around a bit more and see if I can't manage the proxy lifecycle per request.

shoenig added a commit that referenced this issue Jul 27, 2021
When creating a TCP proxy bridge for Connect tasks, we are at the
mercy of either end for managing the connection state. For long-lived
gRPC connections the proxy could reasonably expect to stay
open until the context was cancelled. For the HTTP connections used
by connect native tasks, we experience connection disconnects.
The proxy gets recreated as needed on follow-up requests, however
we also emit a WARN log when the connection is broken. This PR
lowers the WARN to a TRACE, because these disconnects are to be
expected.

Ideally we would be able to proxy at the HTTP layer, however Consul
or the connect native task could be configured to expect mTLS, preventing
Nomad from acting as a man-in-the-middle (MiTM) for the requests.

We also can't manage the proxy lifecycle more intelligently, because
we have no control over the HTTP client or server and how they wish
to manage connection state.

What we have now works, it's just noisy.

Fixes #10933
shoenig added a commit that referenced this issue Jul 28, 2021
jrasell added a commit that referenced this issue Aug 5, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 17, 2022