Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection keeps breaking with gun_down #158

Closed
ospaarmann opened this issue Jun 15, 2018 · 17 comments
Closed

Connection keeps breaking with gun_down #158

ospaarmann opened this issue Jun 15, 2018 · 17 comments

Comments

@ospaarmann
Copy link

Hey,

I use the Elixir gRPC implementation https://github.com/tony612/grpc-elixir to build a Dgraph client written in Elixir (https://github.com/ospaarmann/exdgraph). grpc-elixir in turn uses gun (1.0.0-pre.5).

My connection keeps breaking every X seconds with gun_down and reason normal.

[error] ExDgraph.Protocol #PID<0.280.0> received unexpected message: {:gun_down, #PID<0.288.0>, :http2, :normal, [], []}

I don't see anything in the server Dgraph server logs. The connection isn't using ssl. Interestingly it happens very regularly every 5 or so seconds so.

I am a bit lost on how to debug this and I have little experience with Erlang, only Elixir so far. My apologies that I cannot provide more information right now but I can deliver whatever helps.

Any hint or help is greatly appreciated!
❤️

@essen
Copy link
Member

essen commented Jun 15, 2018

Try setting a lower keepalive value.

@ospaarmann
Copy link
Author

Hey @essen,

thank you for the suggestion. I tried lowering the keepalive value in the http2 options. This caused the error to appear more often. With the default value of keepalive (5000) it appeared roughly every 20 seconds. Increasing it to keepalive (10000) let to the error appearing every 40 seconds.

Now I set keepalive to :infinity. This seems to fix the issue. Are there any downsides to this that you are aware of?

Thanks

@essen
Copy link
Member

essen commented Jun 15, 2018

Sounds like there's a bug either in Gun's ping message or in the server.

The downside is that this disables pings entirely and so you might not know when a server is gone until the write buffers are full.

@essen
Copy link
Member

essen commented Jun 15, 2018

What's the server?

@ospaarmann
Copy link
Author

True that @essen. I will contact the devs from Dgraph and report back if they can identify a bug so that we can narrow it down.

@ospaarmann
Copy link
Author

@ospaarmann
Copy link
Author

My guess is that this is rather a bug in gun then. The reason for my suspicion is that Dgraph has clients in Go, Java, Javascript, Python, etc. and they don't seem to have this issue.

@essen
Copy link
Member

essen commented Jun 15, 2018

Feel free to double check the code:

Would be good to have server logs for why the connection was dropped, or to trace the modules gun and gun_http2 to get that same reason.

@ospaarmann
Copy link
Author

ospaarmann commented Jun 15, 2018

The server doesn't write anything to the logs when the connection ends.

How would I go about tracing gun and gun_http2 in an Elixir application?

@essen
Copy link
Member

essen commented Jun 15, 2018

In Erlang:

dbg:start().
dbg:tracer().
dbg:tpl(gun, []).
dbg:tpl(gun_http2, []).
dbg:p(all ,c).

@ospaarmann
Copy link
Author

ospaarmann commented Jun 15, 2018

Thanks. Elixir was very similar since you can call Erlang modules with :module.fun(). So

:dbg.start()
:dbg.tracer()
:dbg.tpl(:gun, [])
:dbg.tpl(:gun_http2, [])
:dbg.p(:all ,:c)

The raw output of the trace. I in between are the gun_down messages from my client: https://gist.github.com/ospaarmann/c592fa984f775810b2698bf0b4d82228

@essen
Copy link
Member

essen commented Jun 15, 2018

Server is dropping the connection with {goaway,0,enhance_your_calm,<<"too_many_pings">>}.

I suppose 5 seconds is too much? Not sure why, but I guess setting a larger value would fix your issue.

Cheers.

@essen essen closed this as completed Jun 15, 2018
@ospaarmann
Copy link
Author

Thank you for the fast support! I will try that.

@ryanwinchester
Copy link

@ospaarmann did you ever fix this without setting keepalive to infinity?

@essen
Copy link
Member

essen commented Oct 7, 2020

For what it's worth the default in 2.0 will be infinity. See 556645d

@ryanwinchester
Copy link

ryanwinchester commented Oct 7, 2020

@essen Yeah, I've been trying to find out more info all day and finally getting into some stuff that makes sense. As you probably expected, it seems it's likely just a mismatch in the keepalive settings between gun's default and grpc server defaults.

Example from grpc core there is lots of good info: https://grpc.github.io/grpc/cpp/md_doc_keepalive.html

I think it probably has to do with:

GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS
  • If there are no data/header frames being sent on the transport, this channel argument on the server side controls the minimum time (in milliseconds) that gRPC Core would expect between receiving successive pings. If the time between successive pings is less that than this time, then the ping will be considered a bad ping from the peer. Such a ping counts as a ‘ping strike’. On the client side, this does not have any effect.

And from the FAQ, the last bullet point in the last question:

FAQ

  • Why am I receiving a GOAWAY with error code ENHANCE_YOUR_CALM?
    • A server sends a GOAWAY with ENHANCE_YOUR_CALM if the client sends too many misbehaving pings. For example:
      • if a server has GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS set to false and the client sends pings without there being any call in flight.
      • if the client's GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS setting is lower than the server's GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS.

Defaults

Channel Argument Client Server
GRPC_ARG_KEEPALIVE_TIME_MS INT_MAX (disabled) 7200000 (2 hours)
GRPC_ARG_KEEPALIVE_TIMEOUT_MS 20000 (20 seconds) 20000 (20 seconds)
GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS 0 (false) 0 (false)
GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA 2 2
GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS 300000 (5 minutes) 300000 (5 minutes)
GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS N/A 300000 (5 minutes)
GRPC_ARG_HTTP2_MAX_PING_STRIKES N/A 2

@essen
Copy link
Member

essen commented Oct 7, 2020

And if servers are going to have these kinds of defaults it's not worth doing pings without prior knowledge that they're going to work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants