Detect disconnected sockets #88

mp911de · 2016-06-24T19:04:53Z

Hello Mark,

Hope you can help us with the following issue with logstash-gelf. We use
it in an AWS environment to collect logs from a couple dozen instances;
our Graylog2 farm is behind an AWS Elastic Load Balancer (ELB) with TCP
balancing.
For apps that log less often than the timeout specified on the ELB (60 s
by default) we experience total log event loss after the initial event
batch on application startup is sent successfully; sometimes we do see
the next batch of events make it through, and sometimes we don't.
Playing with keepAlive and deliveryAttempts had no effect.

I can't claim to completely understand what's going on, but my current
hypothesis, supported by observing network-level traffic with Wireshark,
is as follows:

1. The appender establishes a TCP connection to the ELB and starts
   sending messages.
2. After 60 s of inactivity, the ELB sends us a FIN/ACK, and the
   connection is dropped (as evidenced by the ACK from our side). For
   some reason, this fact doesn't get propagated to the SocketChannel
   used by the appender.
3. If the application logs an event after that, the appender tries to
   reuse the already dropped connection (as evidenced by a number of
   PSH/TCP Retransmit messages). The ELB sends an RST in response,
   finally killing the connection.
4. However, very often the appender doesn't learn about that fact, as
   (I guess due to NIO) all the bytes it wanted to send have already
   been handed off to the OS buffer and the call to
   socketChannel.write() has returned by the time the RST arrives.
5. When the next event arrives, the connection failure is finally
   detected, a new connection is established, the event is logged, and
   the cycle repeats itself.

Reported by @vdenisov

NIO channels don't discover a disconnect without activity. logstash-gelf now performs a read operation before writing data. This way the socket can discover the connection state. Reading is non-blocking so the performance impact is minor.
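
The read-before-write idea can be sketched roughly as follows. This is a minimal illustration, not the actual logstash-gelf code: the class name `ReadBeforeWriteSketch` and the methods `peerHasClosed` and `send` are hypothetical, and only standard `java.nio` calls are used. It assumes the channel is already connected and in non-blocking mode, and that the server never sends data the client cares about (any inbound bytes are discarded).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical sketch; names are illustrative and not taken from logstash-gelf.
class ReadBeforeWriteSketch {

    /** Returns true if a non-blocking read shows the peer has closed or reset the connection. */
    static boolean peerHasClosed(SocketChannel channel) {
        ByteBuffer probe = ByteBuffer.allocate(64);
        try {
            // On a non-blocking channel read() returns immediately:
            // -1 means end-of-stream (the peer's FIN was received),
            //  0 means nothing to read, >0 means inbound bytes (discarded here).
            return channel.read(probe) < 0;
        } catch (IOException e) {
            // A reset or otherwise broken connection surfaces as an IOException.
            return true;
        }
    }

    /** Probes the connection first, then writes the whole message. */
    static void send(SocketChannel channel, byte[] message) throws IOException {
        if (peerHasClosed(channel)) {
            throw new IOException("peer closed the connection; reconnect before sending");
        }
        ByteBuffer buffer = ByteBuffer.wrap(message);
        // Simplification: this loop spins if the OS send buffer is full.
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            SocketChannel client = SocketChannel.open(server.getLocalAddress());
            client.configureBlocking(false);
            SocketChannel accepted = server.accept();

            send(client, "first event".getBytes());
            System.out.println("closed before peer close: " + peerHasClosed(client));

            accepted.close();   // plays the role of the ELB idle timeout
            Thread.sleep(200);  // give the FIN time to arrive on loopback
            System.out.println("closed after peer close: " + peerHasClosed(client));
            client.close();
        }
    }
}
```

A caller would treat the `IOException` from `send` as the signal to close the channel, open a new connection, and resend the event, rather than handing the bytes to a connection the load balancer has already dropped.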

mp911de · 2016-07-18T18:56:37Z

Done.

mp911de added the "type: bug" label Jun 24, 2016

mp911de closed this as completed Jul 18, 2016

mp911de added this to the logstash-gelf 1.11.0 milestone Jul 18, 2016

mp911de mentioned this issue Oct 6, 2016

Corrupted gelf messages with TCP sender #96

Closed
