This repository has been archived by the owner on Jun 29, 2023. It is now read-only.

Detect disconnected sockets #88

Closed
mp911de opened this issue Jun 24, 2016 · 1 comment
Labels
type: bug A general bug

Comments


mp911de commented Jun 24, 2016

Hello Mark,

Hope you can help us with the following issue with logstash-gelf. We use
it in an AWS environment to collect logs from a couple dozen instances; our
Graylog2 farm is behind an AWS Elastic Load Balancer (ELB) with TCP balancing.
For apps that log less often than the idle timeout specified on the ELB (60 s
by default), we lose log events after the initial event batch on
application startup is sent successfully; sometimes the next batch of
events makes it through, and sometimes it doesn't.
Changing keepAlive and deliveryAttempts had no effect.

I can't claim to completely understand what's going on, but my current
hypothesis, supported by observing network-level traffic with Wireshark,
is as follows:

  1. The appender establishes a TCP connection to the ELB and starts sending
    messages.
  2. After 60 s of inactivity, the ELB sends us a FIN/ACK, and the connection
    is dropped (as evidenced by the ACK from our side). For some reason, this
    doesn't get propagated to the SocketChannel used by the appender.
  3. If the application logs an event after that, the appender tries to reuse
    the already dropped connection (as evidenced by a number of PSH/TCP
    Retransmit messages). The ELB sends an RST in response, finally killing
    the connection.
  4. However, very often the appender doesn't learn about that, because (I
    guess due to NIO) all the bytes it wanted to send have already been
    handed off to the OS buffer and the call to socketChannel.write() has
    returned by the time the RST arrives.
  5. When the next event arrives, the connection failure is finally detected,
    a new connection is established, the event is logged, and the cycle
    repeats itself.
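
Step 2 of the hypothesis can be reproduced locally without an ELB. The following is a minimal sketch, not code from logstash-gelf: the class and method names are hypothetical, and a loopback ServerSocketChannel stands in for the load balancer. It shows that SocketChannel.isOpen() and isConnected() keep returning true after the peer has closed its side, because both methods report only the local state of the channel.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class; a loopback server plays the role of the ELB.
final class StaleChannelDemo {

    /**
     * Connects to a local server, lets the server close its side, and returns
     * { client.isOpen(), client.isConnected() } as seen by the client afterwards.
     */
    static boolean[] stateAfterPeerClose() throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 asks the OS for any free port.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                // The "ELB" closes the connection, which sends a FIN to the client.
                server.accept().close();
                // Both flags describe local state only, so they are still true
                // whether or not the FIN has already been received.
                return new boolean[] { client.isOpen(), client.isConnected() };
            }
        }
    }
}
```

Calling StaleChannelDemo.stateAfterPeerClose() returns two true values, which matches the observation that the appender keeps treating the channel as usable.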

Reported by @vdenisov

mp911de added a commit that referenced this issue Jun 24, 2016
NIO channels don't discover a disconnect without activity. logstash-gelf now performs a read operation before writing data. This way the socket can discover the connection state. Reading is non-blocking so the performance impact is minor.
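
The read-before-write check described in this commit message could look roughly like the following. This is a minimal sketch and not the code from the commit: ConnectionProbe and peerHasClosed are hypothetical names, and it assumes the server never sends data the client cares about, since the probe discards any byte it happens to read.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Hypothetical helper; not part of the logstash-gelf API.
final class ConnectionProbe {

    /**
     * Performs a non-blocking one-byte read to learn whether the peer has
     * closed the connection. Returns true if the channel should be replaced.
     */
    static boolean peerHasClosed(SocketChannel channel) throws IOException {
        boolean wasBlocking = channel.isBlocking();
        channel.configureBlocking(false); // the read below must not stall the logger
        try {
            // read() returns -1 once the peer's FIN has been received,
            // 0 if the connection is alive and no data is pending.
            return channel.read(ByteBuffer.allocate(1)) < 0;
        } catch (IOException e) {
            // For example "connection reset" after an RST from the peer.
            return true;
        } finally {
            if (channel.isOpen()) {
                channel.configureBlocking(wasBlocking); // restore the caller's mode
            }
        }
    }
}
```

A sender would call peerHasClosed(channel) before each write and reconnect when it returns true, so the event goes out on a fresh connection instead of being handed to a dead one.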
@mp911de mp911de added the type: bug A general bug label Jun 24, 2016
mp911de added a commit that referenced this issue Jul 16, 2016
NIO channels don't discover a disconnect without activity. logstash-gelf now performs a read operation before writing data. This way the socket can discover the connection state. Reading is non-blocking so the performance impact is minor.

mp911de commented Jul 18, 2016

Done.
