io: keepalive error detection improvement #7788
Changes from 4 commits
7d3394d
db95149
575d926
4787b1c
efedac3
@@ -174,6 +174,23 @@ int flb_io_net_connect(struct flb_connection *connection,
     return 0;
 }
 
+static void net_io_propagate_critical_error(
+                struct flb_connection *connection)
+{
+    switch (errno) {
+    case EBADF:
+    case ECONNRESET:
+    case EDESTADDRREQ:
+    case ENOTCONN:
+    case EPIPE:
+    case EACCES:
+    case EIO:

Looks good, but wondering how you got a list of critical/unrecoverable errors? EIO seems pretty broad and may not be unrecoverable, though I don't know anything about these errors beside what is stated in the manual - https://man7.org/linux/man-pages/man3/errno.3.html:

I got those from the man page entries for send and recv but you are probably right about EIO.

As suggested I removed

+    case ENETDOWN:
+    case ENETUNREACH:
+        connection->net_error = errno;
+    }
+}
+
 static int fd_io_write(int fd, struct sockaddr_storage *address,
                        const void *data, size_t len, size_t *out_len);
 static int net_io_write(struct flb_connection *connection,
@@ -204,7 +221,13 @@ static int net_io_write(struct flb_connection *connection,
         }
     }
 
-    return fd_io_write(connection->fd, address, data, len, out_len);
+    ret = fd_io_write(connection->fd, address, data, len, out_len);
+
+    if (ret == -1) {
+        net_io_propagate_critical_error(connection);
+    }
+
+    return ret;
 }
 
 static int fd_io_write(int fd, struct sockaddr_storage *address,
@@ -430,6 +453,7 @@ static FLB_INLINE int net_io_write_async(struct flb_coro *co,
             *out_len = total;
 
             net_io_restore_event(connection, &event_backup);
+            net_io_propagate_critical_error(connection);
 
             return -1;
         }
@@ -519,6 +543,9 @@ static ssize_t net_io_read(struct flb_connection *connection,
                      connection->net->io_timeout,
                      flb_connection_get_remote_address(connection));
         }
+        else {
+            net_io_propagate_critical_error(connection);
+        }
 
         return -1;
     }
@@ -597,6 +624,9 @@ static FLB_INLINE ssize_t net_io_read_async(struct flb_coro *co,
 
             goto retry_read;
         }
+        else {
+            net_io_propagate_critical_error(connection);
+        }
 
         ret = -1;
     }
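
To make the screening step in `net_io_propagate_critical_error` easier to reason about in isolation, here is a minimal standalone sketch of the same idea. The names `conn_sketch` and `conn_sketch_screen_errno` are hypothetical and not part of Fluent Bit; the error list mirrors the one in this revision of the diff, including `EIO`, which the review questions. Unlike the PR code, the sketch takes the error value as a parameter rather than reading `errno` directly.

```c
#include <errno.h>

/* Hypothetical stand-in for the connection struct; only the field
 * relevant to this sketch is modeled. */
struct conn_sketch {
    int net_error;   /* 0 means no critical error has been recorded */
};

/* Record `err` on the connection only if it belongs to the screened
 * list of errors treated as unrecoverable.
 * Returns 1 when the error was recorded, 0 when it was ignored. */
static int conn_sketch_screen_errno(struct conn_sketch *c, int err)
{
    switch (err) {
    case EBADF:
    case ECONNRESET:
    case EDESTADDRREQ:
    case ENOTCONN:
    case EPIPE:
    case EACCES:
    case EIO:            /* debated in the review as possibly too broad */
    case ENETDOWN:
    case ENETUNREACH:
        c->net_error = err;
        return 1;
    default:
        /* Transient conditions such as EAGAIN are left alone. */
        return 0;
    }
}
```

Passing the value explicitly is a design choice of the sketch: it lets the caller capture `errno` immediately after the failing call, before any cleanup code has a chance to overwrite it.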
@@ -434,6 +434,13 @@ static int tls_net_read(struct flb_tls_session *session,
             ERR_error_string_n(ret, err_buf, sizeof(err_buf)-1);
             flb_error("[tls] syscall error: %s", err_buf);
 
+            /* According to the documentation these are non-recoverable

Any unrecoverable tls connect errors in the handshake SSL code section? - I suppose these failed connections will never make it to the keepalive available queue, so the current code is sufficient. Line 508 in ebb6ddf

You're right, errors there are caught but even then those never make it to the keepalive idle list.

+             * errors so we don't need to screen them before saving them
+             * to the net_error field.
+             */
+
+            session->connection->net_error = errno;

Just wondering, should the async paths be covered as well? Line 656 in 072439b

None of the

+
             ret = -1;
         }
         else if (ret < 0) {
@@ -489,6 +496,13 @@ static int tls_net_write(struct flb_tls_session *session,
             ERR_error_string_n(ret, err_buf, sizeof(err_buf)-1);
             flb_error("[tls] syscall error: %s", err_buf);
 
+            /* According to the documentation these are non-recoverable
+             * errors so we don't need to screen them before saving them
+             * to the net_error field.
+             */
+
+            session->connection->net_error = errno;
+
             ret = -1;
         }
         else {
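
The review comments above refer to a keepalive "available queue" and "idle list". The diff shown here only sets `net_error`; it does not show where the field is consumed. The following is a sketch of one plausible consumer, assuming the field is checked when a connection is released. Every name in it (`ka_conn`, `ka_pool`, `ka_pool_release`) is hypothetical and does not come from Fluent Bit.

```c
#include <stddef.h>

/* Hypothetical connection: only the error field and a list link. */
struct ka_conn {
    int net_error;           /* non-zero once a critical error was seen */
    struct ka_conn *next;
};

/* Hypothetical keepalive pool holding idle, reusable connections. */
struct ka_pool {
    struct ka_conn *idle;
    size_t idle_count;
};

/* Release a connection after use.
 * Returns 1 if it was parked on the idle list for reuse, or 0 if it
 * carries a recorded error and the caller should destroy it instead. */
static int ka_pool_release(struct ka_pool *pool, struct ka_conn *c)
{
    if (c->net_error != 0) {
        return 0;            /* never recycle a connection known to be broken */
    }

    c->next = pool->idle;    /* push onto the idle list */
    pool->idle = c;
    pool->idle_count++;
    return 1;
}
```

Under this assumption, a connection that failed with a screened error is dropped at release time rather than being handed to the next caller and failing again.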
What about ENOTTY
I believe this appears as:
which is a common network error in Fluent Bit
https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#common-network-errors

Uhm, I can remove `EIO` and add `ENOTTY` but I couldn't find any references to that code being a valid option for `send` or `sendto`.

Could it be by chance that this is an artifact of printing the value of errno when `flb_io_net_write` fails regardless of the reason?

I can't really think of a scenario where that would happen but if it happens we need to find out why so if you have any insight in terms of reproduction it'd be great if you shared it with us.
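
As an illustration of the "artifact" hypothesis in the comment above, the sketch below shows how a stale `errno` can end up in a log: an earlier, unrelated call leaves `ENOTTY` behind, and a later failure path that never sets `errno` gets reported with it. This is not a diagnosis of the Fluent Bit behavior. `sketch_write` and `sketch_stale_errno` are hypothetical names, and the `ENOTTY` result of `isatty()` on a pipe is assumed from Linux behavior.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical write path that fails on its own validation and never
 * touches errno. */
static int sketch_write(int fd, const void *buf, size_t len)
{
    (void) fd;
    (void) buf;

    if (len == 0) {
        return -1;           /* failure reported, errno left as it was */
    }
    return 0;
}

/* Returns the errno value a caller would log after sketch_write()
 * fails, or -1 if the pipe could not be created. */
static int sketch_stale_errno(void)
{
    int fds[2];
    int logged = 0;

    if (pipe(fds) != 0) {
        return -1;
    }

    errno = 0;
    (void) isatty(fds[0]);   /* a pipe is not a terminal: errno is set to ENOTTY */

    if (sketch_write(fds[1], "", 0) == -1) {
        logged = errno;      /* stale value from isatty(), unrelated to the write */
    }

    close(fds[0]);
    close(fds[1]);
    return logged;
}
```

Passing the returned value to `strerror()` would produce the platform's message for `ENOTTY`, even though no ioctl took place in the failing write path.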

I can corroborate that I also see `Inappropriate ioctl for device` errors constantly in customers' production environments.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't sound good at all, do you have any leads? It'd be good if we could try to get to the bottom of it so if any of you have any information about the settings being used in those cases (I suppose it's mostly about the IO plugins) that'd be great.

True! That would be good to figure out. It's not clear whether this error comes up during send/sendto or during something like setting a socket to non-blocking, so it's also not clear whether the error's impact would be lessened by the addition of this code. Hopefully it is, though.

I see IOCTL here:
fluent-bit/src/flb_network.c
Lines 225 to 238 in 450477c

I was curious about that so I verified the code both in this branch and 1.9 but couldn't find any clues.

Unfortunately not offhand. We've found that the log is usually benign and haven't really kept any records of it happening or tried to understand why. I've instructed the rest of the team to keep an eye out for the log in the future so we can try to provide some more info.