-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
io: keepalive error detection improvement #7788
Conversation
Signed-off-by: Leonardo Alminana <[email protected]>
Signed-off-by: Leonardo Alminana <[email protected]>
Signed-off-by: Leonardo Alminana <[email protected]>
Signed-off-by: Leonardo Alminana <[email protected]>
I need to conduct some tests in Windows so I'll leave it as a draft until those are performed. |
I have already verified that this behaves are expected in Windows. |
src/flb_io.c
Outdated
case ENOTCONN: | ||
case EPIPE: | ||
case EACCES: | ||
case EIO: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but wondering how you got a list of critical/unrecoverable errors?
EIO seems pretty broad and may not be unrecoverable, though I don't know anything about these errors beside what is stated in the manual - https://man7.org/linux/man-pages/man3/errno.3.html:
Input/output error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got those from the man page entries for send and recv but you are probably right about EIO.
Just keep in mind that the only places where this mechanism is used are the socket io functions in flb_io.c
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As suggested I removed EIO
and added ENOTTY
.
* to the net_error field. | ||
*/ | ||
|
||
session->connection->net_error = errno; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering, should the async paths be covered as well?
Line 656 in 072439b
ret = flb_tls_net_write_async(coro, connection->tls_session, data, len, out_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of the flb_tls_
functions handle it in flb_tls.c
because tls_net_read
and tls_net_write
in openssl.c
handle it through the SSL_ERROR_SYSCALL
code paths.
@@ -434,6 +434,13 @@ static int tls_net_read(struct flb_tls_session *session, | |||
ERR_error_string_n(ret, err_buf, sizeof(err_buf)-1); | |||
flb_error("[tls] syscall error: %s", err_buf); | |||
|
|||
/* According to the documentation these are non-recoverable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any unrecoverable tls connect errors in the handshake SSL code section? - I suppose these failed connections will never make it to the keepalive available queue, so the current code is sufficient.
Line 508 in ebb6ddf
static int tls_net_handshake(struct flb_tls *tls, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, errors there are caught but even then those never make it to the keepalive idle list.
struct flb_connection *connection) | ||
{ | ||
switch (errno) { | ||
case EBADF: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about ENOTTY
#define ENOTTY 25 /* Not a typewriter */
I believe this appears as:
[error] [src/flb_http_client.c:1199 errno=25] Inappropriate ioctl for device
which is a common network error in Fluent Bit
https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#common-network-errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, I can remove EIO
and add ENOTTY
but I couldn't find any references to that code being a valid option for send
or sendto
.
Could it be by chance that this is an artifact of printing the value of errno when flb_io_net_write
fails regardless of the reason?
I can't really think of a scenario where that would happen but if it happens we need to find out why so if you have any insight in terms of reproduction it'd be great if you shared it with us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can corroborate that I also see Inappropriate ioctl for device
errors constantly in customers' production environments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't sound good at all, do you have any leads? It'd be good if we could try to get to the bottom of it so if any of you have any information about the settings being used in those cases (I suppose it's mostly about the IO plugins) that'd be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True! That would be good to figure out. Not clear if this error is coming up during the sent/sendto or during something like setting a socket to non-blocking... So it's also not clear if the error's impact would be lessened by the addition of this code. Hopefully it is though.
I see IOCTL here:
Lines 225 to 238 in 450477c
int flb_net_socket_blocking(flb_sockfd_t fd) | |
{ | |
#ifdef _WIN32 | |
unsigned long off = 0; | |
if (ioctlsocket(fd, FIONBIO, &off) != 0) { | |
#else | |
if (fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) & ~O_NONBLOCK) == -1) { | |
#endif | |
flb_errno(); | |
return -1; | |
} | |
return 0; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was curious about that so I verified the code both in this branch and 1.9 but couldn't find any clues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't sound good at all, do you have any leads?
Unfortunately not offhand, we've found that the log is usually benign and haven't really kept any records of it happening or tried to understand why. I've instructed the rest of the team to keep an eye out for the log in the future so we can try to provide some more info.
EIO was removed because it was too broad and ENOTTY was added because it seems to happen in some deployments even though it doesn't make a lot of sense. Signed-off-by: Leonardo Alminana <[email protected]>
@matthewfala I think I addressed all of your concerns but I didn't mark the conversations as resolved to make it easier for you to verify it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thank you.
thanks everyone for collaborating on this. may I assume this is good to go ? |
I think so. |
This PR adds error code propagation for some non-recoverable errors which previously caused broken keepalive connections to be returned to the idle connection queue.