Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

io: keepalive error detection improvement #7788

Merged
merged 5 commits into from
Aug 31, 2023
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 31 additions & 1 deletion src/flb_io.c
Original file line number Diff line number Diff line change
Expand Up @@ -174,6 +174,23 @@ int flb_io_net_connect(struct flb_connection *connection,
return 0;
}

static void net_io_propagate_critical_error(
struct flb_connection *connection)
{
switch (errno) {
case EBADF:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about ENOTTY

#define	ENOTTY		25	/* Not a typewriter */

I believe this appears as:

[error] [src/flb_http_client.c:1199 errno=25] Inappropriate ioctl for device

which is a common network error in Fluent Bit
https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#common-network-errors

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, I can remove EIO and add ENOTTY but I couldn't find any references to that code being a valid option for send or sendto.

Could it be by chance that this is an artifact of printing the value of errno when flb_io_net_write fails regardless of the reason?

I can't really think of a scenario where that would happen but if it happens we need to find out why so if you have any insight in terms of reproduction it'd be great if you shared it with us.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can corroborate that I also see Inappropriate ioctl for device errors constantly in customers' production environments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't sound good at all, do you have any leads? It'd be good if we could try to get to the bottom of it so if any of you have any information about the settings being used in those cases (I suppose it's mostly about the IO plugins) that'd be great.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True! That would be good to figure out. Not clear if this error is coming up during the sent/sendto or during something like setting a socket to non-blocking... So it's also not clear if the error's impact would be lessened by the addition of this code. Hopefully it is though.

I see IOCTL here:

int flb_net_socket_blocking(flb_sockfd_t fd)
{
#ifdef _WIN32
unsigned long off = 0;
if (ioctlsocket(fd, FIONBIO, &off) != 0) {
#else
if (fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) & ~O_NONBLOCK) == -1) {
#endif
flb_errno();
return -1;
}
return 0;
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious about that so I verified the code both in this branch and 1.9 but couldn't find any clues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't sound good at all, do you have any leads?

Unfortunately not offhand, we've found that the log is usually benign and haven't really kept any records of it happening or tried to understand why. I've instructed the rest of the team to keep an eye out for the log in the future so we can try to provide some more info.

case ECONNRESET:
case EDESTADDRREQ:
case ENOTCONN:
case EPIPE:
case EACCES:
case EIO:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but wondering how you got a list of critical/unrecoverable errors?

EIO seems pretty broad and may not be unrecoverable, though I don't know anything about these errors beside what is stated in the manual - https://man7.org/linux/man-pages/man3/errno.3.html:

Input/output error

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got those from the man page entries for send and recv but you are probably right about EIO.
Just keep in mind that the only places where this mechanism is used are the socket io functions in flb_io.c.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested I removed EIO and added ENOTTY.

case ENETDOWN:
case ENETUNREACH:
connection->net_error = errno;
}
}

static int fd_io_write(int fd, struct sockaddr_storage *address,
const void *data, size_t len, size_t *out_len);
static int net_io_write(struct flb_connection *connection,
Expand Down Expand Up @@ -204,7 +221,13 @@ static int net_io_write(struct flb_connection *connection,
}
}

return fd_io_write(connection->fd, address, data, len, out_len);
ret = fd_io_write(connection->fd, address, data, len, out_len);

if (ret == -1) {
net_io_propagate_critical_error(connection);
}

return ret;
}

static int fd_io_write(int fd, struct sockaddr_storage *address,
Expand Down Expand Up @@ -430,6 +453,7 @@ static FLB_INLINE int net_io_write_async(struct flb_coro *co,
*out_len = total;

net_io_restore_event(connection, &event_backup);
net_io_propagate_critical_error(connection);

return -1;
}
Expand Down Expand Up @@ -519,6 +543,9 @@ static ssize_t net_io_read(struct flb_connection *connection,
connection->net->io_timeout,
flb_connection_get_remote_address(connection));
}
else {
net_io_propagate_critical_error(connection);
}

return -1;
}
Expand Down Expand Up @@ -597,6 +624,9 @@ static FLB_INLINE ssize_t net_io_read_async(struct flb_coro *co,

goto retry_read;
}
else {
net_io_propagate_critical_error(connection);
}

ret = -1;
}
Expand Down
9 changes: 5 additions & 4 deletions src/flb_upstream.c
Original file line number Diff line number Diff line change
Expand Up @@ -726,9 +726,6 @@ struct flb_connection *flb_upstream_conn_get(struct flb_upstream *u)

flb_stream_release_lock(&u->base);

/* Reset errno */
conn->net_error = -1;

err = flb_socket_error(conn->fd);

if (!FLB_EINPROGRESS(err) && err != 0) {
Expand All @@ -740,6 +737,9 @@ struct flb_connection *flb_upstream_conn_get(struct flb_upstream *u)
continue;
}

/* Reset errno */
conn->net_error = -1;

/* Connect timeout */
conn->ts_assigned = time(NULL);
flb_debug("[upstream] KA connection #%i to %s:%i has been assigned (recycled)",
Expand Down Expand Up @@ -803,7 +803,8 @@ int flb_upstream_conn_release(struct flb_connection *conn)
/* If this is a valid KA connection just recycle */
if (u->base.net.keepalive == FLB_TRUE &&
conn->recycle == FLB_TRUE &&
conn->fd > -1) {
conn->fd > -1 &&
conn->net_error == -1) {
/*
* This connection is still useful, move it to the 'available' queue and
* initialize variables.
Expand Down
14 changes: 14 additions & 0 deletions src/tls/openssl.c
Original file line number Diff line number Diff line change
Expand Up @@ -434,6 +434,13 @@ static int tls_net_read(struct flb_tls_session *session,
ERR_error_string_n(ret, err_buf, sizeof(err_buf)-1);
flb_error("[tls] syscall error: %s", err_buf);

/* According to the documentation these are non-recoverable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any unrecoverable tls connect errors in the handshake SSL code section? - I suppose these failed connections will never make it to the keepalive available queue, so the current code is sufficient.

static int tls_net_handshake(struct flb_tls *tls,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, errors there are caught but even then those never make it to the keepalive idle list.

* errors so we don't need to screen them before saving them
* to the net_error field.
*/

session->connection->net_error = errno;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, should the async paths be covered as well?

ret = flb_tls_net_write_async(coro, connection->tls_session, data, len, out_len);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of the flb_tls_ functions handle it in flb_tls.c because tls_net_read and tls_net_write in openssl.c handle it through the SSL_ERROR_SYSCALL code paths.


ret = -1;
}
else if (ret < 0) {
Expand Down Expand Up @@ -489,6 +496,13 @@ static int tls_net_write(struct flb_tls_session *session,
ERR_error_string_n(ret, err_buf, sizeof(err_buf)-1);
flb_error("[tls] syscall error: %s", err_buf);

/* According to the documentation these are non-recoverable
* errors so we don't need to screen them before saving them
* to the net_error field.
*/

session->connection->net_error = errno;

ret = -1;
}
else {
Expand Down