Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

io: keepalive error detection improvement #7788

Merged
merged 5 commits into from
Aug 31, 2023

Conversation

leonardo-albertovich
Copy link
Collaborator

This PR adds error code propagation for some non-recoverable errors which previously caused broken keepalive connections to be returned to the idle connection queue.

@leonardo-albertovich leonardo-albertovich temporarily deployed to pr August 3, 2023 15:00 — with GitHub Actions Inactive
@leonardo-albertovich leonardo-albertovich temporarily deployed to pr August 3, 2023 15:00 — with GitHub Actions Inactive
@leonardo-albertovich leonardo-albertovich temporarily deployed to pr August 3, 2023 15:00 — with GitHub Actions Inactive
@leonardo-albertovich leonardo-albertovich temporarily deployed to pr August 3, 2023 15:26 — with GitHub Actions Inactive
@leonardo-albertovich leonardo-albertovich marked this pull request as draft August 3, 2023 17:03
@leonardo-albertovich
Copy link
Collaborator Author

I need to conduct some tests in Windows so I'll leave it as a draft until those are performed.

@leonardo-albertovich leonardo-albertovich marked this pull request as ready for review August 4, 2023 13:55
@leonardo-albertovich
Copy link
Collaborator Author

I have already verified that this behaves are expected in Windows.

src/flb_io.c Outdated
case ENOTCONN:
case EPIPE:
case EACCES:
case EIO:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but wondering how you got a list of critical/unrecoverable errors?

EIO seems pretty broad and may not be unrecoverable, though I don't know anything about these errors beside what is stated in the manual - https://man7.org/linux/man-pages/man3/errno.3.html:

Input/output error

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got those from the man page entries for send and recv but you are probably right about EIO.
Just keep in mind that the only places where this mechanism is used are the socket io functions in flb_io.c.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested I removed EIO and added ENOTTY.

* to the net_error field.
*/

session->connection->net_error = errno;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, should the async paths be covered as well?

ret = flb_tls_net_write_async(coro, connection->tls_session, data, len, out_len);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of the flb_tls_ functions handle it in flb_tls.c because tls_net_read and tls_net_write in openssl.c handle it through the SSL_ERROR_SYSCALL code paths.

@@ -434,6 +434,13 @@ static int tls_net_read(struct flb_tls_session *session,
ERR_error_string_n(ret, err_buf, sizeof(err_buf)-1);
flb_error("[tls] syscall error: %s", err_buf);

/* According to the documentation these are non-recoverable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any unrecoverable tls connect errors in the handshake SSL code section? - I suppose these failed connections will never make it to the keepalive available queue, so the current code is sufficient.

static int tls_net_handshake(struct flb_tls *tls,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, errors there are caught but even then those never make it to the keepalive idle list.

struct flb_connection *connection)
{
switch (errno) {
case EBADF:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about ENOTTY

#define	ENOTTY		25	/* Not a typewriter */

I believe this appears as:

[error] [src/flb_http_client.c:1199 errno=25] Inappropriate ioctl for device

which is a common network error in Fluent Bit
https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#common-network-errors

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, I can remove EIO and add ENOTTY but I couldn't find any references to that code being a valid option for send or sendto.

Could it be by chance that this is an artifact of printing the value of errno when flb_io_net_write fails regardless of the reason?

I can't really think of a scenario where that would happen but if it happens we need to find out why so if you have any insight in terms of reproduction it'd be great if you shared it with us.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can corroborate that I also see Inappropriate ioctl for device errors constantly in customers' production environments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't sound good at all, do you have any leads? It'd be good if we could try to get to the bottom of it so if any of you have any information about the settings being used in those cases (I suppose it's mostly about the IO plugins) that'd be great.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True! That would be good to figure out. Not clear if this error is coming up during the sent/sendto or during something like setting a socket to non-blocking... So it's also not clear if the error's impact would be lessened by the addition of this code. Hopefully it is though.

I see IOCTL here:

int flb_net_socket_blocking(flb_sockfd_t fd)
{
#ifdef _WIN32
unsigned long off = 0;
if (ioctlsocket(fd, FIONBIO, &off) != 0) {
#else
if (fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) & ~O_NONBLOCK) == -1) {
#endif
flb_errno();
return -1;
}
return 0;
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious about that so I verified the code both in this branch and 1.9 but couldn't find any clues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't sound good at all, do you have any leads?

Unfortunately not offhand, we've found that the log is usually benign and haven't really kept any records of it happening or tried to understand why. I've instructed the rest of the team to keep an eye out for the log in the future so we can try to provide some more info.

EIO was removed because it was too broad and ENOTTY was added because it
seems to happen in some deployments even though it doesn't make a lot
of sense.

Signed-off-by: Leonardo Alminana <[email protected]>
@leonardo-albertovich leonardo-albertovich temporarily deployed to pr August 9, 2023 16:51 — with GitHub Actions Inactive
@leonardo-albertovich leonardo-albertovich temporarily deployed to pr August 9, 2023 16:51 — with GitHub Actions Inactive
@leonardo-albertovich leonardo-albertovich temporarily deployed to pr August 9, 2023 16:51 — with GitHub Actions Inactive
@leonardo-albertovich
Copy link
Collaborator Author

@matthewfala I think I addressed all of your concerns but I didn't mark the conversations as resolved to make it easier for you to verify it.

@leonardo-albertovich leonardo-albertovich temporarily deployed to pr August 9, 2023 17:26 — with GitHub Actions Inactive
Copy link
Contributor

@matthewfala matthewfala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you.

@edsiper
Copy link
Member

edsiper commented Aug 29, 2023

thanks everyone for collaborating on this. may I assume this is good to go ?

@edsiper edsiper added this to the Fluent Bit v2.1.9 milestone Aug 29, 2023
@leonardo-albertovich
Copy link
Collaborator Author

I think so.

@edsiper edsiper merged commit 8ba3321 into master Aug 31, 2023
@edsiper edsiper deleted the leonardo-master-keepalive-error-detection branch August 31, 2023 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants