io: keepalive error detection improvement #7788

leonardo-albertovich · 2023-08-03T14:59:39Z

This PR adds error code propagation for some non-recoverable errors which previously caused broken keepalive connections to be returned to the idle connection queue.

Signed-off-by: Leonardo Alminana <[email protected]>
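
The idea described above can be sketched roughly as follows. This is a minimal illustration, not Fluent Bit's implementation: `struct demo_conn`, `demo_conn_mark_failed` and `demo_conn_release` are hypothetical names, and the only detail taken from this PR is that a connection carries a `net_error` field that is filled in when a non-recoverable error is seen.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a pooled keepalive connection. */
struct demo_conn {
    int fd;
    int net_error;      /* 0 while healthy, otherwise the errno that broke it */
    int in_idle_queue;  /* 1 once the connection was recycled for reuse */
    int destroyed;      /* 1 once the connection was discarded */
};

/* Record a non-recoverable error on the connection (hypothetical helper). */
static void demo_conn_mark_failed(struct demo_conn *conn, int err)
{
    assert(conn != NULL);
    conn->net_error = err;
}

/*
 * Called when the user of the connection is done with it.
 * Healthy connections go back to the idle queue; connections that
 * recorded an error are discarded instead of being recycled.
 * Returns 0 if recycled, -1 if discarded.
 */
static int demo_conn_release(struct demo_conn *conn)
{
    assert(conn != NULL);

    if (conn->net_error != 0) {
        conn->destroyed = 1;
        conn->in_idle_queue = 0;
        return -1;
    }

    conn->in_idle_queue = 1;
    return 0;
}
```

With this shape, the release path only needs to look at `net_error`; it does not need to know which layer (plain socket or TLS) detected the failure.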

leonardo-albertovich · 2023-08-03T17:04:11Z

I need to run some tests on Windows, so I'll leave this as a draft until they are done.

leonardo-albertovich · 2023-08-04T13:56:33Z

I have already verified that this behaves as expected on Windows.

matthewfala · 2023-08-08T18:00:31Z

src/flb_io.c

+    case ENOTCONN:
+    case EPIPE:
+    case EACCES:
+    case EIO:


Looks good, but wondering how you got a list of critical/unrecoverable errors?

EIO seems pretty broad and may not be unrecoverable, though I don't know anything about these errors besides what is stated in the manual (https://man7.org/linux/man-pages/man3/errno.3.html):

Input/output error

I got those from the man page entries for send and recv but you are probably right about EIO.
Just keep in mind that the only places where this mechanism is used are the socket io functions in flb_io.c.

As suggested I removed EIO and added ENOTTY.
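
For readers following along, the classification after this change might look roughly like the sketch below. It only lists the codes that appear in this thread (`EBADF`, `ENOTCONN`, `EPIPE`, `EACCES`, `ENOTTY`), so it should be read as a subset and not as the full list in `src/flb_io.c`. `struct sketch_conn` and the `sketch_` functions are hypothetical, and the error code is passed in explicitly here to keep the example testable, whereas the diff above switches on `errno` directly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in; only the net_error field mirrors the PR. */
struct sketch_conn {
    int net_error;
};

/* Returns 1 if err is treated as non-recoverable in this sketch, else 0. */
static int sketch_is_critical_errno(int err)
{
    switch (err) {
    case EBADF:
    case ENOTCONN:
    case EPIPE:
    case EACCES:
    case ENOTTY:   /* added after review; EIO was dropped as too broad */
        return 1;
    default:
        return 0;
    }
}

/* Copy a critical error code into the connection; ignore everything else. */
static void sketch_propagate_critical_error(struct sketch_conn *conn, int err)
{
    assert(conn != NULL);

    if (sketch_is_critical_errno(err)) {
        conn->net_error = err;
    }
}
```

Transient codes such as `EAGAIN` fall through to the default branch, so a connection that merely needs a retry is left untouched.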

matthewfala · 2023-08-08T19:06:18Z

src/tls/openssl.c

+             * to the net_error field.
+             */
+
+            session->connection->net_error = errno;


Just wondering, should the async paths be covered as well?

fluent-bit/src/flb_io.c

Line 656 in 072439b

ret = flb_tls_net_write_async(coro, connection->tls_session, data, len, out_len);

None of the flb_tls_ functions handle it in flb_tls.c because tls_net_read and tls_net_write in openssl.c handle it through the SSL_ERROR_SYSCALL code paths.

matthewfala · 2023-08-08T19:08:38Z

src/tls/openssl.c

@@ -434,6 +434,13 @@ static int tls_net_read(struct flb_tls_session *session,
            ERR_error_string_n(ret, err_buf, sizeof(err_buf)-1);
            flb_error("[tls] syscall error: %s", err_buf);

+            /* According to the documentation these are non-recoverable


Are there any unrecoverable TLS connect errors in the SSL handshake code section? I suppose these failed connections will never make it to the keepalive available queue, so the current code is sufficient.

fluent-bit/src/tls/openssl.c

Line 508 in ebb6ddf

static int tls_net_handshake(struct flb_tls *tls,

You're right, errors there are caught but even then those never make it to the keepalive idle list.

matthewfala · 2023-08-08T19:28:52Z

src/flb_io.c

+                struct flb_connection *connection)
+{
+    switch (errno) {
+    case EBADF:


What about ENOTTY?

#define ENOTTY 25 /* Not a typewriter */

I believe this appears as:

[error] [src/flb_http_client.c:1199 errno=25] Inappropriate ioctl for device

which is a common network error in Fluent Bit
https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#common-network-errors

Uhm, I can remove EIO and add ENOTTY but I couldn't find any references to that code being a valid option for send or sendto.

Could it be by chance that this is an artifact of printing the value of errno when flb_io_net_write fails regardless of the reason?

I can't really think of a scenario where that would happen, but if it does we need to find out why. If you have any insight in terms of reproduction, it'd be great if you shared it with us.
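
To make the "artifact" hypothesis above concrete, here is a small POSIX-only sketch of how a stale `errno` can survive a later successful call. It is not taken from Fluent Bit and does not claim to be the cause of the logged ENOTTY; `demo_stale_errno` is a hypothetical name. It relies on `isatty()` failing on a pipe (on common libcs that sets `errno` to `ENOTTY`) and on a successful `write()` leaving `errno` alone, which holds in practice although POSIX does not strictly guarantee it.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Stores the errno seen right after a failed isatty() and the errno seen
 * after a later, successful write(). Returns 0 if the sequence ran as
 * expected, -1 otherwise.
 */
static int demo_stale_errno(int *after_isatty, int *after_write)
{
    int fds[2];
    ssize_t written;

    assert(after_isatty != NULL && after_write != NULL);

    if (pipe(fds) != 0) {
        return -1;
    }

    errno = 0;
    if (isatty(fds[1]) != 0) {
        /* A pipe should never be reported as a terminal. */
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    *after_isatty = errno;

    /* This call succeeds, yet errno still holds the earlier value. */
    written = write(fds[1], "x", 1);
    *after_write = errno;

    close(fds[0]);
    close(fds[1]);

    return written == 1 ? 0 : -1;
}
```

If a logging macro prints `errno` after a function that failed for a reason unrelated to a system call, it would report whatever value was left behind in this way.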

I can corroborate that I also see Inappropriate ioctl for device errors constantly in customers' production environments.

That doesn't sound good at all. Do you have any leads? It'd be good if we could try to get to the bottom of it, so if any of you have information about the settings being used in those cases (I suppose it's mostly about the IO plugins), that'd be great.

True! That would be good to figure out. It's not clear whether this error comes up during send/sendto or during something like setting a socket to non-blocking, so it's also not clear whether adding this code would lessen the error's impact. Hopefully it does.

I see IOCTL here:

fluent-bit/src/flb_network.c

Lines 225 to 238 in 450477c

int flb_net_socket_blocking(flb_sockfd_t fd)
{
#ifdef _WIN32
    unsigned long off = 0;
    if (ioctlsocket(fd, FIONBIO, &off) != 0) {
#else
    if (fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) & ~O_NONBLOCK) == -1) {
#endif
        flb_errno();
        return -1;
    }

    return 0;
}
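
As an aside for anyone trying to narrow down where an unexpected errno comes from, a variant that checks each `fcntl` step separately could look like the sketch below. It is POSIX-only, `demo_fd_set_blocking` is a hypothetical name, and it is not a proposed change to `flb_net_socket_blocking`; it only shows how the failing step and its errno could be captured before anything else overwrites them.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/*
 * Clears O_NONBLOCK on fd.
 * Returns 0 on success, -1 if reading the flags failed and -2 if
 * writing them failed; *saved_errno receives the errno of the failing
 * call (or 0 on success).
 */
static int demo_fd_set_blocking(int fd, int *saved_errno)
{
    int flags;

    assert(saved_errno != NULL);
    *saved_errno = 0;

    flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) {
        *saved_errno = errno;   /* capture immediately, before any logging */
        return -1;
    }

    if (fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) == -1) {
        *saved_errno = errno;
        return -2;
    }

    return 0;
}
```

Separating the two calls means a failure in `F_GETFL` is no longer folded into the `F_SETFL` argument, and the return value says which step failed.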

I was curious about that so I verified the code both in this branch and 1.9 but couldn't find any clues.

That doesn't sound good at all, do you have any leads?

Unfortunately not offhand, we've found that the log is usually benign and haven't really kept any records of it happening or tried to understand why. I've instructed the rest of the team to keep an eye out for the log in the future so we can try to provide some more info.

EIO was removed because it was too broad and ENOTTY was added because it seems to happen in some deployments even though it doesn't make a lot of sense. Signed-off-by: Leonardo Alminana <[email protected]>

leonardo-albertovich · 2023-08-09T16:52:20Z

@matthewfala I think I addressed all of your concerns but I didn't mark the conversations as resolved to make it easier for you to verify it.

matthewfala

Looks good! Thank you.

edsiper · 2023-08-29T10:45:16Z

Thanks everyone for collaborating on this. May I assume this is good to go?

leonardo-albertovich · 2023-08-29T12:59:13Z

I think so.

leonardo-albertovich added 4 commits August 3, 2023 16:44

tls: openssl: added code to propagate syscall errors

7d3394d

Signed-off-by: Leonardo Alminana <[email protected]>

upstream: added code to discard failed connections

db95149

Signed-off-by: Leonardo Alminana <[email protected]>

io: added code to detect non-recoverable connection errors

575d926

Signed-off-by: Leonardo Alminana <[email protected]>

io: added code to detect non-recoverable connection errors

4787b1c

Signed-off-by: Leonardo Alminana <[email protected]>

leonardo-albertovich requested review from edsiper, fujimotos and koleini as code owners August 3, 2023 14:59

github-actions bot added the docs-required label Aug 3, 2023

leonardo-albertovich temporarily deployed to pr August 3, 2023 15:00 — with GitHub Actions Inactive

leonardo-albertovich temporarily deployed to pr August 3, 2023 15:26 — with GitHub Actions Inactive

leonardo-albertovich marked this pull request as draft August 3, 2023 17:03

leonardo-albertovich marked this pull request as ready for review August 4, 2023 13:55

matthewfala reviewed Aug 8, 2023


io: changed some codes in net_io_propagate_critical_error

efedac3

EIO was removed because it was too broad and ENOTTY was added because it seems to happen in some deployments even though it doesn't make a lot of sense.

Signed-off-by: Leonardo Alminana <[email protected]>

leonardo-albertovich temporarily deployed to pr August 9, 2023 16:51 — with GitHub Actions Inactive

leonardo-albertovich temporarily deployed to pr August 9, 2023 17:26 — with GitHub Actions Inactive

matthewfala approved these changes Aug 16, 2023


edsiper added this to the Fluent Bit v2.1.9 milestone Aug 29, 2023

edsiper merged commit 8ba3321 into master Aug 31, 2023

edsiper deleted the leonardo-master-keepalive-error-detection branch August 31, 2023 19:08

ZhongRuoyu mentioned this pull request Sep 5, 2023

fluent-bit 2.1.9 Homebrew/homebrew-core#141456

Merged
