Handle HTTP/2 GOAWAY messages when writing to outputs #11901

btasker · 2022-09-28T15:00:04Z

Use Case

HTTP/2 introduced a way for servers to tell clients to stop sending requests over a specific connection: the GOAWAY frame. With HTTP/1.1, the server had to wait until it finished processing a request and then send Connection: Close; HTTP/2 does not need to.

AWS's Application Load Balancers (ALBs) sometimes send GOAWAY messages for various reasons. The two main documented ones are:

- the compressed length of any of the headers exceeds 8 KB
- the number of requests served through one connection exceeds 10,000

So, if an output connection is long-lived enough to carry 10,000 writes (which may not take long on a busy instance), the ALB will eventually send a GOAWAY.

Expected behavior

Telegraf should receive the GOAWAY, close the connection, and resubmit over a new one.

Actual behavior

Telegraf logs an error:

    2022-09-12T10:37:00Z E! [outputs.influxdb] When writing to [https://[ALB address]:8086/]: failed doing req: Post "https://[ALB address]:8086/write?db=messaging": http2: Transport: cannot retry err [http2: Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define Request.GetBody to avoid this error
    2022-09-12T10:37:00Z E! [agent] Error writing to outputs.influxdb: could not write any address
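The error text itself suggests defining Request.GetBody so that the transport can replay the body on a new connection. A minimal sketch of that idea, assuming the payload is available as a byte slice; `newWriteRequest`, `replayBody`, the URL, and the payload are hypothetical names and values for illustration, not Telegraf's actual code:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// newWriteRequest (hypothetical helper) builds a POST whose body can be replayed.
func newWriteRequest(url string, payload []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	// http.NewRequest already sets GetBody for *bytes.Reader bodies; it is set
	// explicitly here to show what is needed when the body is some other
	// io.Reader (for example a streaming encoder).
	req.GetBody = func() (io.ReadCloser, error) {
		return io.NopCloser(bytes.NewReader(payload)), nil
	}
	return req, nil
}

// replayBody (hypothetical helper) asks GetBody for a fresh copy of the body
// and reads it fully, much as a transport would before resending a request.
func replayBody(req *http.Request) (string, error) {
	rc, err := req.GetBody()
	if err != nil {
		return "", err
	}
	defer rc.Close()
	b, err := io.ReadAll(rc)
	return string(b), err
}

func main() {
	// Placeholder URL and payload; no network traffic is generated.
	req, err := newWriteRequest("https://example.invalid:8086/write?db=messaging", []byte("cpu value=1"))
	if err != nil {
		panic(err)
	}
	body, err := replayBody(req)
	if err != nil {
		panic(err)
	}
	fmt.Println("replayable body:", body)
}
```

Whether this is practical for a given output depends on how it builds its request body; a body that is streamed once cannot be replayed without buffering it somewhere.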

Additional info

The data isn't lost: because the write didn't complete successfully, it remains in the buffer and is written out at the next flush interval.

However, it does generate a lot of log noise when Telegraf is used as an aggregator and performs many writes.


powersj · 2022-09-28T15:43:31Z

For v1, it looks like an error is logged immediately after the failed write attempt. We could check for the error and then either close the connection, so that the retry gets a fresh one, or reconnect and try again right away. If we do not retry immediately, I think we should still emit some sort of log message to explain why a write did not complete.

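    // needs "errors" and "golang.org/x/net/http2"
    var goAway http2.GoAwayError
    // errors.As also matches a GoAwayError wrapped in another error (for example *url.Error)
    if errors.As(err, &goAway) {
        // close the connection and/or reconnect and try again?
    }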

It seems something similar occurred in #7517, which points to golang/go#36026.

For v2, we could do the same.
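The close-and-retry idea could be sketched roughly as follows. This is a sketch under stated assumptions rather than a proposed patch: `isGoAway`, `writeWithRetry`, and `fakeWriter` are hypothetical names, and matching on the error text is only a stand-in for a typed check, since the error in the log above is plain text produced by the HTTP/2 transport.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isGoAway reports whether err looks like an HTTP/2 GOAWAY failure.
// Assumption: matching on the message text stands in for a typed check.
func isGoAway(err error) bool {
	return err != nil && strings.Contains(err.Error(), "GOAWAY")
}

// writeWithRetry calls write once. If that fails with a GOAWAY-style error,
// it logs why, calls reset to drop the old connection, and retries once.
// Any other result (success or a different error) is returned unchanged.
func writeWithRetry(write func() error, reset func()) error {
	err := write()
	if !isGoAway(err) {
		return err
	}
	fmt.Println("write interrupted by GOAWAY; reconnecting and retrying once")
	reset()
	return write()
}

// fakeWriter is a hypothetical stand-in for an output plugin: its first
// write fails with a GOAWAY-style error and later writes succeed.
type fakeWriter struct{ calls int }

func (f *fakeWriter) write() error {
	f.calls++
	if f.calls == 1 {
		return errors.New("http2: Transport received Server's graceful shutdown GOAWAY")
	}
	return nil
}

func main() {
	f := &fakeWriter{}
	err := writeWithRetry(f.write, func() { /* drop the old connection here */ })
	fmt.Println("error:", err, "attempts:", f.calls)
}
```

In a real output, reset might call CloseIdleConnections on the http.Client so that the retry dials a new connection. Retrying only once keeps a persistently failing endpoint from looping, and the log line covers the case where the write is deferred to the next flush.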

mvahani · 2023-02-21T06:33:25Z

The big problem we see with this one is that the Telegraf agent never recovers, even once the connection is working again. Nothing happens until Telegraf is restarted.

btasker added the "feature request" label on Sep 28, 2022

powersj added the "help wanted" and "size/m" labels on Sep 28, 2022
