Handle HTTP/2 GOAWAY messages when writing to outputs #11901

Open
btasker opened this issue Sep 28, 2022 · 2 comments
Labels
feature request (Requests for new plugin and for new features to existing plugins), help wanted (Request for community participation, code, contribution), size/m (2-4 day effort)

Comments

Contributor

btasker commented Sep 28, 2022

Use Case

HTTP/2 introduced the GOAWAY frame, which lets a server tell a client to stop sending new requests over a specific connection. Under HTTP/1.1, the server had to wait until it had finished processing a request and then send Connection: close; HTTP/2 does not.

AWS's Application Load Balancers (ALBs) sometimes send GOAWAY messages for various reasons. The two main documented ones are:

  • if the compressed length of any of the headers exceeds 8K bytes
  • if the number of requests served through one connection exceeds 10,000

So if an output connection is long-lived enough to carry 10,000 writes (which may not take long on a busy instance), the ALB will eventually send a GOAWAY.

Expected behavior

Telegraf should receive the GOAWAY, close the connection, and resubmit over a new one.

Actual behavior

Telegraf logs an error:

2022-09-12T10:37:00Z E! [outputs.influxdb] When writing to [https://[ALB address]:8086/]: failed doing req: Post "https://[ALB address]:8086/write?db=messaging": http2: Transport: cannot retry err [http2: Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define Request.GetBody to avoid this error
2022-09-12T10:37:00Z E! [agent] Error writing to outputs.influxdb: could not write any address

Additional info

The data isn't lost: because the write didn't complete successfully, it remains in the buffer and will be written out at the next flush interval.

But it does lead to a lot of log noise where Telegraf is being used as an aggregator and performing many writes.

btasker added the feature request label Sep 28, 2022
Contributor

powersj commented Sep 28, 2022

For v1, it looks like when an error happens it is immediately logged after the write attempt. We could check for the error and then either close the connection to have a fresh one on retry or reconnect and try again. I do think if we do not immediately retry, we should still emit some sort of log message to explain why a write did not complete.

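    // errors.As also matches when the error is wrapped (http.Client returns *url.Error).
    var goAway http2.GoAwayError
    if errors.As(err, &goAway) {
        // close the connection and/or reconnect and try again?
    }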

It seems something similar occurred in #7517, which points to golang/go#36026.

For v2, we could do the same.

powersj added the help wanted and size/m labels Sep 28, 2022

mvahani commented Feb 21, 2023

The big problem we see with this one is that the Telegraf agent never recovers, even once the connection is working again. Nothing happens until Telegraf is restarted.
