
[outputs.influxdb_v2] add exponential backoff, and respect client error responses #8662

Merged
merged 4 commits into master from influxdbv2-retry
Jan 27, 2021

Conversation

@ssoroka ssoroka commented Jan 8, 2021

  • change max wait on error to 30s
  • handle more http response codes, both for future-proofing and for playing nice with gateways and proxies
  • 4xx response codes now result in metrics being dropped, because the server will never accept them. This prevents endless redelivery of failing metrics, which would otherwise block transmission at the head of the line.
  • add exponential backoff as well as respecting the Retry-After header; pick whichever is greater, up to the 30s max (see the sketch after this summary)
  • include the server response code in the logs, and clarify whether the metrics were dropped.

resolves #8571
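
For reference, here is a minimal sketch of the retry behavior the summary describes, using hypothetical type and function names; only backoff, retryAfterHeader, and defaultMaxWait come from the diff quoted in the review below, so treat the rest as an illustration rather than the plugin's actual code:

package sketch

import (
	"fmt"
	"log"
	"math"
	"net/http"
	"strconv"
	"time"
)

// defaultMaxWait caps the delay between retries (30s at the time of this
// summary; raised to 60s later in the PR discussion).
const defaultMaxWait = 30 * time.Second

// retrier holds the minimal state the sketch needs; the real plugin keeps
// equivalent fields on its HTTP client.
type retrier struct {
	retryCount int
	retryTime  time.Time
}

func (r *retrier) handleResponse(resp *http.Response) error {
	switch {
	case resp.StatusCode >= 200 && resp.StatusCode < 300:
		r.retryCount = 0
		return nil
	case resp.StatusCode >= 400 && resp.StatusCode < 500:
		// The server will never accept this batch, so drop it rather than
		// block later batches at the head of the line.
		log.Printf("W! server responded %d; metrics dropped", resp.StatusCode)
		return nil
	default:
		// 5xx (and anything unexpected, e.g. from gateways or proxies): retry.
		r.retryCount++
		wait := r.retryDuration(resp.Header.Get("Retry-After"))
		r.retryTime = time.Now().Add(wait)
		return fmt.Errorf("server responded %d; retrying in %s", resp.StatusCode, wait)
	}
}

func (r *retrier) retryDuration(retryAfter string) time.Duration {
	// Exponential backoff in seconds: 1, 2, 4, 8, ...
	backoff := math.Pow(2, float64(r.retryCount-1))

	// Respect a numeric Retry-After header when the server sends one.
	retryAfterHeader := 0.0
	if s, err := strconv.ParseFloat(retryAfter, 64); err == nil {
		retryAfterHeader = s
	}

	// take the highest value from both, but not over the max wait.
	retry := math.Max(backoff, retryAfterHeader)
	retry = math.Min(retry, defaultMaxWait.Seconds())
	return time.Duration(retry * float64(time.Second))
}

Taking the maximum of the computed backoff and the server's Retry-After, then clamping to defaultMaxWait, is what keeps a misbehaving header from stalling the output indefinitely.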

@ssoroka ssoroka requested a review from reimda January 8, 2021 19:14
}
// take the highest value from both, but not over the max wait.
retry := math.Max(backoff, retryAfterHeader)
retry = math.Min(retry, defaultMaxWait)
Contributor (review comment on the lines above):
I would expect telegraf to obey the Retry-After header even if it's longer than defaultMaxWait. If the backend is in trouble and ops needs to quiet down retries, they will increase Retry-After and telegraf should obey it.

Contributor Author (@ssoroka):
yeah, agreed. I had checked with the db team, and all of their use cases either don't use it (OSS 2.x doesn't use it at all), or it's only used as rate limiting. I suspect it would be prudent to still have a maximum as I wouldn't really want it to sit idle for 2 hours due to an expiry bug. I think we should set defaultMaxWait to whatever we're comfortable with as a max wait time and leave it at that. I can't really see anything above 60s being hugely valuable, but it seems a lot more likely to cause problems.

I'm proposing upping the defaultMaxWait to 60s. let me know what you think.

Contributor:
I prefer trusting the header unconditionally but if you don't want to, it's not a big deal for me because the situation we're talking about is unlikely anyway. Increasing absolute max is the next best thing, whether it's 1 or 5 or 10 minutes.
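
For context, Retry-After can carry either delta-seconds or an HTTP-date (RFC 7231). A hedged sketch of accepting both forms follows (assuming the same net/http, strconv, and time imports as the sketch in the description; this is illustrative, not necessarily what the plugin implements, and the result would still be clamped to defaultMaxWait per this thread):

// parseRetryAfter accepts both forms of the header ("120" or an HTTP-date)
// and reports whether a usable value was found. The caller would still clamp
// the result to defaultMaxWait.
func parseRetryAfter(value string, now time.Time) (time.Duration, bool) {
	if value == "" {
		return 0, false
	}
	// Delta-seconds form, e.g. "Retry-After: 120".
	if secs, err := strconv.ParseFloat(value, 64); err == nil && secs >= 0 {
		return time.Duration(secs * float64(time.Second)), true
	}
	// HTTP-date form, e.g. "Retry-After: Fri, 08 Jan 2021 19:14:00 GMT".
	if t, err := http.ParseTime(value); err == nil {
		if d := t.Sub(now); d > 0 {
			return d, true
		}
	}
	return 0, false
}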

sspaink commented Jan 11, 2021

!signed-cla

@telegraf-tiger telegraf-tiger bot left a comment
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

ssoroka commented Jan 11, 2021

!signed-cla

@telegraf-tiger telegraf-tiger bot left a comment
🤝 ✅ CLA has been signed. Thank you!

@srebhan srebhan (Member) left a comment
LGTM!

@srebhan srebhan added the area/influxdb, feat, plugin/output, and ready for final review labels Jan 26, 2021
@srebhan srebhan self-assigned this Jan 26, 2021
@ssoroka ssoroka merged commit 9c7cf99 into master Jan 27, 2021
@ssoroka ssoroka deleted the influxdbv2-retry branch January 27, 2021 21:07
ssoroka added a commit that referenced this pull request Jan 27, 2021
…or responses (#8662)

* [outputs.influxdb_v2] add exponential backoff, and respect client error responses

* add test

* Update to 60 seconds

* fix test

(cherry picked from commit 9c7cf99)
arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023
…or responses (influxdata#8662)

* [outputs.influxdb_v2] add exponential backoff, and respect client error responses

* add test

* Update to 60 seconds

* fix test
Labels
area/influxdb, feat, plugin/output, ready for final review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

outputs.influxdb_v2 keeps looping on 500 response.
5 participants