GoogleCloudExporter: retry_on_failure on single metric failure: whole batch is exported again #3676
Comments
* remove IntHistogram
* fix test
* rename histogram in golden dataset
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping the code owners. See Adding Labels via Comments if you do not have permissions to add labels yourself.
This is still an issue. To fix it we would need to use the NewMetrics error (NewLogs and NewTraces exist as well), which we can use to return only the elements that failed.
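A rough sketch of that approach, assuming the exporter could tell which ResourceMetrics were rejected (the exportToGCP and wasRejected helpers below are hypothetical placeholders; consumererror and pmetric are real collector packages):

```go
package gcpexportsketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// exportToGCP and wasRejected stand in for the real Cloud Monitoring client
// call and for whatever logic would identify the rejected data.
func exportToGCP(ctx context.Context, md pmetric.Metrics) error { return nil }

func wasRejected(rm pmetric.ResourceMetrics, err error) bool { return false }

// pushMetrics returns only the failed subset on partial failure, so that
// exporterhelper's retry logic re-sends just those points instead of the
// whole batch.
func pushMetrics(ctx context.Context, md pmetric.Metrics) error {
	err := exportToGCP(ctx, md)
	if err == nil {
		return nil
	}

	// Copy only the ResourceMetrics that the backend rejected.
	failed := pmetric.NewMetrics()
	rms := md.ResourceMetrics()
	for i := 0; i < rms.Len(); i++ {
		rm := rms.At(i)
		if wasRejected(rm, err) {
			rm.CopyTo(failed.ResourceMetrics().AppendEmpty())
		}
	}

	// Wrapping the error with the failed subset tells the retry logic to
	// retry only `failed`, not the original batch.
	return consumererror.NewMetrics(err, failed)
}
```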
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping the code owners. See Adding Labels via Comments if you do not have permissions to add labels yourself.
We also recommend disabling retry_on_failure for the exporter. The client library used by our exporter already handles retries in situations where it is appropriate. We plan to update the default to disabled in the near future.
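For reference, a minimal configuration sketch for that recommendation (the project ID is a placeholder; retry_on_failure is the standard exporterhelper setting):

```yaml
exporters:
  googlecloud:
    project: my-project   # placeholder project ID
    retry_on_failure:
      enabled: false
```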
Like @dashpole mentioned above, we disabled retry_on_failure by default and don't recommend using it anymore (#19203). Also, our clients will automatically retry where it makes sense (timeout/deadline exceeded), but for malformed metrics that will always fail it just makes sense to drop them. I also don't think it is possible to find out just the metric points that failed from the GCP endpoint response. It returns the number of successful and failed points, and a message about the failures, but not the points themselves. For example:
The only way we could retry these specific points is to parse the string error message with a regex or something. However, we did add a failure metric in v0.32. Given that, we can close this issue.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping the code owners. See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
Describe the bug
Using GoogleCloudExporter to export metrics with retry_on_failure enabled:
When a single metric containing a single datapoint in a batch is rejected by GCP Monitoring (for example because it has too many labels, though there are many reasons a metric can be rejected), the export of the whole batch is repeated even though all the other metrics in the batch were exported successfully.
As a result, even a small percentage of 'corrupted' metrics (e.g. less than 1%) causes whole batches to be re-exported, which significantly reduces performance and in particular increases memory consumption.
The GCP Monitoring CreateTimeSeries endpoint reports which metrics were rejected, so only those rejected ones could be retried.
Steps to reproduce
Example pipeline: prometheusReceiver > batch(200) > googlecloudExporter
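A minimal sketch of that pipeline as a collector configuration (the scrape target and project ID are placeholders; send_batch_size corresponds to the batch(200) above):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: example            # placeholder scrape job
          static_configs:
            - targets: ["localhost:9090"]

processors:
  batch:
    send_batch_size: 200

exporters:
  googlecloud:
    project: my-project                # placeholder project ID
    retry_on_failure:
      enabled: true                    # the behavior under discussion

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [googlecloud]
```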
Prometheus metrics: one datapoint per 200 has too many labels, e.g. something like this:
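For illustration only (the metric and label names below are hypothetical), a datapoint with more labels than Cloud Monitoring accepts per metric could be exposed as:

```
# TYPE many_labels_metric gauge
many_labels_metric{l01="v",l02="v",l03="v",l04="v",l05="v",l06="v",l07="v",l08="v",l09="v",l10="v",l11="v",l12="v",l13="v",l14="v",l15="v",l16="v",l17="v",l18="v",l19="v",l20="v",l21="v",l22="v",l23="v",l24="v",l25="v",l26="v",l27="v",l28="v",l29="v",l30="v",l31="v",l32="v"} 1
```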
So within each batch of 200 there are 199 datapoints that GCP Monitoring accepts and 1 datapoint that it rejects.
What did you expect to see?
I expect that only the rejected metric/datapoint is dropped or retried, while the metrics that were exported successfully are not kept in memory or exported again.
What did you see instead?
The whole batch (with only one rejected metric and datapoint) is kept in memory and exported again on retry.
What version did you use?
v0.24