
GoogleCloudExporter: retry_on_failure on single metric failure : whole batch is exported again #3676

Closed
tmodelsk opened this issue Jun 1, 2021 · 7 comments


tmodelsk commented Jun 1, 2021

Describe the bug
Using GoogleCloudExporter to export metrics with retry_on_failure enabled:
When a single metric containing a single datapoint in a batch is rejected by GCP Monitoring (e.g. because that metric has too many labels, though there are many reasons a metric can be rejected), the whole batch is exported again even though all other metrics in the batch were exported successfully.

As a result, even a small percentage of 'corrupted' metrics (e.g. less than 1%) causes whole batches to be re-exported, which significantly reduces performance and in particular increases memory consumption.
The GCP Monitoring CreateTimeSeries endpoint reports which metrics were rejected, so only those could be retried.

Steps to reproduce
Example pipeline: prometheusReceiver > batch(200) > googlecloudExporter
Prometheus metrics: one datapoint out of every 200 has too many labels, e.g. something like this:

# HELP tm_metric_g_1 Custom Metric 1 Name
# TYPE tm_metric_g_1 gauge
tm_metric_g_1{i="1", j="1", l3="3", l4="4", l5="5", l6="6", l7="7", l8="8", l9="9"}  1
...
tm_metric_g_1{i="1", j="100", l3="3", l4="4", l5="5", l6="6", l7="7", l8="8", l9="9"}  100
# HELP tm_metric_g_2 Custom Metric 2 Name
# TYPE tm_metric_g_2 gauge
tm_metric_g_2{i="2", j="1", l3="3", l4="4", l5="5", l6="6", l7="7", l8="8", l9="9"}  1
...
tm_metric_g_2{i="2", j="99", l3="3", l4="4", l5="5", l6="6", l7="7", l8="8", l9="9"}  99
# HELP tm_metric_g_3 Custom Metric 3 Name
# TYPE tm_metric_g_3 gauge
tm_metric_g_3{i="3", j="1", l3="3", l4="4", l5="5", l6="6", l7="7", l8="8", l9="9", l10="10",  l11="11"}  9999

So within a batch(200) there are:

  • two valid metrics with 199 datapoints in total, each datapoint having 9 labels
  • one invalid metric with a single datapoint having 11 labels

What did you expect to see?
I expect that:

  • googleCloudExporter uses the information returned by the GCP CreateTimeSeries endpoint about which metrics were accepted and which were rejected
  • on retry, only the rejected metrics are exported
  • metrics from the batch that were already exported are removed from memory (garbage collected) rather than kept for a future retry

What did you see instead?
The whole batch (with only one rejected metric and datapoint) is kept in memory and exported again on retry.

What version did you use?
v0.24

@tmodelsk tmodelsk added the bug (Something isn't working) label Jun 1, 2021
@tmodelsk tmodelsk changed the title GoogleCloudExporter: retry_on_failure on single metric failure : whole batch is exporter again GoogleCloudExporter: retry_on_failure on single metric failure : whole batch is exported again Jun 1, 2021
punya pushed a commit to punya/opentelemetry-collector-contrib that referenced this issue Jul 21, 2021
@github-actions github-actions bot added the Stale label Aug 16, 2021
@bogdandrutu bogdandrutu removed the Stale label Aug 18, 2021
alexperez52 referenced this issue in open-o11y/opentelemetry-collector-contrib Aug 18, 2021
@dashpole dashpole added the comp:google (Google Cloud components) label May 25, 2022

github-actions bot commented Nov 9, 2022

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 9, 2022
@dashpole dashpole removed the Stale label Nov 9, 2022

dashpole commented Nov 9, 2022

This is still an issue. To fix it we would need to use the NewMetrics error (NewLogs and NewTraces exist as well), which can be used to return only the elements that failed.
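A minimal sketch of what that could look like in the exporter's push function, assuming the collector's consumererror package and the current pdata API; exportToGCP and isPermanent are hypothetical placeholders, not the exporter's actual code:

package gcpexport

import (
	"context"

	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// pushMetrics sketches a push function that, on partial failure, wraps only
// the rejected metrics in a consumererror.Metrics error so that the
// exporterhelper retry logic re-sends just those instead of the whole batch.
func pushMetrics(ctx context.Context, md pmetric.Metrics) error {
	failed, err := exportToGCP(ctx, md) // hypothetical call returning the rejected subset
	if err == nil {
		return nil
	}
	// Metrics that will never be accepted (e.g. too many labels) should not be retried.
	if isPermanent(err) {
		return consumererror.NewPermanent(err)
	}
	// Hand back only the failed items; the already-exported rest of the batch
	// can then be garbage collected instead of being kept for the retry.
	return consumererror.NewMetrics(err, failed)
}

// exportToGCP is a stand-in for the real CreateTimeSeries call; it is assumed
// to be able to report which metrics were rejected.
func exportToGCP(ctx context.Context, md pmetric.Metrics) (pmetric.Metrics, error) {
	return pmetric.NewMetrics(), nil
}

// isPermanent is a placeholder for classifying non-retryable errors.
func isPermanent(err error) bool { return false }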


github-actions bot commented Jan 9, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jan 9, 2023

dashpole commented Jan 9, 2023

We also recommend disabling retry_on_failure for the exporter. The client library used by our exporter already handles retries where appropriate. We plan to change the default to disabled in the near future.

@github-actions github-actions bot removed the Stale label Mar 21, 2023

damemi commented Mar 21, 2023

Like @dashpole mentioned above, we disabled retry_on_failure by default and no longer recommend using it (#19203). Also, our clients will automatically retry where it makes sense (timeout/deadline exceeded), but malformed metrics will always fail, so it makes sense to drop them.

I also don't think it is possible to determine exactly which metric points failed from the GCP endpoint response. It returns the number of successful and failed points, and a message about the failures, but not the points themselves. For example:

{
  "error": {
    "code": 400,
    "message": "One or more TimeSeries could not be written: Request was missing field timeSeries[1].points[0].interval.endTime: The end time of the interval is required.; Request was missing field timeSeries[0].points[0].interval.endTime: The end time of the interval is required.",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.DebugInfo",
        "detail": "One or more TimeSeries could not be written: Request was missing field timeSeries[0].points[0].interval.endTime: The end time of the interval is required.: timeSeries[0]; Request was missing field timeSeries[1].points[0].interval.endTime: The end time of the interval is required.: timeSeries[1]"
      },
      {
        "@type": "type.googleapis.com/google.monitoring.v3.CreateTimeSeriesSummary",
        "totalPointCount": 2,
        "errors": [
          {
            "status": {
              "code": 3
            },
            "pointCount": 2
          }
        ]
      }
    ]
  }
}

The only way we could retry these specific points would be to parse the error message string with a regex or something similar. However, we did add a failure metric in v0.32.
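For reference, a rough sketch of pulling that summary out of the gRPC error details using the Cloud Monitoring v3 Go types (the helper name is ours; the summary only exposes point counts, which is why the failed points themselves cannot be recovered from it):

package gcpexport

import (
	"fmt"

	monitoringpb "cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/grpc/status"
)

// summarizeCreateTimeSeriesError inspects the error returned by
// CreateTimeSeries and prints the point counts from the
// CreateTimeSeriesSummary detail. The summary carries counts only,
// not the identities of the failed points.
func summarizeCreateTimeSeriesError(err error) {
	st, ok := status.FromError(err)
	if !ok {
		return
	}
	for _, detail := range st.Details() {
		if summary, ok := detail.(*monitoringpb.CreateTimeSeriesSummary); ok {
			failed := summary.GetTotalPointCount() - summary.GetSuccessPointCount()
			fmt.Printf("total=%d success=%d failed=%d\n",
				summary.GetTotalPointCount(), summary.GetSuccessPointCount(), failed)
		}
	}
}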

Given that, we can close this issue.

@github-actions

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label May 22, 2023
@github-actions

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 21, 2023