
GoogleCloudExporter: retry_on_failure on single metric failure : whole batch is exported again #3676

Closed
tmodelsk opened this issue Jun 1, 2021 · 7 comments


tmodelsk commented Jun 1, 2021

Describe the bug
Using GoogleCloudExporter to export metrics with retry_on_failure enabled:
When a single metric containing a single datapoint in a batch is rejected by GCP Monitoring (e.g. because that metric has too many labels, though there are many reasons a metric can be rejected), the whole batch is exported again even though all other metrics in the batch were exported successfully.

As a result, even a small percentage of 'corrupted' metrics (e.g. less than 1%) causes whole batches to be re-exported, which significantly reduces performance and in particular increases memory consumption.
The GCP Monitoring CreateTimeSeries endpoint reports which metrics were rejected, so only those could be retried.

Steps to reproduce
Example pipeline: prometheusReceiver > batch(200) > googlecloudExporter
Prometheus metrics: one datapoint out of every 200 has too many labels, e.g. something like this:

# HELP tm_metric_g_1 Custom Metric 1 Name
# TYPE tm_metric_g_1 gauge
tm_metric_g_1{i="1", j="1", l3="3", l4="4", l5="5", l6="6", l7="7", l8="8", l9="9"}  1
...
tm_metric_g_1{i="1", j="100", l3="3", l4="4", l5="5", l6="6", l7="7", l8="8", l9="9"}  100
# HELP tm_metric_g_2 Custom Metric 2 Name
# TYPE tm_metric_g_2 gauge
tm_metric_g_2{i="2", j="1", l3="3", l4="4", l5="5", l6="6", l7="7", l8="8", l9="9"}  1
...
tm_metric_g_2{i="2", j="99", l3="3", l4="4", l5="5", l6="6", l7="7", l8="8", l9="9"}  99
# HELP tm_metric_g_3 Custom Metric 3 Name
# TYPE tm_metric_g_3 gauge
tm_metric_g_3{i="3", j="1", l3="3", l4="4", l5="5", l6="6", l7="7", l8="8", l9="9", l10="10",  l11="11"}  9999

So within a batch(200) there are:

  • two valid metrics with 199 datapoints in total, each datapoint having 9 labels
  • one invalid metric with a single datapoint having 11 labels

What did you expect to see?
I expect that:

  • googleCloudExporter uses the information returned by the GCP CreateTimeSeries endpoint about which metrics were accepted and which were rejected
  • on retry, only the rejected metrics are exported
  • metrics from the batch that were already exported are removed from memory (garbage collected) rather than kept for a future retry

What did you see instead?
The whole batch (with only one rejected metric and datapoint) is kept in memory and exported again on retry.

What version did you use?
v0.24

@tmodelsk tmodelsk added the bug (Something isn't working) label Jun 1, 2021
@tmodelsk tmodelsk changed the title GoogleCloudExporter: retry_on_failure on single metric failure : whole batch is exporter again GoogleCloudExporter: retry_on_failure on single metric failure : whole batch is exported again Jun 1, 2021
punya pushed a commit to punya/opentelemetry-collector-contrib that referenced this issue Jul 21, 2021
@github-actions github-actions bot added the Stale label Aug 16, 2021
@bogdandrutu bogdandrutu removed the Stale label Aug 18, 2021
alexperez52 referenced this issue in open-o11y/opentelemetry-collector-contrib Aug 18, 2021
@dashpole dashpole added the comp:google (Google Cloud components) label May 25, 2022

github-actions bot commented Nov 9, 2022

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 9, 2022
@dashpole dashpole removed the Stale label Nov 9, 2022

dashpole commented Nov 9, 2022

This is still an issue. To fix it we would need to use the NewMetrics error (NewLogs and NewTraces exist as well), which can be used to return only the elements that failed.
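A minimal sketch of what that could look like in the exporter's push function, assuming the collector's consumererror package and the current pdata API; exportToGCP and isPermanent are hypothetical placeholders, not the exporter's actual code:

package gcpexport

import (
	"context"

	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// pushMetrics sketches a push function that, on partial failure, wraps only
// the rejected metrics in a consumererror.Metrics error so that the
// exporterhelper retry logic re-sends just those instead of the whole batch.
func pushMetrics(ctx context.Context, md pmetric.Metrics) error {
	failed, err := exportToGCP(ctx, md) // hypothetical call returning the rejected subset
	if err == nil {
		return nil
	}
	// Metrics that will never be accepted (e.g. too many labels) should not be retried.
	if isPermanent(err) {
		return consumererror.NewPermanent(err)
	}
	// Hand back only the failed items; the already-exported rest of the batch
	// can then be garbage collected instead of being kept for the retry.
	return consumererror.NewMetrics(err, failed)
}

// exportToGCP is a stand-in for the real CreateTimeSeries call; it is assumed
// to be able to report which metrics were rejected.
func exportToGCP(ctx context.Context, md pmetric.Metrics) (pmetric.Metrics, error) {
	return pmetric.NewMetrics(), nil
}

// isPermanent is a placeholder for classifying non-retryable errors.
func isPermanent(err error) bool { return false }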


github-actions bot commented Jan 9, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jan 9, 2023

dashpole commented Jan 9, 2023

We also recommend disabling retry_on_failure for the exporter. The client library used by our exporter already handles retries where appropriate. We plan to change the default to disabled in the near future.

@github-actions github-actions bot removed the Stale label Mar 21, 2023

damemi commented Mar 21, 2023

Like @dashpole mentioned above, we disabled retry_on_failure by default and no longer recommend using it (#19203). Also, our clients will automatically retry where it makes sense (timeout/deadline exceeded), but malformed metrics will always fail, so it makes sense to drop them.

I also don't think it is possible to determine exactly which metric points failed from the GCP endpoint response. It returns the number of successful and failed points, and a message about the failures, but not the points themselves. For example:

{
  "error": {
    "code": 400,
    "message": "One or more TimeSeries could not be written: Request was missing field timeSeries[1].points[0].interval.endTime: The end time of the interval is required.; Request was missing field timeSeries[0].points[0].interval.endTime: The end time of the interval is required.",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.DebugInfo",
        "detail": "One or more TimeSeries could not be written: Request was missing field timeSeries[0].points[0].interval.endTime: The end time of the interval is required.: timeSeries[0]; Request was missing field timeSeries[1].points[0].interval.endTime: The end time of the interval is required.: timeSeries[1]"
      },
      {
        "@type": "type.googleapis.com/google.monitoring.v3.CreateTimeSeriesSummary",
        "totalPointCount": 2,
        "errors": [
          {
            "status": {
              "code": 3
            },
            "pointCount": 2
          }
        ]
      }
    ]
  }
}

The only way we could retry these specific points would be to parse the error message string with a regex or something similar. However, we did add a failure metric in v0.32.
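For reference, a rough sketch of pulling that summary out of the gRPC error details using the Cloud Monitoring v3 Go types (the helper name is ours; the summary only exposes point counts, which is why the failed points themselves cannot be recovered from it):

package gcpexport

import (
	"fmt"

	monitoringpb "cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/grpc/status"
)

// summarizeCreateTimeSeriesError inspects the error returned by
// CreateTimeSeries and prints the point counts from the
// CreateTimeSeriesSummary detail. The summary carries counts only,
// not the identities of the failed points.
func summarizeCreateTimeSeriesError(err error) {
	st, ok := status.FromError(err)
	if !ok {
		return
	}
	for _, detail := range st.Details() {
		if summary, ok := detail.(*monitoringpb.CreateTimeSeriesSummary); ok {
			failed := summary.GetTotalPointCount() - summary.GetSuccessPointCount()
			fmt.Printf("total=%d success=%d failed=%d\n",
				summary.GetTotalPointCount(), summary.GetSuccessPointCount(), failed)
		}
	}
}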

Given that, we can close this issue.

@github-actions

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label May 22, 2023
@github-actions

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 21, 2023