googlecloud monitoring exporter drops data for transient failures: "Exporting failed. Dropping data" #31033

Closed
nielm opened this issue Feb 5, 2024 · 7 comments

nielm commented Feb 5, 2024

Component(s)

exporter/googlecloud

What happened?

Description

When the Google Cloud Monitoring exporter fails to export metrics to Google Cloud Monitoring, it drops the data. This occurs even for transient errors, where the attempt should be retried.

Steps to Reproduce

Configure collector, export demo metrics.

Expected Result

Metrics are reliably exported to Google Cloud Monitoring

Actual Result

Metrics are dropped for transient errors (such as "Authentication unavailable", when the auth cookie expires and needs to be refreshed).

Collector version

0.93.0

Environment information

Environment

GKE

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  resourcedetection:
    detectors: [gcp]
    timeout: 10s
    override: false

  k8sattributes:
  k8sattributes/2:
      auth_type: "serviceAccount"
      passthrough: false
      extract:
        metadata:
          - k8s.pod.name
          - k8s.namespace.name
          - k8s.container.name
        labels:
          - tag_name: app.label.component
            key: app.kubernetes.io/component
            from: pod
      pod_association:
        - sources:
            - from: resource_attribute
              name: k8s.pod.ip
        - sources:
            - from: connection


  batch:
    # batch metrics before sending to reduce API usage
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 5s

  memory_limiter:
    # drop metrics if memory usage gets too high
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20

exporters:
  debug:
    verbosity: basic
  googlecloud:
    metric:
      instrumentation_library_labels: false
      service_resource_labels: false

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [k8sattributes, batch, memory_limiter, resourcedetection]
      exporters: [googlecloud]

Log output

2024-02-02T19:49:38.434Z	error	exporterhelper/common.go:95	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "googlecloud", "error": "rpc error: code = Aborted desc = Errors during metric descriptor creation: {(metric: workload.googleapis.com/cloudspannerecosystem/autoscaler/scaler/scaling-failed, error: Too many concurrent edits to the project configuration. Please try again.)}.", "dropped_items": 4}

2024-02-02T20:24:44.897Z	error	exporterhelper/common.go:95	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "googlecloud", "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded", "dropped_items": 12}

2024-02-05T07:43:53.416Z	error	exporterhelper/common.go:95	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "googlecloud", "error": "rpc error: code = Unavailable desc = Authentication backend unavailable.", "dropped_items": 17}
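
To see how much data each kind of failure costs, log lines like the three above can be tallied by gRPC status code. The following is a minimal sketch, assuming the collector's default console format (tab-separated fields with a JSON object last); `dropped_by_code` and `TEMPLATE` are illustrative names, and the sample lines are abbreviated copies of the ones in this report.

```python
import json
import re
from collections import Counter

# Abbreviated copies of the three log lines in this report.
# Assumed layout: tab-separated fields, with a JSON object as the last field.
TEMPLATE = (
    '{ts}\terror\texporterhelper/common.go:95\tExporting failed. Dropping data.\t'
    '{{"kind": "exporter", "data_type": "metrics", "name": "googlecloud", '
    '"error": "rpc error: code = {code} desc = {desc}", "dropped_items": {n}}}'
)
SAMPLE_LOG = "\n".join([
    TEMPLATE.format(ts="2024-02-02T19:49:38.434Z", code="Aborted",
                    desc="Errors during metric descriptor creation (abbreviated)", n=4),
    TEMPLATE.format(ts="2024-02-02T20:24:44.897Z", code="DeadlineExceeded",
                    desc="context deadline exceeded", n=12),
    TEMPLATE.format(ts="2024-02-05T07:43:53.416Z", code="Unavailable",
                    desc="Authentication backend unavailable.", n=17),
])

CODE_RE = re.compile(r"rpc error: code = (\w+)")


def dropped_by_code(log_text):
    """Sum dropped_items per gRPC status code across 'Dropping data' lines."""
    totals = Counter()
    for line in log_text.splitlines():
        if "Exporting failed. Dropping data." not in line:
            continue
        # The structured fields are the JSON object after the last tab.
        fields = json.loads(line.rsplit("\t", 1)[-1])
        match = CODE_RE.search(fields.get("error", ""))
        code = match.group(1) if match else "Unknown"
        totals[code] += fields.get("dropped_items", 0)
    return totals


if __name__ == "__main__":
    for code, count in sorted(dropped_by_code(SAMPLE_LOG).items()):
        print(f"{code}: {count} items dropped")
```

Errors without an `rpc error: code = ...` prefix are grouped under `Unknown` rather than discarded, so the totals still add up to the number of dropped items.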

Additional context

No response

nielm added the bug and needs triage labels on Feb 5, 2024
Contributor

github-actions bot commented Feb 5, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Contributor

dashpole commented Feb 5, 2024

Unfortunately, it isn't safe to retry failed requests to CreateTimeSeries, as the API isn't idempotent. Retrying those requests often will result in additional errors because the timeseries already exists. The retry policy is determined by the client library here: https://github.com/googleapis/google-cloud-go/blob/5bfee69e5e6b46c99fb04df2c7f6de560abe0655/monitoring/apiv3/metric_client.go#L138.

If you are seeing context deadline exceeded errors in particular, I would recommend increasing the timeout to ~45s.

I am curious about the Authentication Backend Unavailable error. I haven't seen that one before. Is there anything unusual about your auth setup?
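
The idempotency concern can be illustrated with a small simulation. This is only a sketch under stated assumptions: `FakeMonitoringBackend` is a hypothetical in-memory stand-in, not the real Cloud Monitoring API, and it assumes the backend rejects a second write for the same series and timestamp. It shows how a request that timed out on the client side may already have been applied, so a blind retry yields a second, different error.

```python
class DeadlineExceeded(Exception):
    """The client gave up waiting; the write may or may not have landed."""


class AlreadyWritten(Exception):
    """A point for the same series and timestamp already exists."""


class FakeMonitoringBackend:
    """Hypothetical in-memory stand-in; not the real Cloud Monitoring API."""

    def __init__(self, lose_first_response=True):
        self.points = {}
        self._lose_next_response = lose_first_response

    def create_time_series(self, series, timestamp, value):
        key = (series, timestamp)
        if key in self.points:
            raise AlreadyWritten(f"point already written for {key}")
        self.points[key] = value  # the write is applied on the server side...
        if self._lose_next_response:
            self._lose_next_response = False
            # ...but the client never sees the response.
            raise DeadlineExceeded("response did not arrive in time")


def export_with_blind_retry(backend, series, timestamp, value, attempts=2):
    """Retry on every error and return the names of the errors seen."""
    errors = []
    for _ in range(attempts):
        try:
            backend.create_time_series(series, timestamp, value)
            break
        except (DeadlineExceeded, AlreadyWritten) as exc:
            errors.append(type(exc).__name__)
    return errors


if __name__ == "__main__":
    backend = FakeMonitoringBackend()
    print(export_with_blind_retry(backend, "cpu", 100, 1.5))
    print(backend.points)
```

In this simulation the data point is stored exactly once, yet the caller sees two failures, which matches the log-spam pattern described above.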

Author

nielm commented Feb 5, 2024

> The retry policy is determined by the client library

Which shows that a CreateTimeSeries RPC is never retried for any condition.

I note that in #19203 and #25900 retry_on_failure was removed from GMP and GCM, because according to #208 "retry was handled by the client libraries", but this was only the case for traces, not metrics. (see comment)

Could this be an oversight that retries were not enabled in the metrics client libraries when they were in Logging and Tracing?

While I understand that some failed requests should not be retried, there are some that should be: specifically ones that say "Please try again"!

For example, the error "Too many concurrent edits to the project configuration. Please try again" always happens when a counter is used for the first time in a project, or when a new attribute is added. It seems that GCM cannot cope with a CreateTimeSeries which updates a metric.

> If you are seeing context deadline exceeded errors in particular, I would recommend increasing the timeout to ~45s.

This is not trivial, as there does not seem to be a config parameter to do this, so it would involve editing the source code and compiling my own version... In any case, for a collector running in GCP, exporting to GCM, it

> Authentication Backend Unavailable error: Is there anything unusual about your auth setup?

Not at all: running on GKE with workload identity, using a custom service account with appropriate permissions.

If there were retries on Unavailable or Deadline Exceeded, this would not be an issue of course.
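
The kind of selective retry being asked for here could look roughly like the sketch below. It is not what the exporter or the Google client library does; `RpcError`, `RETRYABLE_CODES`, and `call_with_retry` are hypothetical names, and the choice of retryable codes is an assumption taken from this comment. Whether such retries are safe for CreateTimeSeries depends on the idempotency concern raised earlier in the thread.

```python
import random
import time

# Assumption: only the codes named in the comment above are treated as transient.
RETRYABLE_CODES = {"Unavailable", "DeadlineExceeded"}


class RpcError(Exception):
    """Hypothetical error type carrying a gRPC-style status code name."""

    def __init__(self, code, message=""):
        super().__init__(f"{code}: {message}")
        self.code = code


def call_with_retry(send, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Call send(), retrying only on RETRYABLE_CODES with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except RpcError as err:
            # Non-transient errors and the final attempt are re-raised as-is.
            if err.code not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


if __name__ == "__main__":
    outcomes = iter([RpcError("Unavailable", "Authentication backend unavailable."), "ok"])

    def flaky_send():
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    print(call_with_retry(flaky_send, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting; any other status code is surfaced immediately on the first failure.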

Contributor

dashpole commented Feb 5, 2024

> Could this be an oversight that retries were not enabled in the metrics client libraries when they were in Logging and Tracing?

No. This was very intentional. It was always wrong to enable retry_on_failure for metrics when using the GCP exporter, and it resulted in many complaints about log spam, since a retried request nearly always fails on subsequent attempts as well.

> For example, the error "Too many concurrent edits to the project configuration. Please try again" always happens when a counter is used for the first time in a project, or when a new attribute is added. It seems that GCM cannot cope with a CreateTimeSeries which updates a metric.

The Too many concurrent edits to the project configuration. error is actually an error from CreateMetricDescriptor, and will be retried next time a metric with that name is exported. It does not affect the delivery of timeseries information, and is only needed to populate the unit and description.
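
One way to read the behaviour described in this paragraph is as a "create descriptor lazily, remember only successes" pattern. The sketch below is an illustration of that reading, not the exporter's actual code; `SketchExporter`, `DescriptorError`, and the two callbacks are hypothetical names.

```python
class DescriptorError(Exception):
    """Hypothetical failure while creating a metric descriptor."""


class SketchExporter:
    """Illustrative only: descriptor creation is best-effort and retried later."""

    def __init__(self, create_descriptor, write_points):
        self._create_descriptor = create_descriptor
        self._write_points = write_points
        self.known_descriptors = set()

    def export(self, metric_name, points):
        if metric_name not in self.known_descriptors:
            try:
                self._create_descriptor(metric_name)
                self.known_descriptors.add(metric_name)
            except DescriptorError:
                # Not cached, so the next export of this metric tries again.
                pass
        # Points are written regardless of the descriptor outcome.
        self._write_points(metric_name, points)


if __name__ == "__main__":
    attempts = []
    written = []

    def create_descriptor(name):
        attempts.append(name)
        if len(attempts) == 1:
            raise DescriptorError("Too many concurrent edits")

    exporter = SketchExporter(create_descriptor, lambda n, p: written.append((n, p)))
    exporter.export("scaling-failed", [1])
    exporter.export("scaling-failed", [2])
    print(attempts, written, exporter.known_descriptors)
```

Under this reading, a failed descriptor creation delays only the unit and description metadata, while the points from both exports are still written.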

Use

exporters:
  googlecloud:
    timeout: 45s

Sorry, it looks like that option isn't documented. We use the standard TimeoutSettings: https://github.com/open-telemetry/opentelemetry-collector/blob/f5a7315cf88e10c0bce0166b35d9227727deaa61/exporter/exporterhelper/timeout_sender.go#L13 in the exporter.

dashpole removed the needs triage label on Feb 5, 2024
dashpole self-assigned this on Feb 5, 2024
Contributor

github-actions bot commented Apr 8, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Apr 8, 2024
Contributor

github-actions bot commented Jun 7, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned on Jun 7, 2024

AkselAllas commented Aug 30, 2024

Hi @dashpole (Created separate issue as well)

I am experiencing transient otel-collector failures when exporting trace batches (the screenshot attached here was not preserved).

I have:

    traces/2:
      receivers: [ otlp ]
      processors: [ tail_sampling, batch ]
      exporters: [ googlecloud ]

I have tried increasing the timeout to 45 seconds, as described here, and I have tried decreasing the batch size from 200 to 100, as suggested here. Neither approach has produced any statistically significant improvement.

Stacktrace:

"caller":"exporterhelper/queue_sender.go:101", "data_type":"traces", "dropped_items":200, "error":"context deadline exceeded", "kind":"exporter", "level":"error", "msg":"Exporting failed. Dropping data.", "name":"googlecloud", "stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1

Any ideas on what to do?
