Batch processor batch size will grow endlessly on error #1833

Closed
alexrudd opened this issue Apr 22, 2021 · 4 comments · Fixed by #1860
Labels: area:trace (Part of OpenTelemetry tracing) · bug (Something isn't working) · help wanted (Extra attention is needed) · pkg:SDK (Related to an SDK package)

Comments

@alexrudd

Description

Hey, I came across this bug while sending invalid UTF-8 strings through the gRPC driver. The invalid spans cause the failed batch to be rolled over and retried. Once the batch size is at or over the maximum, the batch-size check never fires again, so the batch is only sent on the timeout. As the batch grows it risks hitting the gRPC message size limits imposed by the server. (A simplified sketch of this failure mode follows the log output below.)

How this looks with some debug logging wrapped around the driver:

2021/04/22 15:27:59 ERROR SENDING BATCH (size: 512): rpc error: code = Internal desc = grpc: error unmarshalling request: string field contains invalid UTF-8
2021/04/22 15:28:04 ERROR SENDING BATCH (size: 5342): rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5389485 vs. 5000000)
2021/04/22 15:28:08 ERROR SENDING BATCH (size: 8177): exporter disconnected: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5389485 vs. 5000000)
2021/04/22 15:28:13 ERROR SENDING BATCH (size: 8177): exporter disconnected: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5389485 vs. 5000000)
2021/04/22 15:28:19 ERROR SENDING BATCH (size: 8177): rpc error: code = ResourceExhausted desc = grpc: received message larger than max (8174169 vs. 5000000)
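
To make the failure mode concrete, here is a minimal, self-contained model of it, not the SDK's actual code. It assumes an exact-size trigger (which matches the "check fails permanently" behaviour described above) and a failed batch that is kept for retry:

```go
package main

import (
	"errors"
	"fmt"
)

const maxBatchSize = 512

// export stands in for a gRPC exporter that permanently rejects the payload,
// e.g. because a span contains an invalid UTF-8 string.
func export(batch []int) error {
	return errors.New("rpc error: string field contains invalid UTF-8")
}

func main() {
	var batch []int
	for span := 0; span < 10000; span++ {
		batch = append(batch, span)
		// Exact-size trigger: fires once, then never again after the overshoot.
		if len(batch) == maxBatchSize {
			if err := export(batch); err != nil {
				fmt.Printf("export failed at size %d; batch kept for retry\n", len(batch))
				continue // the failed batch is retained and keeps growing
			}
			batch = batch[:0]
		}
	}
	fmt.Printf("batch grew to %d spans, waiting for the next timer flush\n", len(batch))
}
```

Once the export at size 512 fails, the size check never matches again, so only a periodic timer flush (omitted in the sketch) would ever attempt to send the ever-growing batch.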

Environment

  • OS: Alpine
  • Architecture: i386
  • Go Version: 1.16.2
  • opentelemetry-go version: 0.19.0

Steps To Reproduce

I don't have a reproduction setup outside of my work project, but it should be possible to reproduce this with unit tests on the batch span processor.
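
A rough sketch of such a reproduction, wrapping the exporter to log batch sizes much like the debug logging above. It uses the current go.opentelemetry.io/otel/sdk/trace exporter interface; the v0.19.0 interface differs, so names may need adjusting:

```go
package main

import (
	"context"
	"errors"
	"log"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// failingExporter rejects every batch, mimicking a backend that cannot
// unmarshal the payload.
type failingExporter struct{}

func (failingExporter) ExportSpans(ctx context.Context, spans []sdktrace.ReadOnlySpan) error {
	log.Printf("ERROR SENDING BATCH (size: %d)", len(spans))
	return errors.New("rpc error: string field contains invalid UTF-8")
}

func (failingExporter) Shutdown(ctx context.Context) error { return nil }

func main() {
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(failingExporter{}))
	defer func() { _ = tp.Shutdown(context.Background()) }()

	tracer := tp.Tracer("repro")
	// Keep producing spans; with the pre-fix processor the logged batch size
	// keeps growing past the configured maximum instead of the failed batch
	// being dropped.
	for i := 0; i < 100000; i++ {
		_, span := tracer.Start(context.Background(), "work")
		span.End()
	}
}
```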

Expected behavior

Either the failed batch should be dropped, or there should be a more intelligent retry with an eventual discard.

@alexrudd added the bug (Something isn't working) label on Apr 22, 2021
@Aneurysm9 added the area:trace (Part of OpenTelemetry tracing), pkg:SDK (Related to an SDK package), and priority:p2 labels on Apr 22, 2021
@Aneurysm9
Member

I think dropping the failed batch may be the better of two bad options. Attempting to retry in the batch span processor may compound load on downstream systems when combined with exporters that also attempt to retry. I'm also not sure that the batch span processor has any ability to reliably determine which errors should be retried and which will never succeed.

@MrAlias
Contributor

MrAlias commented Apr 22, 2021

We should verify if there is any guidance from the specification here.

@MrAlias added the help wanted (Extra attention is needed) label on Apr 22, 2021
@Aneurysm9
Member

From the SpanExporter Export(batch) spec:

Any retry logic that is required by the exporter is the responsibility of the exporter. The default SDK SHOULD NOT implement retry logic, as the required logic is likely to depend heavily on the specific protocol and backend the spans are being sent to.

I'd take this to mean that the Batch Span Processor should not attempt to retry.
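
That suggests a fix along these lines: always clear the batch after an export attempt, successful or not, and leave any retries to the exporter. A minimal, self-contained sketch of that behaviour (the type and field names are illustrative, not the SDK's internals, and not necessarily what the eventual fix does):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// processor is a stand-in for the batch span processor.
type processor struct {
	batch  []int
	export func(context.Context, []int) error
}

// flush sends the current batch and clears it even when the export fails,
// so a permanently failing exporter can never make the batch grow without bound.
func (p *processor) flush(ctx context.Context) error {
	if len(p.batch) == 0 {
		return nil
	}
	err := p.export(ctx, p.batch)
	p.batch = p.batch[:0] // drop the batch regardless of the outcome
	return err
}

func main() {
	p := &processor{
		batch: []int{1, 2, 3},
		export: func(context.Context, []int) error {
			return errors.New("rpc error: string field contains invalid UTF-8")
		},
	}
	err := p.flush(context.Background())
	fmt.Printf("flush error: %v; batch size afterwards: %d\n", err, len(p.batch))
}
```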

@paivagustavo
Member

If no one is working on this, I'll start working on it.
