Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTel-Arrow receiver admission control broken in several ways #36074

Closed
jmacd opened this issue Oct 29, 2024 · 0 comments · Fixed by #36141
Closed

OTel-Arrow receiver admission control broken in several ways #36074

jmacd opened this issue Oct 29, 2024 · 0 comments · Fixed by #36141
Labels
bug Something isn't working

Comments

@jmacd
Copy link
Contributor

jmacd commented Oct 29, 2024

Component(s)

receiver/otelarrow

What happened?

Description

Several aspects of the OTel-Arrow admission control mechanism are broken or not working as intended.

  1. As a matter of design, the admission.BoundedQueue has a source of complexity, which causes it to have fallible APIs. The Arrow admission path was performing multiple calls to Acquire, once with compressed data and once with uncompressed data. This leads to error handling that could be avoided if it were not for the complication.
  2. The fallible APIs as used in the OTLP code path (internal/{traces,logs,metrics}) are returning control before finishing the call to obsrecv, so no observability happens when admission control fails. This is a major bug.
  3. There is a race condition in the context-cancelled exit path from Acquire(), in case the waiter was already admitted by a concurrent call to Release(). This condition causes the semaphore to leak, potentially. This is a minor bug.
  4. The semaphore is obeying FIFO discipline, but not intentionally. The internal-to-Lightstep code on which this is modeled uses LIFO for reasons documented here. This is not working as intended.

Proposed solution

First, eliminate the complication that necessitates fallible APIs. The problem is the two calls to Acquire() once with compressed size and once with uncompressed size. Because compressed size is typically so much smaller the uncompressed size, the advantage of these two Acquire calls does not outweigh the complexity cost.

Therefore, we can eliminate fallible APIs from the admission package. This may be done by returning a closure from Acquire() to perform the correct release. The potential for mis-use is greatly reduced.

The bounded queue implementation should transition to LIFO. To avoid fallible APIs, transition to LIFO, and fix the race condition is a substantial change. The BoundedQueue tests will be completely rewritten.

Finally, the OTel-Arrow receiver should perform admission control Acquire() once after it computes the uncompressed size of the request, meaning it will stop using the otlp-pdata-size header. The OTel-Arrow exporter should continue to emit this header while older receivers are still in use, but it can be removed eventually.

Collector version

v0.111.0

Environment information

Environment

Any.

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@jmacd jmacd added bug Something isn't working needs triage New item requiring triage labels Oct 29, 2024
jmacd added a commit to jmacd/opentelemetry-collector-contrib that referenced this issue Oct 29, 2024
andrzej-stencel pushed a commit that referenced this issue Oct 31, 2024
#### Description

Adds a no-op implementation of the BoundedQueue used by the OTel-Arrow
receiver for admission control.

#### Link to tracking issue

Part of #36074.

#### Testing

Adds a new end-to-end test to verify admission control with/without
waiters and unbounded. The test was taken from #36033.

#### Documentation

Added: "0" request_limit_mib indicates no admission control.
@andrzej-stencel andrzej-stencel removed the needs triage New item requiring triage label Oct 31, 2024
djaglowski pushed a commit that referenced this issue Nov 1, 2024
#### Description

Simplifies the admission control logic for OTAP payloads. We call
Acquire() once after uncompressing the data, instead of once with
compressed size and once with the difference.

#### Link to tracking issue

Part of #36074.

#### Testing

One test is replaced with logic to verify certain BoundedQueue actions.

~Note: the OTel-Arrow test suite will not pass with this PR until it
merges with #36078.~ Originally developed in #36033.

#### Documentation

Not user-visible.
ArthurSens pushed a commit to ArthurSens/opentelemetry-collector-contrib that referenced this issue Nov 4, 2024
…try#36081)

#### Description

Adds a no-op implementation of the BoundedQueue used by the OTel-Arrow
receiver for admission control.

#### Link to tracking issue

Part of open-telemetry#36074.

#### Testing

Adds a new end-to-end test to verify admission control with/without
waiters and unbounded. The test was taken from open-telemetry#36033.

#### Documentation

Added: "0" request_limit_mib indicates no admission control.
ArthurSens pushed a commit to ArthurSens/opentelemetry-collector-contrib that referenced this issue Nov 4, 2024
…telemetry#36082)

#### Description

Simplifies the admission control logic for OTAP payloads. We call
Acquire() once after uncompressing the data, instead of once with
compressed size and once with the difference.

#### Link to tracking issue

Part of open-telemetry#36074.

#### Testing

One test is replaced with logic to verify certain BoundedQueue actions.

~Note: the OTel-Arrow test suite will not pass with this PR until it
merges with open-telemetry#36078.~ Originally developed in open-telemetry#36033.

#### Documentation

Not user-visible.
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this issue Dec 17, 2024
…try#36081)

#### Description

Adds a no-op implementation of the BoundedQueue used by the OTel-Arrow
receiver for admission control.

#### Link to tracking issue

Part of open-telemetry#36074.

#### Testing

Adds a new end-to-end test to verify admission control with/without
waiters and unbounded. The test was taken from open-telemetry#36033.

#### Documentation

Added: "0" request_limit_mib indicates no admission control.
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this issue Dec 17, 2024
…telemetry#36082)

#### Description

Simplifies the admission control logic for OTAP payloads. We call
Acquire() once after uncompressing the data, instead of once with
compressed size and once with the difference.

#### Link to tracking issue

Part of open-telemetry#36074.

#### Testing

One test is replaced with logic to verify certain BoundedQueue actions.

~Note: the OTel-Arrow test suite will not pass with this PR until it
merges with open-telemetry#36078.~ Originally developed in open-telemetry#36033.

#### Documentation

Not user-visible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants