Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTLP PartialSuccess responses should not be interpreted as errors, items should count as "rejected" in pipeline metrics #9243

Open
jmacd opened this issue Jan 8, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@jmacd
Copy link
Contributor

jmacd commented Jan 8, 2024

Describe the bug
In my opinion, the Collector's OTLP exporter handles PartialSuccess incorrectly. The present handling of this field is to return an error to the caller, when in fact the opposite behavior was originally intended in the specification. The meaning of PartialSuccess starts with a successful request. The consumer of the data has acknowledged successful receipt of valid data; the delivery pipeline is operational, but something is maybe ill-formed about the data. There is not an error code. This is a qualified success.

As described in open-telemetry/oteps#238, there is an opportunity for Collectors to count the points rejected, as described in PartialSuccess messages, via a new otel.outcome label, which we might call "rejected". Rejected points are those for which errors must be suppressed because some of the data was correct and all of the data was valid. It would be appropriate to return success and count rejected points, which are those in a category different than dropped which cannot be retried and were part of a successful export request.

It would also be appropriate -- and that the second half of this issue -- for Collector pipelines to report this kind of qualified success in such a way that OTel SDKs receive true information about PartialSuccess when they pass OTLP through a Collector pipeline. In other words, I would be happy if we built support to propagate PartialSuccess all the way (backwards) through the pipeline. Also, components should be able to introduce partial success responses themselves, as discussed in #7669.

As an example, consider a batch processor component forked from the core batchprocessor, here: https://github.com/open-telemetry/otel-arrow/tree/main/collector/processor/concurrentbatchprocessor

This component has been extended to block the caller until their own timeout takes place, or until the final outcome of the batch is known. After the batch export response is known, there are two cases:

  1. All of the items were included in the same export, in which case we have all or none failure.
  2. Some of the items were successful (batches were successful), some were not (batches failed)

This is a model case for the use of PartialSuccess -- there was some successful data which we have to acknowledge with a success, and there was some failure which we cannot allow to look like a transient error. It is incorrect to view it as a permanent error. This batch processor could use PartialSuccess to express back to the client exactly how many of its points were accepted.

We must recognize that this information can't always be perfect. If an item is included in a batch and then part of that batch has PartialSuccess w/ rejected_points != 0, then it will not be possible to assign an integer count correctly to the corresponding inputs. See open-telemetry/oteps#238 for more details.

Steps to reproduce
Return data with PartialSuccess to a collector, for example, where one metric point out of 1000 has a problem with validation. This is the case PartialSuccess was meant to address. The collector returns an error when 999 points were correct.

What did you expect to see?
The response should be a success. The pipeline metrics should count 1 rejected items.

What did you see instead?
I see this code trigger:

	if !(partialSuccess.ErrorMessage() == "" && partialSuccess.RejectedSpans() == 0) {
		return consumererror.NewPermanent(fmt.Errorf("OTLP partial success: \"%s\" (%d rejected)", resp.PartialSuccess().ErrorMessage(), resp.PartialSuccess().RejectedSpans()))
	}

with (1 rejected).

What version did you use?
0.91.1

What config did you use?
An otlp/metrics pipeline with

      otelarrow/metrics:
        endpoint: vendorname.com:443

with a vendor that performs metric point validation.

@jmacd jmacd added the bug Something isn't working label Jan 8, 2024
jmacd added a commit to open-telemetry/otel-arrow that referenced this issue Jan 9, 2024
Described and tracked in
open-telemetry/opentelemetry-collector#9243.

The otelarrow export component copied this code from the core OTLP
exporter, so it has the same bug.
@codeboten
Copy link
Contributor

Linking this previous issue here: #6686

@evan-bradley do you recall why the choice was made to return a permanent error on partial success in the PRs to address that issue? I remember some discussions around it, but can't recall the decision

@evan-bradley
Copy link
Contributor

The decision to return a permanent error was made because it will stop retries for the data and because the consumer APIs don't currently have any way to support reporting partial success counts. I think it would also be okay with what currently exists to return success and simply log the details of the partial success message.

The approach here may help communicate this information to the caller: #9041. We could debate whether or not it should be an error type, but it would at least make the caller aware of the existence of a partial success.

@jmacd jmacd changed the title OTLP PartialSuccess responses should not be interpreted as errors OTLP PartialSuccess responses should not be interpreted as errors, items should count as "rejected" in pipeline metrics Jan 10, 2024
@yurishkuro
Copy link
Member

Vector has an elaborate approach of communicating status of payloads in the pipeline back to the callers - end-to-end acknowledgements.

codeboten pushed a commit that referenced this issue Feb 6, 2024
Changes the treatment of
[PartialSuccess](https://opentelemetry.io/docs/specs/otlp/#partial-success),
making them successful and logging a warning instead of returning an
error to the caller. These responses are meant to convey successful
receipt of valid data which could not be accepted for other reasons,
specifically to cover situations where the OpenTelemetry SDK and
Collector have done nothing wrong, specifically to avoid retries. While
the existing OTLP exporter returns a permanent error (also avoids
retries), it makes the situation look like a total failure when in fact
it is more nuanced.

As discussed in the tracking issue, it is a lot of work to propagate
these "partial" successes backwards in a pipeline, so the appropriate
simple way to handle these items is to return success.

In this PR, we log a warning. In a future PR, (IMO) as discussed in
open-telemetry/oteps#238, we should count the
spans/metrics/logs that are rejected in this way using a dedicated
outcome label.

**Link to tracking Issue:**
Part of #9243

**Testing:** Tests for the "partial success" warning have been added.

**Documentation:** PartialSuccess behavior was not documented. Given the
level of detail in the README, it feels appropriate to continue not
documenting, otherwise lots of new details should be added.

---------

Co-authored-by: Alex Boten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants