
Make it possible to indicate partial success in an OTLP export response #390

Conversation

joaopgrassi
Member

This PR is a first attempt at solving/addressing open-telemetry/opentelemetry-specification#2454.

At first, I tried having "top-level" properties under the Export*ServiceResponse types like so:

message ExportMetricsServiceResponse {
  optional int64 accepted_data_points = 1;
  optional string error_message = 2;
}

But in my view, this has a drawback. OTLP exporters/senders using the new protos would always have to check the explicit presence of the new fields (e.g. HasAcceptedDatapoints), since they don't know which proto version the server is using. Also, since there's more than one field, would they need to check both? What if we add new fields later?

Then I thought a new type that combines these two fields is a better compromise. This way, exporters only need to check whether the new field is null. If it's not null, we have some guarantee that the other fields are there (we can indicate this with a MUST in the spec, for example).
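For illustration, the shape is roughly the following (field numbers are placeholders):

message ExportMetricsServiceResponse {
  // Presence of this field is what indicates a partial success.
  ExportMetricsPartialSuccess details = 1;
}

message ExportMetricsPartialSuccess {
  int64 accepted_data_points = 1;
  string error_message = 2;
}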

I did some tests with sample apps to confirm the cases for backwards compatibility (code here: joaopgrassi/otlp-partial-success-experimental).

1. Old exporters working with receivers/servers using the new protos

Everything works, as these exporters don't expect the new fields. They won't see them, so they can't inform the user of a partial success - the status quo.

2. New exporters working with receivers/servers using the old protos

Exporters MUST check if the Details field is not null to be able to log a message with the details for troubleshooting.

var client = new MetricsService.MetricsServiceClient(channel);
var req = new ExportMetricsServiceRequest();
var response = client.Export(req);

if (response.Details != null)
{
	Console.WriteLine("Number of data points accepted: {0}", response.Details.AcceptedDataPoints);
	Console.WriteLine("Error message: {0}", response.Details.ErrorMessage);
}

3. New exporters working with receivers/servers using the new protos

Since exporters don't know which proto version servers are using, they always have to perform the check from point 2 above.

@joaopgrassi joaopgrassi requested a review from a team May 9, 2022 09:11
@joaopgrassi joaopgrassi changed the title Make it possible to indicate partial success in an OTLP export request Make it possible to indicate partial success in an OTLP export response May 9, 2022
@jmacd
Contributor

jmacd commented May 10, 2022

Thank you @joaopgrassi, this feature is sorely needed. I have two related requests, which I'll ask for in connection with these protocol changes to see what you think.

1. Structured error details

Lightstep's Prometheus sidecar converts Prometheus WAL logs into OTLP, and Lightstep has fairly strict metric type checking, which results in partial successes. Because we had no way to convey these, I used gRPC/HTTP response headers. The Lightstep backend produces a sample of the metric names (max 1 per request) that are dropping data as a result; this works because the headers contain structured information, e.g. a header like:

otlp-invalid-${failed_metric_name}: failure explanation

The recipient of these headers turns them into metrics about failed metrics, which aggregated over time probabilistically contain all the failing metrics even though each response is limited in size. The feedback is that a single, unstructured error details string is possibly less useful than it could be. On the other hand, OTel has a data model for error logs: could this be a structured error message in the form of an OTLP log message returned by the server? Can we have semantic conventions for conveying sample errors that support aggregation as described above?

2. Client-side telemetry consumed/dropped metrics for OTLP exporters

It is expensive to imagine printing the error details to the console, especially if they're repetitive. This leaves me wondering what the user is expected to do with the error details here, otherwise. Assuming the SDK supports logging, it's likely that the user is simply going to drop the error details back in a payload to the same service or a connected backend service. I would like us to discourage OTel users from echoing error details from server-to-client-to-server again, meaning: OTLP consumers SHOULD consider delivering the error_message to the user of their system directly and OTLP producers SHOULD NOT echo error details back through a default-configured logs SDK for the same process. If the error details log is failing itself, we do not want to create a cycle, but practically speaking that is all that we can expect from an unstructured message.

Users who monitor their OTLP exporters want to know how many spans/metrics/logs are being written and how many are being dropped. Partial successes indicate how many points are dropped when the request succeeds, but we need semantic conventions for counting points that are dropped for all reasons, which include total failures too. Again referring to the OTel Prometheus sidecar linked above as an example, I think we should declare metrics with names like otlp.metrics.consumed (1 per successful point), otlp.metrics.dropped (1 per failed point), otlp.spans.consumed (1 per successful span), otlp.spans.dropped (1 per failed span), and so on. The OTLP exporter will be expected to use the partial_success information about how many spans/metric points/logs succeeded as well as knowledge of how many were sent to generate these metrics. These metrics would be the first semantic conventions for OTLP SDKs to generate metrics about themselves; since these are meta-telemetry metrics I think they should always use cumulative temporality, so that telemetry produced/dropped can be accounted for during network failures, for example. Unlike with logging, metrics cannot create a cycle because the signal is aggregated--if these metrics are dropped they will be accounted for as ordinary dropped metric points.
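As a rough sketch of what such exporter self-metrics could look like in C#, using the System.Diagnostics.Metrics API (the otlp.metrics.* instrument names are the proposed conventions above, not an agreed standard, and the meter name is illustrative):

using System.Diagnostics.Metrics;

// Hypothetical self-monitoring instruments for an OTLP metrics exporter.
class OtlpExporterSelfMetrics
{
    static readonly Meter SelfMeter = new Meter("io.opentelemetry.exporters.otlp");
    static readonly Counter<long> Consumed = SelfMeter.CreateCounter<long>("otlp.metrics.consumed");
    static readonly Counter<long> Dropped = SelfMeter.CreateCounter<long>("otlp.metrics.dropped");

    // sentDataPoints: points the exporter put into the request.
    // acceptedDataPoints: the accepted count from the partial_success response
    // (equal to sentDataPoints when the response carries no partial success).
    public static void RecordExportResult(long sentDataPoints, long acceptedDataPoints)
    {
        Consumed.Add(acceptedDataPoints);
        Dropped.Add(sentDataPoints - acceptedDataPoints);
    }
}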

@tigrannajaryan
Member

Do ExportLogsPartialSuccess, ExportMetricsPartialSuccess and ExportTracesPartialSuccess need to be 3 separate messages, or can we have one ExportPartialSuccess message with an accepted_records field?

Do we anticipate that these 3 may need to evolve in the future in different ways?

@jack-berg
Member

The recipient of these headers turns them into metrics about failed metrics, which aggregated over time probabilistically contain all the failing metrics even though each response is limited in size. The feedback is that a single, unstructured error details string is possibly less useful than it could be.

Can the client actually act on that information? And if it can, is conveying it to the client in a response such that it gets logged to standard out the best way to convey the information? As an alternative, the backend itself could track metrics about rejected data, and alert the user to it in the UI.

we need semantic conventions for counting points that are dropped for all reasons, which include total failures too

👍 OTel Java already records some experimental metrics about the counts of spans, metrics, and logs successfully / unsuccessfully exported via OTLP. We also record similar metrics about batch span / log processor successes and failures.

I think they should always use cumulative temporality, so that telemetry produced/dropped can be accounted for during network failures, for example.

Disagree, since that implies that instrumentation has some say over the temporality. We could do something tricky like using an up-down counter instead of a counter 😁. This is a bit off topic, though.

@joaopgrassi
Member Author

@jmacd thanks for the review! I will try replying to your points:

It is expensive to imagine printing the error details to the console, especially if they're repetitive. This leaves me wondering what the user is expected to do with the error details here, otherwise. Assuming the SDK supports logging, it's likely that the user is simply going to drop the error details back in a payload to the same service or a connected backend service. [...]

That's a good point, that didn't cross my mind.

The initial discussions in the issue led us to think of rather simplistic cases. Having the exporter log the error message to stdout helps to quickly address things without having to go to my vendor's UI and browse for logs, but sure, it's probably not a good practice and not a behavior we should encourage.

Users who monitor their OTLP exporters want to know how many spans/metrics/logs are being written and how many are being dropped. Partial successes indicate how many points are dropped when the request succeeds, but we need semantic conventions for counting points that are dropped for all reasons, which include total failures too [...]

My view is that back-ends probably should also keep track of these metrics, since they are the ones actually doing the "parsing" of OTLP data, in whatever way they do it, but we are not the ones to tell them what to do, I guess.

About introducing these "self-monitoring" metrics to OTLP exporters, I think it makes sense, and we probably should do that (most likely not as part of this PR, though). But I have one concern about this:

Unlike with logging, metrics cannot create a cycle

If we say exporters produce self-monitoring metrics using whatever is returned in partial_success, couldn't it also become a cycle, if the produced self-monitoring metrics themselves cause data points to be dropped? I'm just thinking out loud now, but what if I reached some sort of quota in my back-end that causes the data points to be dropped (or any other condition that would lead to this)? Then producing even more would create a cycle, no? Also, if we go this route, we need to make sure users can opt out, since ingesting this sort of data costs money. Potentially we even need the granularity to opt out of certain metrics only, rather than an "all-or-nothing" switch.

About the topic of having a more structured response and not simply strings: this was briefly touched on in the issue, and I agree with you that it would make sense, at least for metrics. For example, for metrics we could have something like:

message ExportMetricsPartialSuccess {
  // The number of accepted data points
  int64 accepted_data_points = 1;

  // A developer-facing human-readable error message in English. It should
  // both explain the error and offer an actionable resolution to it.
  string error_message = 2;

  repeated RejectedMetric rejected_data_points = 3;
}


message RejectedMetric {
  string error_message = 1; // human-readable reason/message about why it was rejected
  string metric_key = 2;
  uint32 index = 3;
  // ...
}

That would work, I think: since we have the metric key + index of the data point, it's easy to pinpoint/identify it, and it's also easy for users since they probably know their metric keys right away. But I'm not sure how we could achieve the same for Spans and Logs. How do we identify a span/log then? Among a myriad of them, even an index might not be enough.

@joaopgrassi
Member Author

joaopgrassi commented May 13, 2022

Do ExportLogsPartialSuccess, ExportMetricsPartialSuccess and ExportTracesPartialSuccess need to be 3 separate messages, or can we have one ExportPartialSuccess message with an accepted_records field?

Do we anticipate that these 3 may need to evolve in the future in different ways?

@tigrannajaryan

Based just on the thread here around having more structure, I'm leaning more toward having separate types. It makes things more obvious, and if we later need different things for each signal, we are prepared. We already had different types anyway, so it makes sense to me :)

Member

@tigrannajaryan tigrannajaryan left a comment

This looks good to me as the first step to introduce partial success capabilities. I think we can accept it and adopt it in OTLP implementations as is.
If we see the need for a more detailed description of partial success, it can be added as a backward-compatible change in the future.

@joaopgrassi
Member Author

joaopgrassi commented May 23, 2022

I pushed a commit (4bd818f) that changed the property name to partial_success instead of details. I think that's better, thanks @jmacd.

In regard to the rest of the comments/reviews, I summarized the general intentions/requirements as:

  1. It's clear that in the long run we'll need a more structured response, and not only the number of accepted records + a message as this PR suggests

  2. Simply logging the top-level error_message might be undesirable in the long run, but it's already better than the status quo - today, back-ends have to return a 400 if they want to return any information on why some data was rejected. With this PR as is, they can return a 200 with the potential errors in the payload as a string. Exporters can decide whether they log that or not - it's already not consistent, as some exporters do log and others don't when receiving a 400.

  3. For some signals (Metrics) the "extra" structure might already be clear (metric name + index of the failing data point + a reason/error why). But for Spans and Logs it's a bit different. We might also need to consider the Scope + Resource in order to uniquely identify a failing item.

  4. A more structured response allows exporters to generate "self-monitoring" metrics (e.g. otlp.metrics.consumed, otlp.metrics.dropped). This would help application owners/operators to know how their exporters are doing and is a much better approach than simply logging the error message to stdout.
    4.1 This would be a whole piece of work on its own, as we first need to define semantic conventions for it

  5. Automatic retry by senders based on partial_success information would be tricky to achieve, as there are too many nuances and the issues that can occur depend highly on the back-end used. As I see it, most problems that can occur with ingesting metrics, for example, either have to do with quota limits or "bad data" - a retry for those is not helpful. For other scenarios where a retry would make sense (server temporarily unavailable, throttling), the OTLP spec already defines what senders/receivers should do.

To move forward and address all points I suggest:

Points 1 and 2: Go forward with this PR as is, modify the spec about partial success, adding recommendations of what senders/receivers should do with this new property (how servers should populate this, and how exporters should use it)

Point 3: Create an issue to track discussion about the structure, provide suggestions brought up here, and brainstorm. Potentially we agree on changes to be made to the proto and send a follow-up PR for it

Point 4: Create an issue to track discussion about such self-monitoring metrics, potentially agreeing on which ones we want/need and sending a follow-up PR against the spec for it

If everyone agrees, I will go and create such issues and kick-start the discussion.

@jmacd
Contributor

jmacd commented May 23, 2022

@tigrannajaryan asked:

Do ExportLogsPartialSuccess, ExportMetricsPartialSuccess and ExportTracesPartialSuccess need to be 3 separate messages, or can we have one ExportPartialSuccess message with an accepted_records field?

I like this idea. Can we have one type, instead of three?

@joaopgrassi
Member Author

I like this idea. Can we have one type, instead of three?

I'm not sure, to be honest. For the fields I have in this PR now, it makes sense to have only one type to rule them all. But if we decide to include more structure later, the types might differ by signal, and I wonder whether there would be problems splitting them up afterwards. I guess the only way to avoid breaking existing code at compile time then would be to keep the "old" single type and introduce new ones. That way the field numbers are kept and the new types can have other fields, but both should work with old/new receivers. Does that make sense?

@jmacd
Contributor

jmacd commented May 23, 2022

@joaopgrassi How do you feel about replacing the error_message with a structured return value? I think it would be convenient to allow future specification of conventions here while allowing progress now.

I have in mind using the opentelemetry.proto.logs.v1.LogRecord object directly, which is essentially an AnyValue (body) and repeated KeyValue (attributes). Given that OpenTelemetry SDKs are likely to have access to a standard-error logger implementation, it should be a small refactoring to encapsulate a ToErrorMessage(LogRecord) function. An implementation that wants to dump these messages to the console could do so; meanwhile, vendors that want to improve support for conveying and counting specific kinds of errors can do so without using another protocol.
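For illustration, a rough sketch of what that could look like in the proto (the error_details field name and the numbers are placeholders):

import "opentelemetry/proto/logs/v1/logs.proto";

message ExportMetricsPartialSuccess {
  int64 accepted_data_points = 1;

  // The body carries the human-readable message; attributes can carry
  // semantic-convention keys defined later.
  opentelemetry.proto.logs.v1.LogRecord error_details = 2;
}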

@joaopgrassi
Member Author

@jmacd Hum, I'm not sure. If the goal is to have structure and be able to achieve something with it, isn't relying on AnyValue and a list of KeyValue counterintuitive? How would I know what to expect in those blobs of data? Wouldn't that lead to code having to perform checks to find out what's inside? I'm not sure that's better than a simple string with the error.

Just to be clear, I'm not against having more structure, and I think all your use cases are valid. I'm just not sure we have enough knowledge as of now to be able to define such a structure, so on this point I agree with @tigrannajaryan that maybe a simple first version is enough. I might be wrong as well.

I just think that if we really want more structure, introducing proper types with well-defined properties makes more sense. I think this makes the work of deriving whatever we want from the partial success response easier than having to dig through a map of things. WDYT?

@jmacd
Contributor

jmacd commented May 25, 2022

How would I know what to expect in those blobs of data? Wouldn't that lead to code having to perform checks to find out what's inside? I'm not sure that's better than a simple string with the error.

If we return a simple string, I think we're not helping the user with their telemetry problem -- the error message just turns one telemetry problem into a new telemetry problem, which has to be reported as an exception for a human to parse. For well-known error conditions that we decide are worthwhile to handle, this does mean parsing the data to access what's inside. However, for a recipient that does not care to perform special checks (or an older version), the structured message is just a simple message. Having structure lets us define semantic conventions later while still conveying error messages now; it's backwards compatible.

I just think that if we really want more structure, introducing proper types with well-defined properties makes more sense.

I guess the point is that while I'm asking for structure, the data is still just an error message, and the processing I'm imagining is to improve how telemetry-induced error messages are displayed. I would support using new fields for new kinds of information--not error messages--but the message field is just to explain the partial failure--structured or not.

@joaopgrassi
Member Author

joaopgrassi commented May 31, 2022

If we return a simple string, I think we're not helping the user with their telemetry problem -- the error message just turns one telemetry problem into a new telemetry problem, which has to be reported as an exception for a human to parse.

@jmacd yes, I completely agree with you on this. As I mentioned earlier, I'm all in for a well-structured response rather than a simple string.

What I'm not sure about (maybe I'm not getting it) is whether LogRecord is a good candidate for it. I kind of see the value in the AnyValue body + attributes list idea. It is definitely flexible, since we can just add anything there (reason, error codes, etc.), but what I'm trying to challenge here is: isn't too much flexibility also bad? Doesn't this approach complicate things on the sender's side? Since it can basically hold anything, we would have to define semantics and everything around that to make sure senders that want to handle it better can do so in a consistent way.

That's why I'm wondering whether we should consider a type with well-defined properties that allows senders to derive whatever we decide is important/relevant, e.g. your example with self-monitoring metrics.

Also: LogRecord contains other fields like trace id, span id, and timestamp, which I'm not sure are relevant for this. Or were you not suggesting to actually use this type, but just to copy those two fields (body and attributes)?

However, for a recipient that does not care to perform special checks (or an older version), the structured message is just a simple message. Having structure lets us define semantic conventions later while still conveying error messages now; it's backwards compatible.

Yes, this is true for both approaches we are discussing - using the LogRecord or adding well-defined fields to the new types introduced in this PR. Recipients that don't care (or old ones) will simply ignore the message.

@jmacd
Contributor

jmacd commented Jun 1, 2022

Also: LogRecord contains other fields like trace id, span id, and timestamp, which I'm not sure are relevant for this.
Or were you not suggesting to actually use this type, but just to copy those two fields (body and attributes)?

I was being vague because either approach is OK for me, and I'm trying to point out that this return value is in every respect just a log message. There are optional fields in the LogRecord that I do not imagine being useful for a server-to-client response, this is true.

How about using severity to return info messages (e.g., "info: welcome, your quota is 1000 points/hour") or warnings without dropping any points ("warning: floating-point-to-integer conversion is happening")?

If the exporter is being traced itself, there is a well-defined span/trace associated with the request and it could be used in the error message.

The timestamp could tell you that the server has a lot of clock skew relative to the client. This is important if your error message is that points are not being accepted due to out-of-range timestamps.

@jmacd
Contributor

jmacd commented Jun 1, 2022

@joaopgrassi

Since it can basically hold anything, we would have to define semantics and everything around that to make sure senders that want to handle it better can do so in a consistent way.

The same is true for a bare string called "error_message". It can hold anything and it can be printed. I think you are reasoning that the client-side logic for say reporting which metrics are experiencing errors will be simpler if we spell it out in .proto structure, as opposed to semantic-conventional structure. If we take the explicit-structure route, it means that old clients sending to new servers won't see any benefit until they upgrade. If we structure an error message with semantic conventions, it prints as an error message to the old client while it serves as both an error message and structured detail to the new client.

Either way, I don't want the structure of the error message to block this work. I care about monitoring the number of dropped metric points much more than the client being able to identify which metric names are being dropped. (After all, we are able to report which metric names are being dropped directly through our SaaS; clients don't need to track this detail.)

@joaopgrassi
Member Author

joaopgrassi commented Jun 2, 2022

@jmacd

If the exporter is being traced itself, there is a well-defined span/trace associated with the request and it could be used in the error message.

Hum, I think I'm a bit confused by this. The server/receiver is the one that ingests the data and produces the partial_success response (if necessary) to senders; how would it include a span id in the response? The exporter might be traced, yes, but I'm failing to see how that relates to the response the server is producing.

The same is true for a bare string called "error_message". It can hold anything and it can be printed. I think you are reasoning that the client-side logic for say reporting which metrics are experiencing errors will be simpler if we spell it out in .proto structure, as opposed to semantic-conventional structure.

Yes. I personally prefer to have explicitly defined fields in the .protos instead of relying on conventions, be it in a string or via attributes we may define. I think conventions just make the code harder and more complicated + open to mistakes, as implementors have to follow the conventions correctly. If we offer concrete fields, there's no way to "mess it up".

If we take the explicit-structure route, it means that old clients sending to new servers won't see any benefit until they upgrade. If we structure an error message with semantic conventions, it prints as an error message to the old client while it serves as both an error message and structured detail to the new client.

The way I see it, old clients/senders cannot use anything we are defining here anyway. Both approaches (explicit structure vs. a string error message with conventions) require adding new fields to the Export*ServiceResponse proto messages. Old clients will not have the new fields, so there's nothing we can do. As I mentioned in the OP along with the test app, only new clients/senders can parse it, but they also always need to check whether it exists, as they might be talking to old servers/receivers.

Either way, I don't want the structure of the error message to block this work. I care about monitoring the number of dropped metric points much more than the client being able to identify which metric names are being dropped. (After all, we are able to report which metric names are being dropped directly through our SaaS; clients don't need to track this detail.)

Yes, agreed. That is possible with the current proposal, as we can derive the number of rejected data points, spans, and logs. As mentioned, I think that is a sensible first step, and we can discuss the structure in separate issues as I suggested.

@joaopgrassi
Member Author

joaopgrassi commented Jun 9, 2022

@jmacd, @bogdandrutu, @reyang, @jack-berg, @tigrannajaryan, et al.: pinging to see whether we have reached a consensus here. My understanding of the current state of the discussion is:

The current PR is a good start and improves the current status quo because:

  • Allows receivers to partially accept data
  • Allows receivers to return some information to senders about the partial success - today, the only possibility is returning a 400, where something can be added to the response body
  • Allows senders to potentially derive metrics around number of dropped logs/spans/datapoints

But it is not where we want to stay because:

  • A string with only an error message is not so useful - it will just become a log entry and that's it
  • We can't do more complex handling of a partial success - derive more metrics, or maybe even retry where we can

As I suggested above, we could go with this as is, since it's already a great improvement over the status quo. Then I could create separate issues to discuss which more complex scenarios we want to handle per signal, and from there add a more complex structure to the messages introduced here.

What do you all think?

@tigrannajaryan
Member

What do you all think?

The PR looks good to me. I already approved. The plan for the future also sounds good to me.

Contributor

@jmacd jmacd left a comment

I don't care for the error details field, but the partial success fields are important.

@joaopgrassi
Member Author

joaopgrassi commented Jun 21, 2022

I created an issue to continue the discussions from here around more structure and next steps for metrics partial success responses. #404.

@carlosalberto
Contributor

Ping @bogdandrutu - waiting for one more review from you ;)

@jmacd
Contributor

jmacd commented Jul 11, 2022

Looks good! Let's get open-telemetry/opentelemetry-specification#2636 approved and merged first.

// The number of accepted log records
//
// If the request is partially successful and the number of
// accepted log records is known, the server MUST set
Member

What is the situation when the number of accepted log records is unknown and why is it useful for the sender to know about this situation specifically? If there is no such situation, I suggest that we get rid of this and require that the server always sets this number (if it doesn't know, then set it either to 0 or to the total number - either way is fine). Doing so will eliminate the need for an optional accepted_log_records field (optional fields are poorly supported in some Protobuf libraries and are a pain to deal with).

Member Author

What is the situation when the number of accepted log records is unknown and why is it useful for the sender to know about this situation specifically

I also tend to think this shouldn't happen. If the server doesn't know the number of accepted "things", it should return an empty response, without populating the partial_success field, just as it is done today.

The discussion around optional was that, without it, a server could initialize the partial_success property, set the error_message, and leave the accepted_<signal> field at the default value, which is 0. In this situation, it would not be possible to differentiate a 0 that is the default/unset value from the server actually dropping everything.

@jmacd had some other alternatives to this here: #390 (comment)

Member

IMO, we should:

  • get rid of the optional for accepted_<signal> field
  • require that the accepted_<signal> field value is always set (if partial_success is set). There is no such thing as "unknown" in the response. If it is unknown then the server needs to choose something (0 or equal to request count are both fine choices). The sender doesn't care about this "unknown" state since it can't do anything about it and thus it is unnecessary to signal the "unknown" state in the protocol.

Member Author

require that the accepted_<signal> field value is always set (if partial_success is set)

Yes, I agree, and that's what I wrote in the spec now. Although we still have the issue where a sender receives a partial_success with something in the error message and 0 for accepted_<signal>, and it then must interpret it as "nothing was accepted" - which is clear to me, but could be open to other interpretations, since 0 is also the not-set/default value.

Member

I still see the optional qualifier on the accepted_log_records field. Do you plan to remove it, or do you disagree with my opinion that we don't need it?

Member Author

@joaopgrassi joaopgrassi Jul 13, 2022

@tigrannajaryan I didn't remove the optional anywhere for now. I wanted to see what others think before changing it, especially since there is this "nuance" of the default/not-set value being the same thing. I know from talking with @jmacd that he had concerns about this double meaning. I think I agree with you that if the spec says to always populate it, then getting a 0 means nothing was accepted, so I'm in favor of removing it.

Contributor

IMO, we should: get rid of the optional for the accepted_<signal> field; require that the accepted_<signal> field value is always set (if partial_success is set). There is no such thing as ...

The problem is that "partial success is set" cannot be answered because without optional there is no such thing as field presence for messages. If the server responds with a partial_success message having 0 for the accepted_things field and an empty message, the receiver will see an unset partial_success. It should be possible to meaningfully indicate that 0 points were accepted, and we can't rely on field presence without optional. Otherwise, we need to infer that partial_success was "set" because it has a non-default field.

My goal behind supporting this PR is to get to a place where clients have exact information about how many points are being successfully sent vs. dropped. There should be no ambiguity. Moreover, we should not have to fill in a struct when 100% of points are accepted, because it creates an extra burden on the sender. In order to have a meaningful zero and gain the performance benefit without optional, we need to change this PR again to count "dropped" things instead of "accepted" things.

I support removing optional, but all the "accepted_" fields should revert to "dropped_" fields. The 0 value will always be meaningful, then.
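For illustration, a sketch of that shape for logs (no optional; using the rejected_ naming that comes up below, though dropped_ would work the same way):

message ExportLogsPartialSuccess {
  // No optional: 0 is always meaningful and means nothing was rejected.
  int64 rejected_log_records = 1;
  string error_message = 2;
}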

Member Author

@joaopgrassi joaopgrassi Jul 14, 2022

@jmacd, I can do the change and submit a new PR if that's what everyone agrees on, no problem.

But I wonder whether we will have the same dilemma again. If the server responds with a partial_success with rejected_<signal> = 0 and error_message = "", what does that mean? Nothing was dropped? Then why did the server even initialize the partial_success, which it shouldn't? We can argue that we will write in the spec that this shouldn't happen, but isn't that what is already written, just the other way around?

It should be possible to meaningfully indicate that 0 points were accepted

The current spec change says that if everything was accepted, partial_success should not be set, so I feel it's already enough: if you get a partial_success and accepted_<signal> is 0, then nothing was accepted. Sure, with optional there's more certainty, but without it I feel we are also covered, as @tigrannajaryan mentioned.

Moreover, we can't enforce what servers will do, only recommend what they should do. I feel that with what we have written in the spec, we already have it covered.

Contributor

If the server responds with a partial_success with rejected_<signal> = 0 and error_message = "", what does that mean? Nothing was dropped? Then why did the server even initialize the partial_success, which it shouldn't?

There is effectively no test that distinguishes "server initialized the partial_success" vs not when the default-value is used, short of using optional.

Without optional, there are two choices:

  • We can either infer that the number-value is meaningful because the string-value is not empty, or
  • We can declare that the number-value is always meaningful.

We cannot differentiate a field that is set to the default value from a field that was not set. To your question, when the partial success field is not set, as it will not be for older OTLP versions, I will assume nothing was dropped. I do not believe there is any other valid interpretation, since we have not yet specified any form of partial success. Until we specify partial success, you have all-or-none semantics.

Member

To me this is reasonable semantics:

  1. If partial_success is not set, then it means the request was processed fully successfully. This also happens to be the old behavior, so there are no changes in the semantics, which is good - we are backward compatible.
  2. If partial_success is set, then it means something went wrong. The accepted_<signal> field inside the Export*PartialSuccess message MUST be set. 0 is a valid value for the accepted_<signal> field and means all records failed to process. error_message MUST also be set to indicate what the problem was. An empty string happens to be a valid value, meaning that there is no additional error text to provide; however, the entire situation is still erroneous.

I think this is all we need, it does not need to be more complicated than this.

How this is used in the Server:

  1. If the Server processed everything successfully then don't set partial_success.
  2. If processing of any record failed, then set partial_success and calculate the accepted_<signal> field value as the total records in the request minus the records that failed to process. Again, 0 is a perfectly fine value to end up with. Also set error_message to indicate the failure reason (it can be as simple as "some metric data points are invalid" or more complex if desirable, e.g. one line per failed record with the precise failure reason).

How this is used in the Client:

  1. If partial_success in the response is not set, good, nothing else to do.
  2. If partial_success in the response is set, then print an error message in the log and include error_message. Also log the accepted_<signal> value. A metric based on the accepted_<signal> value can also be kept if desirable (see the sketch below).

I may be missing something here. Perhaps there is a scenario that is not covered here, where the Server or the Client needs to make a different decision and behave differently?
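For illustration, a minimal sender-side sketch of these semantics, assuming C# names generated from this PR with the field renamed to partial_success (e.g. PartialSuccess, AcceptedLogRecords):

var client = new LogsService.LogsServiceClient(channel);
var response = client.Export(new ExportLogsServiceRequest());

if (response.PartialSuccess == null)
{
	// Fully successful: nothing else to do.
}
else
{
	// Something went wrong: log the error message and the accepted count.
	Console.WriteLine("Accepted log records: {0}", response.PartialSuccess.AcceptedLogRecords);
	Console.WriteLine("Error message: {0}", response.PartialSuccess.ErrorMessage);
	// Optionally feed AcceptedLogRecords into a self-monitoring metric here.
}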

@tigrannajaryan tigrannajaryan self-requested a review July 12, 2022 14:28
Contributor

@jmacd jmacd left a comment

As discussed here, this PR needs to change based on feedback. I think we should close this PR, make the proposed changes to remove optional and ensure 0 values are meaningful in the specification PR open-telemetry/opentelemetry-specification#2636.

@jmacd
Contributor

jmacd commented Jul 13, 2022

The review thread for this PR is too long to read effectively. Please close and re-open it after more progress in open-telemetry/opentelemetry-specification#2636.
