Make it possible to indicate partial success in an OTLP export response #390
Conversation
Thank you @joaopgrassi, this feature is sorely needed. I have two related requests, which I'll ask for in connection with these protocol changes to see what you think.

1. Structured error details

Lightstep's Prometheus sidecar converts Prometheus WAL logs into OTLP, and Lightstep has fairly strict metric type checking, which results in partial successes. Because we had no way to convey these, I used gRPC/HTTP response headers. The Lightstep backend produces a sample of the metric names (max 1 per request) that are dropping data as a result; this works because the headers contain structured information, e.g. a header like:
The recipient of these headers turns them into metrics about failed metrics, which, aggregated over time, probabilistically contain all the failing metrics even though each response is limited in size. The feedback is that a single, unstructured error details string is possibly less useful than it could be. On the other hand, OTel has a data model for error logs: could this be a structured error message in the form of an OTLP log message returned by the server? Can we have semantic conventions for conveying sample errors that support aggregation as described above?

2. Client-side telemetry: consumed/dropped metrics for OTLP exporters

It is expensive to print the error details to the console, especially if they're repetitive. This leaves me wondering what the user is otherwise expected to do with the error details here. Assuming the SDK supports logging, it's likely that the user is simply going to drop the error details back in a payload to the same service or a connected backend service. I would like us to discourage OTel users from echoing error details from server-to-client-to-server again, meaning: OTLP consumers SHOULD consider delivering the `error_message` to the user of their system directly, and OTLP producers SHOULD NOT echo error details back through a default-configured logs SDK for the same process. If the error details log is failing itself, we do not want to create a cycle, but practically speaking that is all that we can expect from an unstructured message.

Users who monitor their OTLP exporters want to know how many spans/metrics/logs are being written and how many are being dropped. Partial successes indicate how many points are dropped when the request succeeds, but we need semantic conventions for counting points that are dropped for all reasons, which include total failures too. Again referring to the OTel Prometheus sidecar linked above as an example, I think we should declare metrics with names like the sidecar's.
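The sampling-plus-aggregation idea described above can be sketched in Go. Everything here (type names, metric names, the shape of the decoded header) is hypothetical and not part of any OTLP message or Lightstep API; it only illustrates how bounded per-response samples converge on the full set of failing metrics:

```go
package main

import "fmt"

// rejectedSample is a hypothetical decoded response header: at most one
// failing metric name sampled per export request.
type rejectedSample struct {
	MetricName string
	Points     int64
}

// failureCounter aggregates the per-response samples. Because each response
// carries a (possibly different) sampled name, the map probabilistically
// converges on the full set of failing metrics over many requests, even
// though each individual response is bounded in size.
type failureCounter map[string]int64

func (c failureCounter) observe(s rejectedSample) {
	c[s.MetricName] += s.Points
}

func main() {
	c := failureCounter{}
	for _, s := range []rejectedSample{
		{MetricName: "http.latency", Points: 3},
		{MetricName: "cpu.usage", Points: 1},
		{MetricName: "http.latency", Points: 2},
	} {
		c.observe(s)
	}
	fmt.Println(c["http.latency"], c["cpu.usage"]) // 5 1
}
```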
Do ExportLogsPartialSuccess, ExportMetricsPartialSuccess and ExportTracesPartialSuccess need to be 3 separate messages, or can we have one ExportPartialSuccess message? Do we anticipate that these 3 may need to evolve in different ways in the future?
Can the client actually act on that information? And if it can, is conveying it to the client in a response, such that it gets logged to standard out, the best way to convey the information? As an alternative, the backend itself could track metrics about rejected data and alert the user to it in the UI.
👍 OTel Java already records some experimental metrics about the counts of spans, metrics, and logs successfully/unsuccessfully exported via OTLP. We also have similar metrics about batch span/log processor successes and failures.
Disagree, since that implies that instrumentation has some say over the temporality. Could do something tricky like use an up-down counter instead of a counter, though 😁. This is a bit off topic, though.
@jmacd thanks for the review! I will try replying to your points:
That's a good point; it didn't cross my mind. The initial discussions in the issue led us to think of rather simplistic cases. Having the exporter log the error message to stdout helps to quickly address things without having to go to my vendor UI and browse for logs, but sure, it's probably not a good practice and not a behavior we should encourage.
My view is that back-ends probably should also keep track of these metrics, since they are the ones actually doing the "parsing" of OTLP data, in whatever ways they do, but we are not the ones to tell them what to do, I guess. About introducing these "self-monitoring" metrics to OTLP exporters: I think it makes sense, and we probably should do that (most likely not as part of this, though). But I have one concern:
If we say exporters produce self-monitoring metrics using whatever is returned in the partial success response…

About the topic of having a more structured response and not simply strings: this was briefly touched on in the issue, and I agree with you that it would make sense, at least for metrics. For example, for metrics we could have something like:

```proto
message ExportMetricsPartialSuccess {
  // The number of accepted data points
  int64 accepted_data_points = 1;

  // A developer-facing human-readable error message in English. It should
  // both explain the error and offer an actionable resolution to it.
  string error_message = 2;

  repeated RejectedMetric rejected_datapoints = 3;
}

message RejectedMetric {
  // Human-readable reason/message about why it was rejected
  string error_message = 1;
  string metric_key = 2;
  uint32 index = 3;
  // ...
}
```

That would work, I think: since we have the metric key + the index of the data point, it's easy to pinpoint/identify it, and it's also easy for users since they probably know their metric keys right away. But I'm not sure how we could achieve the same for spans and logs. How do we identify a span/log then? In a myriad of them, even an index might not be enough.
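To illustrate how a client could use the sketched `RejectedMetric` message, here is a minimal Go example. The struct is a hand-written stand-in for generated protobuf code, and the grouping helper is hypothetical:

```go
package main

import "fmt"

// RejectedMetric mirrors the proto message sketched above (hand-written
// stand-in, not generated code): the metric key plus the index of the
// rejected data point within that metric.
type RejectedMetric struct {
	ErrorMessage string
	MetricKey    string
	Index        uint32
}

// countByMetric is a hypothetical helper: grouping rejections by metric key
// lets a user who knows their metric names pinpoint what was dropped.
func countByMetric(rejected []RejectedMetric) map[string]int {
	counts := make(map[string]int)
	for _, r := range rejected {
		counts[r.MetricKey]++
	}
	return counts
}

func main() {
	rejected := []RejectedMetric{
		{ErrorMessage: "value out of range", MetricKey: "cpu.usage", Index: 7},
		{ErrorMessage: "value out of range", MetricKey: "cpu.usage", Index: 9},
		{ErrorMessage: "invalid type", MetricKey: "http.duration", Index: 0},
	}
	fmt.Println(countByMetric(rejected)) // map[cpu.usage:2 http.duration:1]
}
```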
From the thread here around having more structure, I'm leaning more toward having separate types. It makes things more obvious, and if we later need different things for each signal, we are prepared. We already had different types anyway, so it makes sense to me :)
This looks good to me as the first step to introduce partial success capabilities. I think we can accept it and adopt it in OTLP implementations as is.
If we see the need for more detailed description of partial success it can be added as backward-compatible change in the future.
I pushed a commit (4bd818f) that changed the property name. In regards to the rest of the comments/reviews, I summarized the general intention/requirements as:
To move forward and address all points I suggest:

- Points 1 and 2: go forward with this PR as is; modify the spec about partial success, adding recommendations of what senders/receivers should do with this new property (how servers should populate it, and how exporters should use it).
- Point 3: create an issue to track discussion about the structure, collect the suggestions brought up here and brainstorm, potentially agreeing on changes to the proto and sending a follow-up PR for it.
- Point 4: create an issue to track discussion about such self-monitoring metrics, potentially agreeing on which we want/need and sending a follow-up PR against the spec for it.

If everyone agrees, I will go and create these issues and kick-start the discussion.
@tigrannajaryan asked:
I like this idea. Can we have one type instead of three?
I'm not sure, to be honest. For the fields I have in this PR now, it makes sense to have only one type to rule them all. But if we decide to include more structure later, the types might differ by signal, and I wonder if there wouldn't be any problems splitting them up afterwards. I guess the only way then to avoid breaking existing code at compile time would be to always keep the "old" single type and introduce new ones. That way the field numbers are kept, and the new types can have other fields, but both should work in old/new receivers. Does that make sense?
@joaopgrassi How do you feel about replacing the error message string with a structured log record? I have in mind using the OTel log data model for it.
@jmacd hum.. I'm not sure. If the goal is to have structure and be able to achieve something with it, isn't relying on a generic log record too loose? Just to be clear, I'm not against having more structure, and I think all your use cases are valid. I'm just not sure if we have enough knowledge as of now to be able to define such structure, so in this part I agree with @tigrannajaryan that maybe a simple first version is enough. I might be wrong as well. I just think that if we really want more structure, introducing proper types with well-defined properties makes more sense. I think this makes the work of deriving whatever we want from the partial success response easier, instead of having to check in a map of things. WDYT?
If we return a simple string, I think we're not helping the user with their telemetry problem -- the error message just turns one telemetry problem into a new telemetry problem, which has to be reported as an exception for a human to parse. For well-known error conditions that we decide are worthwhile to handle, this does mean parsing the data to access what's inside. However, for a recipient that does not care to perform special checks (or an older version), the structured message is just a simple message. Having structure lets us define semantic conventions later while still conveying error messages now; it's backwards compatible.
I guess the point is that while I'm asking for structure, the data is still just an error message, and the processing I'm imagining is to improve how telemetry-induced error messages are displayed. I would support using new fields for new kinds of information--not error messages--but the message field is just to explain the partial failure--structured or not. |
@jmacd yes, I completely agree with you on this. As I mentioned earlier, I'm all in for a well-structured response rather than a simple string. What I'm not sure about (maybe I'm not getting it) is whether a log record gives us that. That's why I'm wondering whether we shouldn't be considering a type with well-defined properties that will allow senders to derive whatever we decide is important/relevant, e.g. your example with self-monitoring metrics. Also:
Yes, this is true for both approaches we are discussing.
I was being vague because either approach is OK for me, and I'm trying to point out that this return value is in every respect just a log message. There are optional fields in the LogRecord that I do not imagine being useful for a server-to-client response, this is true. How about using severity to return info (e.g., "info: welcome, your quota is 1000 points/hour") or warnings without dropping any points ("warning: floating-point-to-integer conversion is happening")? If the exporter is being traced itself, there is a well-defined span/trace associated with the request, and it could be used in the error message. The timestamp could tell you that the server has a lot of clock skew relative to the client; this is important if your error message is that points are not being accepted due to out-of-range timestamps.
The same is true for a bare string called "error_message": it can hold anything, and it can be printed. I think you are reasoning that the client-side logic for, say, reporting which metrics are experiencing errors will be simpler if we spell it out in dedicated fields. Either way, I don't want the structure of the error message to block this work. I care about monitoring the number of dropped metric points much more than about the client being able to identify which metric names are being dropped. (After all, we are able to report which metric names are being dropped directly through our SaaS; clients don't need to track this detail.)
Hum, I think I'm a bit confused by this. The server/receiver is the one that will ingest the data and produce the partial success response.
Yes. I personally prefer to have explicitly defined fields in the response message.
The way I see it, old clients/senders cannot use anything we are defining here anyway. Both approaches (explicit structure vs. a string error message with conventions) require adding new fields to the response messages.
Yes, agreed. That is possible with the current proposal, as we can derive the number of rejected data points, spans, and logs. As mentioned, I think that is a sensible first step, and we can discuss the structure in separate issues, as I suggested.
@jmacd @bogdandrutu @reyang @jack-berg @tigrannajaryan et al., pinging to see if we have reached a consensus here. My understanding of the current state of the discussion is: the current PR is a good start and improves the current status quo because:
But it is not where we want to stay because:
As I suggested above, we could go with this as is, as it's already a great improvement over the status quo. Then I could create separate issues to discuss what more complex scenarios we want to handle per signal, and from there add a more complex structure to the messages introduced here. What do you all think?
The PR looks good to me. I already approved. The plan for the future also sounds good to me. |
I don't care for the error details field, but the partial success counts are important.
I created an issue to continue the discussions from here around more structure and next steps for metrics partial success responses: #404.
Ping @bogdandrutu - waiting for one more review from you ;)
Looks good! Let's get open-telemetry/opentelemetry-specification#2636 approved and merged first.
```proto
// The number of accepted log records
//
// If the request is partially successful and the number of
// accepted log records is known, the server MUST set ...
```
What is the situation when the number of accepted log records is unknown, and why is it useful for the sender to know about this situation specifically? If there is no such situation, I suggest that we get rid of this and require that the server always set this number (if it doesn't know, then set it either to 0 or to the total number; either way is fine). Doing so will eliminate the need to have an optional `accepted_log_records` (optional fields are poorly supported in some Protobuf libraries and are a pain to deal with).
> What is the situation when the number of accepted log records is unknown and why is it useful for the sender to know about this situation specifically?

I also tend to think this shouldn't happen. If the server doesn't know the number of accepted "things", it should return an empty response, without populating the `partial_success` field, just as it is done today.

The discussion around `optional` was: without it, a server could initialize the `partial_success` property, set the `error_message`, and leave the `accepted_<signal>` field with its default value, which is `0`. In this situation, it would not be possible to differentiate a `0` that is the default/unset value from the server actually dropping everything.

@jmacd had some other alternatives to this here: #390 (comment)
IMO, we should:

- get rid of the `optional` for the `accepted_<signal>` field
- require that the `accepted_<signal>` field value is always set (if `partial_success` is set). There is no such thing as "unknown" in the response. If it is unknown, then the server needs to choose something (0 or equal to the request count are both fine choices). The sender doesn't care about this "unknown" state since it can't do anything about it, and thus it is unnecessary to signal the "unknown" state in the protocol.
> require that the `accepted_<signal>` field value is always set (if `partial_success` is set)

Yes, I agree, and that's what I wrote in the spec now. Although we still have the issue where a sender receives a `partial_success` with something in the error message and `0` for `accepted_<signal>`, and it then must interpret it as "nothing was accepted", which is clear to me but could be open to other interpretations, since `0` is also the not-set/default value.
I still see the `optional` qualifier for the `accepted_log_records` field. Do you plan to remove it, or do you disagree with my opinion that we don't need it?
@tigrannajaryan I didn't remove the `optional` for now. I wanted to see what others think before changing it, especially since there is this "nuance" with the default/not-set value being the same thing. I know from talking with @jmacd that he had concerns with this double meaning. I think I agree with you that if the spec says to always populate it, then if you get a `0` it means nothing was accepted, so I'm in favor of removing it.
> IMO, we should: get rid of the optional for the accepted_<signal> field; require that the accepted_<signal> field value is always set (if partial_success is set). There is no such thing as ...

The problem is that "partial_success is set" cannot be answered, because without `optional` there is no such thing as field presence for messages. If the server responds with a `partial_success` message having 0 for the accepted-things field and an empty error message, the receiver will see an unset `partial_success`. It should be possible to meaningfully indicate that 0 points were accepted, and we can't rely on field presence without `optional`. Otherwise, we need to infer that `partial_success` was "set" because it has a non-default field.

My goal behind supporting this PR is to get to a place where clients have exact information about how many points are being successfully sent vs. dropped. There should be no ambiguity. Moreover, we should not have to fill in a struct when 100% of points are accepted, because it creates an extra burden on the sender. In order to have a meaningful zero and gain the performance benefit without `optional`, we need to change this PR again to count "dropped" things instead of "accepted" things.

I support removing `optional`, but all the "accepted_" fields should then revert back to "dropped_" fields. The 0 value will always be meaningful then.
@jmacd, I can do the change and submit a new PR if that's what everyone agrees on, no problem.

But I wonder if we will not have the same dilemma again. If the server responds with a `partial_success` with `rejected_<signal>` = `0` and `error_message` = `""`, what does that mean? Nothing was dropped? Then why did the server even initialize the `partial_success`, which it shouldn't? Then we can argue that we will write in the spec that this shouldn't happen, but isn't that what is already written, just the other way around?

> It should be possible to meaningfully indicate that 0 points were accepted

The current spec change says that if everything was accepted, `partial_success` should not be set, so I feel it's already enough: if you get a `partial_success` and `accepted_<signal>` is `0`, then nothing was accepted. Sure, with `optional` there's more certainty, but without it I feel we are also covered, like @tigrannajaryan mentioned.

Moreover, we can't enforce what servers will do, only recommend what they should do. I feel that with what we have written in the spec, we already have it covered.
> If the server responds with a partial_success with rejected_<signal> = 0 and error_message = "" what does that mean? Nothing was dropped? Then why did the server even initialize the partial_success, which it shouldn't?

There is effectively no test that distinguishes "server initialized the partial_success" vs. not when the default value is used, short of using `optional`.

Without `optional`, there are two choices:

- We can infer that the number value is meaningful because the string value is not empty.
- We can declare that the number value is always meaningful.

We cannot differentiate a field that is set to the default value from a field that was not set. To your question: when the partial success field is not set, as it will not be for older OTLP versions, I will assume nothing dropped. I do not believe there is any other valid interpretation, since we have not yet specified any form of partial success. Until we specify partial success, you have all-or-none semantics.
To me this is reasonable semantics:

- If `partial_success` is not set, then it means the request was processed fully successfully. This also happens to be the old behavior, so no change in semantics, which is good: we are backward compatible.
- If `partial_success` is set, then it means something went wrong. The `accepted_<signal>` field inside the `Export*PartialSuccess` message MUST be set. 0 is a valid value for the `accepted_<signal>` field and means all records failed to process. `error_message` MUST also be set to indicate what the problem was. An empty string happens to be a valid value, meaning that there is no additional error text to provide; however, the entire situation is still erroneous.

I think this is all we need; it does not need to be more complicated than this.
How this is used in the Server:

- If the Server processed everything successfully, then don't set `partial_success`.
- If processing of any record failed, then set `partial_success` and calculate the `accepted_<signal>` field value as `total records in the request` - `failed to process records`. Again, 0 is a perfectly fine value to end up with. Also set `error_message` to indicate the failure reason (it can be as simple as "some metric data points are invalid", or more complex if desirable, e.g. one line per failed record with a precise failure reason).
How this is used in the Client:

- If `partial_success` in the response is not set: good, nothing else to do.
- If `partial_success` in the response is set, then print an error message in the log and include `error_message`. Also log the `accepted_<signal>` value. Can also keep a metric based on the `accepted_<signal>` value if desirable.
I may be missing something here. Perhaps there is a scenario that is not covered here where the Server or the Client needs to make a different decision and behave differently?
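The server- and client-side rules above can be sketched in Go. The types here are plain structs standing in for the generated protobuf messages (a nil pointer models an unset `partial_success` field), so this is an illustration of the proposed semantics under those assumptions, not real OTLP code:

```go
package main

import (
	"fmt"
	"log"
)

// ExportPartialSuccess stands in for the generated Export*PartialSuccess
// message. A nil *ExportPartialSuccess models an unset partial_success field.
type ExportPartialSuccess struct {
	AcceptedRecords int64  // accepted_<signal>: MUST be set when the message is set
	ErrorMessage    string // error_message: empty string is allowed
}

// buildResponse implements the server-side rule: leave partial_success unset
// when everything succeeded; otherwise accepted = total - failed, where 0 is
// a perfectly valid (and meaningful) value.
func buildResponse(total, failed int64) *ExportPartialSuccess {
	if failed == 0 {
		return nil
	}
	return &ExportPartialSuccess{
		AcceptedRecords: total - failed,
		ErrorMessage:    "some records are invalid",
	}
}

// handleResponse implements the client-side rule: unset means full success;
// set means log error_message and the accepted count.
func handleResponse(ps *ExportPartialSuccess) bool {
	if ps == nil {
		return false // fully successful, nothing else to do
	}
	log.Printf("partial success: accepted=%d records: %s",
		ps.AcceptedRecords, ps.ErrorMessage)
	return true
}

func main() {
	fmt.Println(handleResponse(buildResponse(100, 0)))   // false: all accepted
	fmt.Println(handleResponse(buildResponse(100, 100))) // true: accepted=0, all dropped
}
```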
As discussed here, this PR needs to change based on feedback. I think we should close this PR, make the proposed changes to remove `optional`, and ensure 0 values are meaningful in the specification PR open-telemetry/opentelemetry-specification#2636.

The review thread for this PR is too long to effectively read. Please close and re-open after more progress in open-telemetry/opentelemetry-specification#2636.
This PR is a first attempt at solving/addressing open-telemetry/opentelemetry-specification#2454.
At first, I tried having "top-level" properties under the `Export*ServiceResponse` types.

But in my view, this has a drawback. OTLP exporters/senders using the new protos would have to always check the explicit presence of the new fields (e.g. `HasAcceptedDatapoints`), since they don't know which proto version the server is using. Also, since there's more than one field, would it need to check both? What if we add new fields later?

Then I thought having a new type that combines these two new fields is a better compromise. This way, a sender only needs to check if the new field is not null. If it's not null, we can have some guarantee that the other fields are there (we indicate it with a MUST in the spec, for example).
I did some tests with sample apps to confirm the cases for backwards compatibility (code here: joaopgrassi/otlp-partial-success-experimental).

1. Old exporters working with receivers/servers using the new protos

All works, as the exporters were not expecting the new fields. They will not have the fields, so they can't inform the user of a partial success: status quo.

2. New exporters working with receivers/servers using the old protos

Exporters MUST check that the `Details` field is not null to be able to log a message with the details for troubleshooting.

3. New exporters working with receivers/servers using the new protos

Since exporters don't know which proto version servers are sending back, they always have to perform the check from point 2 above.
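Cases 2 and 3 both boil down to a presence check on the new message field before reading its contents. A minimal Go sketch, with hand-written stand-ins for the generated types (the type and field names here are illustrative, not the exact generated names):

```go
package main

import "fmt"

// LogsPartialSuccess and ExportLogsServiceResponse are hand-written stand-ins
// for the generated types. In generated Go protobuf code a message field is a
// pointer, so a server on the old protos (which never sends the field) yields
// nil here.
type LogsPartialSuccess struct {
	AcceptedLogRecords int64
	ErrorMessage       string
}

type ExportLogsServiceResponse struct {
	PartialSuccess *LogsPartialSuccess
}

// describe performs the presence check an exporter must do before reading the
// new fields, so the same code works against both old and new servers.
func describe(resp *ExportLogsServiceResponse) string {
	if resp.PartialSuccess == nil {
		// Old server or full success: status quo, nothing to report.
		return "no partial success info"
	}
	return fmt.Sprintf("accepted %d log records: %s",
		resp.PartialSuccess.AcceptedLogRecords,
		resp.PartialSuccess.ErrorMessage)
}

func main() {
	oldServer := &ExportLogsServiceResponse{}
	newServer := &ExportLogsServiceResponse{
		PartialSuccess: &LogsPartialSuccess{
			AcceptedLogRecords: 90,
			ErrorMessage:       "10 records invalid",
		},
	}
	fmt.Println(describe(oldServer)) // no partial success info
	fmt.Println(describe(newServer)) // accepted 90 log records: 10 records invalid
}
```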