Add gRPC query service with OTLP model #76
Conversation
Signed-off-by: Pavol Loffay <[email protected]>
@jpkrohling @yurishkuro @joe-elliott could you please review?
Is there a reason to provide a completely new set of calls? Instead we could:
proto/api_v3/query_service.proto (outdated)
option java_package = "io.jaegertracing.api_v3";

message GetTraceRequest {
  bytes trace_id = 1;
For the REST/JSON API, which representation of the trace ID should we support: Jaeger's base16 or OTEL's base64?
If we expect folks to build tools that can ingest our JSONs as if they were OTLP, we should follow their representation.
Good point. I expect people should be able to copy an OTEL trace ID (e.g. from logs) and query it directly in Jaeger.
@yurishkuro where did you find OTEL uses base64? The spec mentions hex encoding. The logging exporters use hex as well.
Does OTEL have a spec for the JSON format? If that format is rendered from proto, then it will be base64 for bytes fields.
Well, to be honest, this is the reason I gave up back in the day when trying to make a JSON API backed by proto IDL. I thought OTEL had found a solution, but the change in the spec is a total cop-out - "it's standard proto-JSON except for this field" (which makes standard proto-JSON unusable). You mentioned they had prototypes in other languages; how did they solve that?
I am inclined to just support all kinds of formats for IDs in the inputs, i.e. you should be able to paste both base64 and hex IDs into the UI. But that doesn't answer what format we return in proto-JSON, and my preference would be to stick with standard proto-JSON there, meaning returning base64.
You mentioned they had prototypes in other languages; how did they solve that?
Custom codecs; the difficulty depends on the language, so it's not ideal for consumers.
Also, if we go with the streaming API for getting trace(s), the JSONPb codec will not work out of the box - see #76 (comment). The returned object is wrapped into another object.
Here is the upstream issue for reference: grpc-ecosystem/grpc-gateway#1254 (comment). It's apparently not fixed in v2 (I have asked on their Slack).
I have switched the implementation to use base64 for embedded IDs and kept hex for query parameters.
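For illustration, a minimal Go sketch of what that split looks like for one and the same 16-byte trace ID. The ID here is an arbitrary example; the point is only that standard proto-JSON renders `bytes` fields as base64, while Jaeger URLs and query parameters use hex:

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

func main() {
	// A 16-byte trace ID as it appears in Jaeger URLs and query parameters (hex).
	const hexID = "5b8aa5a2d2c872e8321cf37308d69df2"

	raw, err := hex.DecodeString(hexID)
	if err != nil {
		panic(err)
	}

	// Standard proto-JSON encodes bytes fields as base64, so the same ID
	// embedded in a JSONPb response looks like this.
	b64ID := base64.StdEncoding.EncodeToString(raw)

	fmt.Println("hex:   ", hexID) // 5b8aa5a2d2c872e8321cf37308d69df2
	fmt.Println("base64:", b64ID) // W4qlotLIcugyHPNzCNad8g==
}
```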
This is a great question! Technically doable (maybe a bit messier) and it provides the same features. One downside is that people will keep using the old API because the new one is "hidden" behind the Accept header.
This is true. Naturally we could document the list of
Signed-off-by: Pavol Loffay <[email protected]>
}

message SpansResponseChunk {
  repeated opentelemetry.proto.trace.v1.ResourceSpans resource_spans = 1;
What is a resource (maybe this)?
Is it possible for jaeger-query to return spans from more than one resource? If so, for my learning, what are some examples?
OTEL Resource is similar to the Jaeger Process object, see https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/resource/v1/resource.proto#L27.
Here the returned object is a list of https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/trace/v1/trace.proto#L28.
It's a bit tricky to construct this from Jaeger spans because we denormalize the Process into each individual span. We do have some re-assembly logic when we return spans to the UI, but it's probably also valid to just return (resource, span) pairs as denormalized.
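A rough sketch of that re-assembly step, using simplified stand-in types rather than the real Jaeger model or the generated OTLP structs (the types and the grouping key are illustrative assumptions):

```go
package main

import "fmt"

// Simplified stand-ins for the real Jaeger and OTLP types, used only to
// illustrate grouping denormalized spans back into per-resource groups.
type Process struct {
	ServiceName string
	Tags        map[string]string
}

type Span struct {
	OperationName string
	Process       Process // denormalized: every span carries its own copy
}

type ResourceSpans struct {
	Resource Process
	Spans    []Span
}

// groupByResource re-assembles denormalized spans into one group per distinct
// process, mirroring what a Jaeger-to-OTLP translation has to do.
func groupByResource(spans []Span) []ResourceSpans {
	index := map[string]int{} // process key -> position in result
	var result []ResourceSpans
	for _, s := range spans {
		key := fmt.Sprintf("%s|%v", s.Process.ServiceName, s.Process.Tags)
		i, ok := index[key]
		if !ok {
			i = len(result)
			index[key] = i
			result = append(result, ResourceSpans{Resource: s.Process})
		}
		result[i].Spans = append(result[i].Spans, s)
	}
	return result
}

func main() {
	spans := []Span{
		{OperationName: "GET /api", Process: Process{ServiceName: "frontend"}},
		{OperationName: "SELECT", Process: Process{ServiceName: "db"}},
		{OperationName: "GET /api/2", Process: Process{ServiceName: "frontend"}},
	}
	for _, rs := range groupByResource(spans) {
		fmt.Println(rs.Resource.ServiceName, len(rs.Spans), "span(s)")
	}
}
```

Returning denormalized (resource, span) pairs would skip this grouping entirely, at the cost of repeating the resource for every span.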
Signed-off-by: Pavol Loffay <[email protected]>
Added comments, renamed tags to attributes, and renamed search depth to num_traces.
Signed-off-by: Pavol Loffay <[email protected]>
proto/api_v3/query_service.proto (outdated)
rpc GetTrace(GetTraceRequest) returns (stream SpansResponseChunk) {}

// GetTraces searches for traces.
rpc GetTraces(FindTracesRequest) returns (stream SpansResponseChunk) {}
@yurishkuro do you remember why streaming was used? An alternative would be to return a list of chunks. Also, we could rename the chunk to Trace, which is more idiomatic.
One issue with streaming and grpc-gateway is that it wraps the response into a result object, e.g. {result: {resource_spans: ...}}, see https://github.com/jaegertracing/jaeger/pull/3086/files#diff-1429f7cc5a76981a44799039e43d3bc7372808373b9c2b97a333c7dcf650b00aR72
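For reference, a small client-side Go sketch of unwrapping that envelope from a streamed grpc-gateway response. The envelope shape is taken from the example above; the chunk payload is abbreviated and not a real OTLP trace:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// streamEnvelope models the wrapper grpc-gateway puts around each streamed
// message, i.e. {"result": {...}} as shown above.
type streamEnvelope struct {
	Result json.RawMessage `json:"result"`
}

func main() {
	// Stand-in for the HTTP response body of a streamed call: one JSON
	// object per chunk, each wrapped in the "result" envelope.
	body := `{"result": {"resource_spans": []}}
{"result": {"resource_spans": []}}`

	dec := json.NewDecoder(strings.NewReader(body))
	for {
		var env streamEnvelope
		if err := dec.Decode(&env); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
		// env.Result now holds the actual SpansResponseChunk payload.
		fmt.Println("chunk:", string(env.Result))
	}
}
```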
Streaming - because large result sets are difficult to transmit as one response.
Returning chunks manually means implementing some kind of pagination API, which would require support in the storage.
One way or another, to fully support streaming, the storage API would have to change anyway.
The storage API already has FindTraceIDs, which allows the query service to load traces one by one.
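A sketch of how that could look on the server side, with hypothetical, simplified interfaces standing in for the storage API and the gRPC stream (the FindTraceIDs/GetTrace signatures here are illustrative, not the real ones):

```go
package main

import (
	"context"
	"fmt"
)

// Simplified stand-ins: only the control flow matters. FindTraceIDs lets the
// query service load and stream traces one at a time instead of buffering
// the entire result set.
type TraceID string

type SpanReader interface {
	FindTraceIDs(ctx context.Context, query string) ([]TraceID, error)
	GetTrace(ctx context.Context, id TraceID) ([]string, error) // spans, simplified to strings
}

type ChunkSender interface {
	Send(spans []string) error
}

func streamTraces(ctx context.Context, reader SpanReader, query string, stream ChunkSender) error {
	ids, err := reader.FindTraceIDs(ctx, query)
	if err != nil {
		return err
	}
	for _, id := range ids {
		spans, err := reader.GetTrace(ctx, id)
		if err != nil {
			return fmt.Errorf("loading trace %s: %w", id, err)
		}
		// One chunk per trace here; a server could batch or split differently.
		if err := stream.Send(spans); err != nil {
			return err
		}
	}
	return nil
}

// In-memory fakes so the sketch runs standalone.
type fakeReader map[TraceID][]string

func (f fakeReader) FindTraceIDs(ctx context.Context, query string) ([]TraceID, error) {
	ids := make([]TraceID, 0, len(f))
	for id := range f {
		ids = append(ids, id)
	}
	return ids, nil
}

func (f fakeReader) GetTrace(ctx context.Context, id TraceID) ([]string, error) {
	return f[id], nil
}

type printSender struct{}

func (printSender) Send(spans []string) error {
	fmt.Println("chunk:", spans)
	return nil
}

func main() {
	reader := fakeReader{"abc": {"span-1", "span-2"}, "def": {"span-3"}}
	_ = streamTraces(context.Background(), reader, "service=frontend", printSender{})
}
```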
Signed-off-by: Pavol Loffay <[email protected]>
@yurishkuro @joe-elliott @albertteoh Is there any blocker for this PR? I would like to move it forward. To sum up: the query API exposes OTLP traces. The REST API is done via grpc-gateway, with base64 encoding for embedded IDs; the IDs in query parameters are hex-encoded. OTEL mandates hex encoding for IDs in JSON, however that requires a custom codec which is not easy to implement, hence the result would be hard to consume. To keep compatibility with the JSONPb codec we will expose IDs in base64 (alternatively, we could add a flag/param to set the encoding).
proto/api_v3/query_service.proto (outdated)
  string trace_id = 1;
}

// A single response chunk holds a single trace.
In the v2 API this description would be inaccurate - a chunk is neither a full trace nor spans from a single trace. A chunk is just a mechanism for delivering a large number of spans in smaller batches.
Does this hold in practice?
To my knowledge, a chunk at the moment always represents a single trace (a trace known at query time).
I have slightly changed the comment.
message TraceQueryParameters {
  string service_name = 1;
  string operation_name = 2;
  map<string, string> attributes = 3;
Should we clarify which attributes these are supposed to match? We've been pretty loose about it in the original API, leaving the interpretation to the storage. I.e. should all these attributes match on a single span, or could they match across spans? Do they match span attributes only or span logs as well?
The match should be done on tags and process tags. ES supports matching on logs as well when Kibana support is enabled (flat schema).
In ES they must match on a single span. How is it done in Cassandra? I think all storages should follow this.
I don't think this is true in Cassandra, because it takes the tag=value string and looks up an index that just gives trace IDs. If more than one tag is provided, it could easily match on different spans.
I have added a comment on TraceQueryParameters that clarifies that some storage implementations might deviate.
Alright, so C* does deviate as well. What was the original design for attributes: match any attributes within the trace, or within a single span?
Another way to see this: what would you expect as a user? If you specify two pairs of attributes, would you expect them to exist throughout the trace, or for all attributes to exist as part of the same span? I'm not quite sure I have an answer here... Here's one case advocating for attributes to exist throughout the trace (see the sketch after this list):
- the root span has "userID=123", and no other spans contain this
- a span (not the root span) has error=true
- an SRE is looking for traces for user 123 with errors in them
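To make that scenario concrete, here is a sketch constructing such a query against a hypothetical Go struct that mirrors the TraceQueryParameters message above (the real type would be generated from the api_v3 proto); whether both attributes must match within one span or anywhere in the trace is exactly the open question:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical mirror of the TraceQueryParameters proto message, for
// illustration only.
type TraceQueryParameters struct {
	ServiceName   string
	OperationName string
	Attributes    map[string]string
	StartTimeMax  time.Time
	DurationMin   time.Duration
	DurationMax   time.Duration
	NumTraces     int32
}

func main() {
	query := TraceQueryParameters{
		ServiceName: "frontend",
		// Whether both attributes must match a single span or may match
		// anywhere in the trace is left to the storage implementation
		// (ES matches within one span, Cassandra may match across spans).
		Attributes: map[string]string{
			"userID": "123",
			"error":  "true",
		},
		StartTimeMax: time.Now(),
		DurationMax:  5 * time.Second,
		NumTraces:    20,
	}
	fmt.Printf("%+v\n", query)
}
```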
If you want to create an issue to address this, this PR can be merged as is.
  google.protobuf.Timestamp start_time_max = 5;
  google.protobuf.Duration duration_min = 6;
  google.protobuf.Duration duration_max = 7;
  int32 num_traces = 8;
Do we want to keep this as num_traces, or use a vaguer term like "search depth"? Cassandra storage does not guarantee num_traces in the response, which was often a source of confusion.
I would prefer to use num_traces; it is what the query API uses. If C* (or any other storage) does not implement it, we should document that.
The goal is to make it clear to users what the parameter means.
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
PR updated.
proto/api_v3/query_service.proto (outdated)
  string trace_id = 1;
}

// A single response chunk holds spans from a single trace.
To my knowledge, a chunk at the moment always represents a single trace (a trace known at query time).
I think this is not a matter of how it is currently implemented, but what we want to guarantee. The intention of the original API was to NOT have this guarantee, i.e. the service is allowed to mix spans from different traces in a single chunk.
I think we should guarantee it for these reasons:
- the current API does not mix spans from different traces
- the API that does not mix spans from different traces is easier to consume (e.g. for typical use-cases that we know)
I still don't like this guarantee in the API. The streaming & chunking API was primarily introduced for efficiency, but we're taking away the server's ability to optimize its response. You could be loading a ton of small traces, e.g. 2 spans each, so this guarantee in the API would force the server to send tiny chunks, which is going to be suboptimal. On the other hand, there is a max chunk size in the server, so there is always a possibility that a large trace will be split across several chunks, which largely takes away your "easier to consume" reason #2.
We can always introduce this guarantee later, but removing it would be backwards incompatible.
In the 5 years of the project's history we haven't used the ability to stream chunks with mixed traces. I don't think any other distributed tracing system behaves this way.
You could be loading a ton of small traces, e.g. 2 spans each, so this guarantee in the API would force the server to send tiny chunks, which is going to be suboptimal.
This is how Jaeger works right now, and we haven't seen any complaints, issues, or use cases calling for a change.
We can always introduce this guarantee later, but removing it would be backwards incompatible.
It would be a breaking change one way or another. The current consumers do not expect spans in chunks to be mixed (e.g. the UI does not).
I don't agree with this, per my comments above, but I have removed the guarantee of not mixing spans in one chunk, per Yuri's request.
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
@yurishkuro are there any blockers on your side for this PR? The most contentious point in the PR is the definition of
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
Related to jaegertracing/jaeger#169 (comment)