
Add gRPC query service with OTLP model #76

Merged · 12 commits · Jun 25, 2021

Conversation

pavolloffay
Member

Signed-off-by: Pavol Loffay [email protected]

Related to jaegertracing/jaeger#169 (comment)

@pavolloffay
Member Author

@jpkrohling @yurishkuro @joe-elliott could you please review?

@joe-elliott
Member

Is there a reason to provide a completely new set of calls? Instead we could:

  • standardize/support the existing HTTP API
  • add Accept header support for OTLP proto/json to get trace by id/find traces

option java_package = "io.jaegertracing.api_v3";

message GetTraceRequest {
bytes trace_id = 1;
Member

For the REST/JSON API, which representation of the trace ID should we support? Jaeger's base16 or OTEL's base64?

Contributor

If we expect folks to build tools that can ingest our JSONs as if they were OTLP, we should follow their representation.

Member Author

Good point. I expect people should be able to copy an OTEL trace ID (e.g. from logs) and query it directly in Jaeger.

Member Author

@yurishkuro where did you find OTEL uses base64? The spec mentions hex encoding. The logging exporters use hex as well.

Member

Does OTEL have a spec for JSON format? If that format is rendered from proto, then it will be base64 for bytes

Member

Well, to be honest, this is the reason why I gave up back in the day when trying to make a JSON API backed by proto IDL. I thought OTEL found a solution, but the change in the spec is a total cop out - "it's std proto-JSON except for this field" (which makes std proto-JSON unusable). You mentioned they had prototypes in other languages, how did they solve that?

I am inclined to just support all kinds of formats for IDs in the inputs, i.e. you should be able to paste both base64 and hex ID into the UI. But that doesn't answer what format we return in proto-JSON, and my preference would be to stick with the standard proto-JSON for that, meaning returning base64.

Member Author

You mentioned they had prototypes in other languages, how did they solve that?

Custom codecs; the difficulty depends on the language, so it's not ideal for consumers.

Member Author

Also, if we go with the streaming API for get trace(s), the JSONPb codec will not work OOTB - see #76 (comment). The returned object is wrapped in another object.

Member Author

Here is the upstream issue for reference: grpc-ecosystem/grpc-gateway#1254 (comment). It's apparently not fixed in v2 (I have asked on their Slack).
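For reference, a minimal sketch (not part of this PR) of what that wrapping looks like to an HTTP consumer and how the envelope can be peeled off; the payload shown inside `result` is a made-up placeholder, not the exact field layout.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// grpc-gateway emits server-streaming responses as newline-delimited
	// JSON messages, each wrapped in a "result" envelope.
	line := []byte(`{"result": {"resourceSpans": []}}`)

	var envelope struct {
		Result json.RawMessage `json:"result"`
	}
	if err := json.Unmarshal(line, &envelope); err != nil {
		panic(err)
	}

	// envelope.Result now holds the actual chunk payload.
	fmt.Println(string(envelope.Result))
}
```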

Member Author

I have switched the implementation to use base64 for embedded IDs and to keep using hex for queries.

@pavolloffay
Member Author

Is there a reason to provide a completely new set of calls? Instead we could:

standardize/support the existing HTTP API
add Accept header support for OTLP proto/json to get trace by id/find traces

This is a great question! Technically doable (maybe a bit messier) and provides the same features. One downside is that people will keep using the old API because the new one is "hidden" behind the Accept header.

@joe-elliott
Member

One downside is that people will keep using the old API because the new one is "hidden" behind the Accept header.

This is true. Naturally we could document the list of Accept values we support, but I don't disagree it would be harder to find. Is the point of the gRPC service to be service-to-service? Do we expect to migrate the web frontend to use this API (through some kind of HTTP proxy)? Apologies, but I may have missed some discussion about the goals.

Signed-off-by: Pavol Loffay <[email protected]>
}

message SpansResponseChunk {
repeated opentelemetry.proto.trace.v1.ResourceSpans resource_spans = 1;
Contributor

what is a resource (maybe this)?

is it possible for jaeger-query to return spans from more than one resource? If so, for my learning, what are some examples?

Member

It's a bit tricky to construct this from Jaeger spans because we denormalize the Process into each individual span. We do have some re-assembly logic when we return spans to the UI, but it's probably also valid to just return (resource, span) pairs as denormalized.
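To make the re-assembly idea concrete, here is a rough Go sketch using hypothetical, simplified types (the real Jaeger model.Span and OTLP types are richer): spans whose denormalized Process attributes are identical get grouped under one resource.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Span is a hypothetical, simplified stand-in for the Jaeger span model,
// where the Process attributes are denormalized into every span.
type Span struct {
	OperationName string
	Process       map[string]string
}

// resourceKey builds a stable key from the denormalized Process attributes
// so that spans with identical processes fall into the same group.
func resourceKey(process map[string]string) string {
	keys := make([]string, 0, len(process))
	for k := range process {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%s;", k, process[k])
	}
	return b.String()
}

// groupByResource re-assembles denormalized spans into per-resource groups,
// roughly what constructing OTLP ResourceSpans would require.
func groupByResource(spans []Span) map[string][]Span {
	groups := make(map[string][]Span)
	for _, s := range spans {
		key := resourceKey(s.Process)
		groups[key] = append(groups[key], s)
	}
	return groups
}

func main() {
	spans := []Span{
		{OperationName: "GET /api", Process: map[string]string{"service.name": "frontend"}},
		{OperationName: "SELECT", Process: map[string]string{"service.name": "db"}},
		{OperationName: "GET /api/2", Process: map[string]string{"service.name": "frontend"}},
	}
	for key, group := range groupByResource(spans) {
		fmt.Printf("%s -> %d span(s)\n", key, len(group))
	}
}
```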

Signed-off-by: Pavol Loffay <[email protected]>
@pavolloffay
Member Author

Added comments and renamed tags to attributes and search depth to num_traces.

Signed-off-by: Pavol Loffay <[email protected]>
rpc GetTrace(GetTraceRequest) returns (stream SpansResponseChunk) {}

// GetTraces searches for traces.
rpc GetTraces(FindTracesRequest) returns (stream SpansResponseChunk) {}
Member Author

@yurishkuro do you remember why streaming was used? An alternative would be to return a list of chunks. Also, we could rename the chunk to Trace, which would be more idiomatic.

One issue with streaming and grpc-gateway is that it wraps the response into a result object, e.g. {result: {resource_spans: ...}}, see https://github.com/jaegertracing/jaeger/pull/3086/files#diff-1429f7cc5a76981a44799039e43d3bc7372808373b9c2b97a333c7dcf650b00aR72

Member

Streaming - because large result sets are difficult to transmit as one response.

Returning chunks manually means implementing some kind of pagination API, which would require support in the storage.
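For context, here is a rough sketch of how a Go client would drain such a streaming call. The import path and generated names (api_v3.QueryServiceClient, GetTraceRequest, SpansResponseChunk with a GetResourceSpans accessor) are assumptions for illustration only, not the generated code from this PR.

```go
package example

import (
	"context"
	"io"
	"log"

	// Hypothetical import path for the generated api_v3 stubs; the real
	// module path depends on where the generated code ends up.
	api_v3 "github.com/jaegertracing/jaeger-idl/proto-gen/api_v3"
)

// fetchTrace drains a server-streaming GetTrace call chunk by chunk, so the
// server never has to build the whole (possibly huge) trace as one response.
func fetchTrace(ctx context.Context, client api_v3.QueryServiceClient, traceID string) error {
	stream, err := client.GetTrace(ctx, &api_v3.GetTraceRequest{TraceId: traceID})
	if err != nil {
		return err
	}
	for {
		chunk, err := stream.Recv()
		if err == io.EOF {
			return nil // server closed the stream; all chunks received
		}
		if err != nil {
			return err
		}
		// Each chunk carries a batch of ResourceSpans to accumulate or process.
		log.Printf("received %d ResourceSpans", len(chunk.GetResourceSpans()))
	}
}
```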

Member Author

One way or another, to fully support streaming the storage API would have to change anyway.

Member

The storage API already has FindTraceIDs, which allows the query service to load traces one by one.

Signed-off-by: Pavol Loffay <[email protected]>
@pavolloffay
Member Author

@yurishkuro @joe-elliott @albertteoh

Is there any blocker for this PR? I would like to move it forward. To sum up, the query API exposes OTLP traces. The REST API is done via grpc-gateway with base64 encoding for embedded ids. The ids in the query parameters are hex encoded.

OTEL mandates hex encoding for IDs in JSON; however, that requires a custom codec which is not easy to implement, hence the result would be hard to consume. To keep compatibility with the JSONPb codec we will expose IDs in base64 (alternatively we could have a flag/param to set the encoding).
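To make the encoding difference concrete, a minimal sketch with a made-up trace ID: the hex form is what query parameters accept, and the base64 form is what the standard JSONPb codec emits for embedded bytes IDs.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

func main() {
	// A made-up 16-byte trace ID in the hex form used in query parameters.
	hexID := "463ac35c9f6413ad48485a3953bb6124"

	raw, err := hex.DecodeString(hexID)
	if err != nil {
		panic(err)
	}

	// Standard proto-JSON renders bytes fields as base64, so this is how the
	// same ID appears when embedded in the JSONPb response body.
	b64ID := base64.StdEncoding.EncodeToString(raw)

	fmt.Println("hex:   ", hexID) // 463ac35c9f6413ad48485a3953bb6124
	fmt.Println("base64:", b64ID) // RjrDXJ9kE61ISFo5U7thJA==
}
```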

string trace_id = 1;
}

// A single response chunk holds a single trace.
Member

In the v2 API this description would be inaccurate - a chunk is neither a full trace nor spans from a single trace. A chunk is just a mechanism for delivering a large number of spans in smaller batches.

Member Author

Does this hold in practice?

To my knowledge, a chunk at the moment always represents a single trace (a trace known at the query time).

Member Author

I have slightly changed the comment

message TraceQueryParameters {
string service_name = 1;
string operation_name = 2;
map<string, string> attributes = 3;
Member

Should we clarify which attributes these are supposed to match? We've been pretty loose about it in the original API, leaving the interpretation to the storage. I.e. should all these attributes match on a single span, or could they match across spans? Do they match span attributes only or span logs as well?

Member Author

The match should be done on tags and process tags. ES supports matching on logs as well when Kibana support is enabled (flat schema).

Member Author

In ES they must match on a single span. How is it done in Cassandra? I think all storages should follow this.

Member

I don't think this is true in Cassandra, because it takes the tag=value string and looks up an index that just gives trace IDs. If more than one tag is provided, it could easily match on different spans.

Member Author

I have added a comment on the TraceQueryParameters that clarifies that some storage implementations might deviate.

Alright so C* does deviate as well. What was the original design for attributes? Match any attributes within trace or in a single span?

Contributor

Another way to see this: what would you expect as a user? If you specify two pairs of attributes, would you expect them to exist throughout the trace, or for all attributes to exist as part of the same span? I'm not quite sure I have an answer here... Here's one case advocating for attributes to exist throughout the trace (see the sketch after this list):

  • root span has "userID=123", no other spans contain this
  • span (not root span) has error=true
  • SRE is looking for traces for user 123 with errors in it
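For illustration, here is a small sketch (with made-up, simplified types) contrasting the two interpretations on exactly this scenario: matching all attributes on a single span fails, while matching across the whole trace succeeds.

```go
package main

import "fmt"

// spanAttrs is a hypothetical simplification: a span reduced to its attributes.
type spanAttrs map[string]string

// matchOnSingleSpan returns true if some single span carries all query attributes.
func matchOnSingleSpan(trace []spanAttrs, query map[string]string) bool {
	for _, span := range trace {
		matches := true
		for k, v := range query {
			if span[k] != v {
				matches = false
				break
			}
		}
		if matches {
			return true
		}
	}
	return false
}

// matchAcrossTrace returns true if every query attribute is found somewhere
// in the trace, possibly on different spans.
func matchAcrossTrace(trace []spanAttrs, query map[string]string) bool {
	for k, v := range query {
		found := false
		for _, span := range trace {
			if span[k] == v {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	trace := []spanAttrs{
		{"userID": "123"}, // root span
		{"error": "true"}, // a child span that failed
	}
	query := map[string]string{"userID": "123", "error": "true"}

	fmt.Println(matchOnSingleSpan(trace, query)) // false
	fmt.Println(matchAcrossTrace(trace, query))  // true
}
```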

Contributor

If you want to create an issue to address this, this PR can be merged as is.

google.protobuf.Timestamp start_time_max = 5;
google.protobuf.Duration duration_min = 6;
google.protobuf.Duration duration_max = 7;
int32 num_traces = 8;
Member

Do we want to keep this as num_traces or use a more vague term like "search depth"? Because Cassandra storage does not guarantee num_traces in the response, which was often a source of confusion.

Member Author

I would prefer to use num_traces; it is what the query API uses. If C* (or any other storage) does not implement it, we should document that.

The goal is to make it clear for users what the parameter means.

Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
@pavolloffay
Member Author

PR updated

string trace_id = 1;
}

// A single response chunk holds spans from a single trace.
Member

To my knowledge, a chunk at the moment always represents a single trace (a trace known at the query time).

I think this is not a matter of how it is currently implemented, but what we want to guarantee. The intention of the original API was to NOT have this guarantee, i.e. the service is allowed to mix spans from different traces in a single chunk.

Member Author

I think we should guarantee it for these reasons:

  • the current API does not mix spans from different traces
  • the API that does not mix spans from different traces is easier to consume (e.g. for typical use-cases that we know)

Member

I still don't like this guarantee in the API. The streaming & chunking API was primarily introduced for efficiency, but we're taking away the server's ability to optimize its response. You could be loading a ton of small traces, e.g. 2 spans each, so this guarantee in the API would force the server to send tiny chunks, which is going to be suboptimal. On the other hand, there is a max chunk size in the server, so there is always a possibility that a large trace will be split across several chunks, which largely takes away your "easier to consume" reason #2.

We can always introduce this guarantee later, but removing it would be backwards incompatible.

Member Author

In the 5 years of the project's history we haven't used the feature of streaming chunks with mixed traces. I don't think any other DT system behaves this way.

You could be loading a ton of small traces, e.g. 2 spans each, so this guarantee in the API would force the server to send tiny chunks, which is going to be suboptimal.

This is how Jaeger works right now and we haven't seen any complaints/issues/use-cases asking to change it.

Member Author

We can always introduce this guarantee later, but removing it would be backwards incompatible.

It would be a breaking change one way or another. The current consumers do not expect that spans in chunks are mixed (e.g. the UI does not).

Member Author

I don't agree with this, per my comments above, but I have removed the guarantee of not mixing spans in one chunk per Yuri's request.

Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
@pavolloffay
Member Author

@yurishkuro are there any blockers on your side for this PR?

The most debatable point in the PR is the definition of TraceQueryParameters. I have added a comment:

// Note that some storage implementations do not guarantee the correct implementation of all parameters.
