
Add gRPC query service with OTLP model #76

Merged · 12 commits · Jun 25, 2021

Conversation

pavolloffay
Member

Signed-off-by: Pavol Loffay [email protected]

Related to jaegertracing/jaeger#169 (comment)

@pavolloffay
Member Author

@jpkrohling @yurishkuro @joe-elliott could you please review?

@joe-elliott
Member

Is there a reason to provide a completely new set of calls? Instead we could:

  • standardize/support the existing HTTP API
  • add Accept header support for OTLP proto/json to get trace by id/find traces

option java_package = "io.jaegertracing.api_v3";

message GetTraceRequest {
bytes trace_id = 1;
Member

For the REST/JSON API, which representation of the trace ID should we support? Jaeger's base16 or OTEL's base64?

Contributor

If we expect folks to build tools that can ingest our JSONs as if they were OTLP, we should follow their representation.

Member Author

Good point. I expect people should be able to copy an OTEL trace ID (e.g. from logs) and query it directly in Jaeger.

Member Author

@yurishkuro where did you find OTEL uses base64? The spec mentions hex encoding. The logging exporters use hex as well.

Member

Does OTEL have a spec for JSON format? If that format is rendered from proto, then it will be base64 for bytes

Member

Well, to be honest, this is the reason why I gave up back in the day when trying to make a JSON API backed by proto IDL. I thought OTEL found a solution, but the change in the spec is a total cop out - "it's std proto-JSON except for this field" (which makes std proto-JSON unusable). You mentioned they had prototypes in other languages, how did they solve that?

I am inclined to just support all kinds of formats for IDs in the inputs, i.e. you should be able to paste both base64 and hex ID into the UI. But that doesn't answer what format we return in proto-JSON, and my preference would be to stick with the standard proto-JSON for that, meaning returning base64.

Member Author

You mentioned they had prototypes in other languages, how did they solve that?

Custom codecs; the difficulty depends on the language, so it's not ideal for consumers.

Member Author

Also, if we go with the streaming API for get trace(s), the JSONPb codec will not work OOTB - see #76 (comment). The returned object is wrapped in another object.

Member Author

Here is the upstream issue for reference: grpc-ecosystem/grpc-gateway#1254 (comment). It's apparently not fixed in v2 (I have asked on their Slack).
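For reference, a minimal sketch (not part of this PR) of what that wrapping looks like to an HTTP consumer and how the envelope can be peeled off; the payload shown inside `result` is a made-up placeholder, not the exact field layout.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// grpc-gateway emits server-streaming responses as newline-delimited
	// JSON messages, each wrapped in a "result" envelope.
	line := []byte(`{"result": {"resourceSpans": []}}`)

	var envelope struct {
		Result json.RawMessage `json:"result"`
	}
	if err := json.Unmarshal(line, &envelope); err != nil {
		panic(err)
	}

	// envelope.Result now holds the actual chunk payload.
	fmt.Println(string(envelope.Result))
}
```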

Member Author

I have switched the implementation to use base64 for embedded IDs and to keep using hex for queries.

@pavolloffay
Member Author

Is there a reason to provide a completely new set of calls? Instead we could:

standardize/support the existing HTTP API
add Accept header support for OTLP proto/json to get trace by id/find traces

This is a great question! Technically doable (maybe a bit messier) and provides the same features. One downside is that people will keep using the old API because the new one is "hidden" behind the Accept header.

@joe-elliott
Member

One downside is that people will keep using the old API because the new one is "hidden" behind the Accept header.

This is true. Naturally we could document the list of Accept values we support, but I don't disagree it would be harder to find. Is the point of the gRPC service to be service-to-service? Do we expect to migrate the web frontend to use this API (through some kind of HTTP proxy)? Apologies, but I may have missed some discussion about the goals.

Signed-off-by: Pavol Loffay <[email protected]>
}

message SpansResponseChunk {
repeated opentelemetry.proto.trace.v1.ResourceSpans resource_spans = 1;
Contributor

what is a resource (maybe this)?

is it possible for jaeger-query to return spans from more than one resource? If so, for my learning, what are some examples?

Member

It's a bit tricky to construct this from Jaeger spans because we denormalize the Process into each individual span. We do have some re-assembly logic when we return spans to the UI, but it's probably also valid to just return (resource, span) pairs as denormalized.
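To make the re-assembly idea concrete, here is a rough Go sketch using hypothetical, simplified types (the real Jaeger model.Span and OTLP types are richer): spans whose denormalized Process attributes are identical get grouped under one resource.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Span is a hypothetical, simplified stand-in for the Jaeger span model,
// where the Process attributes are denormalized into every span.
type Span struct {
	OperationName string
	Process       map[string]string
}

// resourceKey builds a stable key from the denormalized Process attributes
// so that spans with identical processes fall into the same group.
func resourceKey(process map[string]string) string {
	keys := make([]string, 0, len(process))
	for k := range process {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%s;", k, process[k])
	}
	return b.String()
}

// groupByResource re-assembles denormalized spans into per-resource groups,
// roughly what constructing OTLP ResourceSpans would require.
func groupByResource(spans []Span) map[string][]Span {
	groups := make(map[string][]Span)
	for _, s := range spans {
		key := resourceKey(s.Process)
		groups[key] = append(groups[key], s)
	}
	return groups
}

func main() {
	spans := []Span{
		{OperationName: "GET /api", Process: map[string]string{"service.name": "frontend"}},
		{OperationName: "SELECT", Process: map[string]string{"service.name": "db"}},
		{OperationName: "GET /api/2", Process: map[string]string{"service.name": "frontend"}},
	}
	for key, group := range groupByResource(spans) {
		fmt.Printf("%s -> %d span(s)\n", key, len(group))
	}
}
```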

Signed-off-by: Pavol Loffay <[email protected]>
@pavolloffay
Member Author

Added comments and renamed tags to attributes and search depth to num_traces.

Signed-off-by: Pavol Loffay <[email protected]>
rpc GetTrace(GetTraceRequest) returns (stream SpansResponseChunk) {}

// GetTraces searches for traces.
rpc GetTraces(FindTracesRequest) returns (stream SpansResponseChunk) {}
Member Author

@yurishkuro do you remember why streaming was used? An alternative would be to return a list of chunks. Also, we could rename the chunk to Trace, which would be more idiomatic.

One issue with streaming and grpc-gateway is that it wraps the response into a result object, e.g. {result: {resource_spans: ...}}, see https://github.com/jaegertracing/jaeger/pull/3086/files#diff-1429f7cc5a76981a44799039e43d3bc7372808373b9c2b97a333c7dcf650b00aR72

Member

Streaming - because large result sets are difficult to transmit as one response.

Returning chunks manually means implementing some kind of pagination API, which would require support in the storage.
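For context, here is a rough sketch of how a Go client would drain such a streaming call. The import path and generated names (api_v3.QueryServiceClient, GetTraceRequest, SpansResponseChunk with a GetResourceSpans accessor) are assumptions for illustration only, not the generated code from this PR.

```go
package example

import (
	"context"
	"io"
	"log"

	// Hypothetical import path for the generated api_v3 stubs; the real
	// module path depends on where the generated code ends up.
	api_v3 "github.com/jaegertracing/jaeger-idl/proto-gen/api_v3"
)

// fetchTrace drains a server-streaming GetTrace call chunk by chunk, so the
// server never has to build the whole (possibly huge) trace as one response.
func fetchTrace(ctx context.Context, client api_v3.QueryServiceClient, traceID string) error {
	stream, err := client.GetTrace(ctx, &api_v3.GetTraceRequest{TraceId: traceID})
	if err != nil {
		return err
	}
	for {
		chunk, err := stream.Recv()
		if err == io.EOF {
			return nil // server closed the stream; all chunks received
		}
		if err != nil {
			return err
		}
		// Each chunk carries a batch of ResourceSpans to accumulate or process.
		log.Printf("received %d ResourceSpans", len(chunk.GetResourceSpans()))
	}
}
```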

Member Author

One way or another, to fully support streaming the storage API would have to change anyway.

Member

The storage API already has FindTraceIDs, which allows the query service to load traces one by one.

Signed-off-by: Pavol Loffay <[email protected]>
@pavolloffay
Member Author

@yurishkuro @joe-elliott @albertteoh

Is there any blocker for this PR? I would like to move it forward. To sum up, the query API exposes OTLP traces. The REST API is done via grpc-gateway with base64 encoding for embedded ids. The ids in the query parameters are hex encoded.

OTEL mandates hex encoding for IDs in JSON; however, that requires a custom codec which is not easy to implement, hence the result would be hard to consume. To keep compatibility with the JSONPb codec we will expose IDs in base64 (alternatively we could have a flag/param to set the encoding).
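To make the encoding difference concrete, a minimal sketch with a made-up trace ID: the hex form is what query parameters accept, and the base64 form is what the standard JSONPb codec emits for embedded bytes IDs.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

func main() {
	// A made-up 16-byte trace ID in the hex form used in query parameters.
	hexID := "463ac35c9f6413ad48485a3953bb6124"

	raw, err := hex.DecodeString(hexID)
	if err != nil {
		panic(err)
	}

	// Standard proto-JSON renders bytes fields as base64, so this is how the
	// same ID appears when embedded in the JSONPb response body.
	b64ID := base64.StdEncoding.EncodeToString(raw)

	fmt.Println("hex:   ", hexID) // 463ac35c9f6413ad48485a3953bb6124
	fmt.Println("base64:", b64ID) // RjrDXJ9kE61ISFo5U7thJA==
}
```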

string trace_id = 1;
}

// A single response chunk holds a single trace.
Member

In the v2 API this description would be inaccurate - a chunk is neither a full trace nor spans from a single trace. A chunk is just a mechanism for delivering a large number of spans in smaller batches.

Member Author

Does this hold in practice?

To my knowledge, a chunk at the moment always represents a single trace (a trace known at the query time).

Member Author

I have slightly changed the comment

message TraceQueryParameters {
string service_name = 1;
string operation_name = 2;
map<string, string> attributes = 3;
Member

Should we clarify which attributes these are supposed to match? We've been pretty loose about it in the original API, leaving the interpretation to the storage. I.e. should all these attributes match on a single span, or could they match across spans? Do they match span attributes only or span logs as well?

Member Author

The match should be done on tags and process tags. ES supports matching on logs as well when Kibana support is enabled (flat schema).

Member Author

In ES they must match on a single span. How is it done in Cassandra? I think all storages should follow this.

Member

I don't think this is true in Cassandra, because it takes the tag=value string and looks up an index that just gives trace IDs. If more than one tag is provided, it could easily match on different spans.

Member Author

I have added a comment on the TraceQueryParameters that clarifies that some storage implementations might deviate.

Alright so C* does deviate as well. What was the original design for attributes? Match any attributes within trace or in a single span?

Contributor

Another way to see this: what would you expect as a user? If you specify two pairs of attributes, would you expect them to exist throughout the trace, or for all attributes to exist as part of the same span? I'm not quite sure I have an answer here... Here's one case advocating for attributes to exist throughout the trace (see the sketch after this list):

  • root span has "userID=123", no other spans contain this
  • span (not root span) has error=true
  • SRE is looking for traces for user 123 with errors in it
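For illustration, here is a small sketch (with made-up, simplified types) contrasting the two interpretations on exactly this scenario: matching all attributes on a single span fails, while matching across the whole trace succeeds.

```go
package main

import "fmt"

// spanAttrs is a hypothetical simplification: a span reduced to its attributes.
type spanAttrs map[string]string

// matchOnSingleSpan returns true if some single span carries all query attributes.
func matchOnSingleSpan(trace []spanAttrs, query map[string]string) bool {
	for _, span := range trace {
		matches := true
		for k, v := range query {
			if span[k] != v {
				matches = false
				break
			}
		}
		if matches {
			return true
		}
	}
	return false
}

// matchAcrossTrace returns true if every query attribute is found somewhere
// in the trace, possibly on different spans.
func matchAcrossTrace(trace []spanAttrs, query map[string]string) bool {
	for k, v := range query {
		found := false
		for _, span := range trace {
			if span[k] == v {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	trace := []spanAttrs{
		{"userID": "123"}, // root span
		{"error": "true"}, // a child span that failed
	}
	query := map[string]string{"userID": "123", "error": "true"}

	fmt.Println(matchOnSingleSpan(trace, query)) // false
	fmt.Println(matchAcrossTrace(trace, query))  // true
}
```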

Contributor

If you want to create an issue to address this, this PR can be merged as is.

google.protobuf.Timestamp start_time_max = 5;
google.protobuf.Duration duration_min = 6;
google.protobuf.Duration duration_max = 7;
int32 num_traces = 8;
Member

Do we want to keep this as num_traces or use a more vague term like "search depth"? Because Cassandra storage does not guarantee num_traces in the response, which was often a source of confusion.

Member Author

I would prefer to use num_traces; it is what the query API uses. If C* (or any other storage) does not implement it, we should document that.

The goal is to make it clear for users what the parameter means.

Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
@pavolloffay
Member Author

PR updated

string trace_id = 1;
}

// A single response chunk holds spans from a single trace.
Member

To my knowledge, a chunk at the moment always represents a single trace (a trace known at the query time).

I think this is not a matter of how it is currently implemented, but what we want to guarantee. The intention of the original API was to NOT have this guarantee, i.e. the service is allowed to mix spans from different traces in a single chunk.

Member Author

I think we should guarantee it for these reasons:

  • the current API does not mix spans from different traces
  • the API that does not mix spans from different traces is easier to consume (e.g. for typical use-cases that we know)

Member

I still don't like this guarantee in the API. The streaming & chunking API was primarily introduced for efficiency, but we're taking away the server's ability to optimize its response. You could be loading a ton of small traces, e.g. 2 spans each, so this guarantee in the API would force the server to send tiny chunks, which is going to be suboptimal. On the other hand, there is a max chunk size in the server, so there is always a possibility that a large trace will be split across several chunks, which largely takes away your "easier to consume" reason #2.

We can always introduce this guarantee later, but removing it would be backwards incompatible.

Member Author

In the 5 years of the project's history we haven't used the feature of streaming chunks with mixed traces. I don't think any other DT system behaves this way.

You could be loading a ton of small traces, e.g. 2 spans each, so this guarantee in the API would force the server to send tiny chunks, which is going to be suboptimal.

This is how Jaeger works right now and we haven't seen any complaints/issues/use-cases asking to change it.

Member Author

We can always introduce this guarantee later, but removing it would be backwards incompatible.

It would be a breaking change one way or another. The current consumers do not expect that spans in chunks are mixed (e.g. the UI does not).

Member Author

I don't agree with this, per my comments above, but I have removed the guarantee of not mixing spans in one chunk per Yuri's request.

Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
Signed-off-by: Pavol Loffay <[email protected]>
@pavolloffay
Member Author

@yurishkuro are there any blockers on your side for this PR?

The most debatable point in the PR is the definition of TraceQueryParameters. I have added a comment:

// Note that some storage implementations do not guarantee the correct implementation of all parameters.
