RFD 65 Distributed Tracing #11713
Conversation
This is looking really good. I've found tracing to be an incredibly useful tool in working through performance issues at scale, and it starts moving us towards the OTel ecosystem, which I believe will only grow in popularity as it matures over the next year or two.
Whilst there will be plenty of refactoring needed to implement this (e.g. properly propagating context across Teleport's codebase), most of this work is long overdue anyway.
> ### Non-Goals
>
> * Adding tracing to the entire codebase all at once
> * Replace existing logging, metrics
As a side note on this, I largely agree with avoiding changing anything re: metrics. My recent experiences with OTel tracing have been extremely positive, but the Metrics API is still incredibly immature and making a lot of breaking changes, and I've also hit a few issues/bugs with their Metrics SDK. We can probably safely revisit the metrics side of OTel in six to twelve months' time.
> In order to propagate spans that originated from `tctl`, `tsh`, or `tbot` we have two options:
>
> 1) Export the spans directly into the telemetry backend
> 2) Make the Auth Server a span forwarder
I strongly agree with pursuing the second option here. Exporting spans directly to the telemetry backend raises a lot of issues around preventing abuse and authentication, especially if we want to use this within Teleport cloud.
LGTM. I don't know how feasible it would be, but I think it would be very nice to have some mechanism to export a trace locally without setting up a cluster-level export destination. I don't know if there is a commonly agreed-upon file format that traces could be exported to, but even just being able to run an export destination locally in Docker when running something like `tsh ls --trace` would be nice. It would be especially useful if we're trying to assist people in debugging an issue when they don't currently have a cluster-level tracing endpoint enabled.
The underlying OTel SDK we use does provide a so-called stdout exporter that can write spans locally. The other option would be to let the user directly provide an address for a gRPC OTLP TraceService, and to use that instead of the auth server. This would give the most flexibility, as the user could then configure an OpenTelemetry Collector instance to export these spans to some preferred tool (e.g. Jaeger). Whilst neither of these is that complex to implement, I am somewhat dubious whether it'll add much value for a user. In most cases, the value of tracing comes from looking at the span across the distributed system.
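For reference, the Collector route discussed here is mostly a matter of configuration on the user's side. A minimal OpenTelemetry Collector config that accepts spans over gRPC OTLP and forwards them to a local Jaeger instance might look something like the sketch below; the exact endpoints and the availability of the `jaeger` exporter depend on the Collector distribution and version, so treat the values as assumptions:

```yaml
# Hypothetical OpenTelemetry Collector configuration:
# receive OTLP over gRPC and forward spans to a local Jaeger collector.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  jaeger:
    endpoint: localhost:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger]
```

With something like this running, the client would only need to be pointed at the Collector's OTLP endpoint rather than at the telemetry backend itself.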
Writing traces to a file has come up a few times, and as @strideynet pointed out, it is definitely possible to write OTLP spans to a file in JSON format. However, the file would only contain the spans that are generated by the client. Jaeger actually has built-in functionality to allow you to upload such files, and it will display the traces as if they were exported there directly. I haven't tested Jaeger to see if it is smart enough to associate spans uploaded from files with spans exported directly to it. Even if that were the case, I can't say whether other telemetry backends have similar behavior. There is also a high probability that most users won't have access to the telemetry backend that the cluster exports to. Perhaps just capturing the client-side spans would be enough. It would be easy enough to add flags to `tsh` to support this.
A couple of things that were discussed with @gozer and @rosstimothy in light of gravitational/cloud#1765:
```yaml
# the url of the exporter to send spans to - can be http/https/grpc
exporter_url: "http://localhost:14268/api/traces"
```
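For context, the setting under discussion lives in the cluster-level tracing configuration of the Teleport config file. A sketch of what a gRPC export setup might look like follows; the block and field names track the RFD, so they are assumptions that may differ from what ultimately ships:

```yaml
# Hypothetical teleport.yaml fragment enabling cluster-level tracing
# with a gRPC OTLP exporter (e.g. an OpenTelemetry Collector).
tracing_service:
  enabled: yes
  # the url of the exporter to send spans to - can be http/https/grpc
  exporter_url: "grpc://localhost:4317"
```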
Having a way for `teleport` itself to use the auth tracing forwarder might be useful.
I'm a fan of using OpenTelemetry as it allows for easy export of spans for analysis. Do we want to have someone from the Cloud team take a look, given that we'd need to instrument this on Teleport Cloud?
Edit: Cloud has already reviewed this and staging is already ingesting spans.
It doesn't look like @gozer has approved this RFD, but I know he has looked it over in the past. We have been working together to get Cloud to ingest spans, which is already happening in staging. In fact, it was his suggestion to tag the forwarded spans so Cloud could potentially omit some spans from their ingestion pipeline.
It didn't exactly feel appropriate for me to approve it, but I certainly could look it over again and do so.
I sure have
It will also be running in production this week, in theory.
Yes, I remember discussing two distinct possibilities and factoring them into Cloud's design. One was to allow differentiation of client-initiated traces.

As an aside, if the plan is to support sending trace data to arbitrary third-party destinations, especially per tenant, considerations would have to be made; the current OpenTelemetry design in Cloud-RFD#0025 had that type of use-case as an explicit Non-Goal. It could certainly be done, but not right away with what we currently have in place.