
RFD 65 Distributed Tracing #11713

Merged: 4 commits merged into master from rfd/0065-distributed-tracing on Aug 4, 2022
Conversation

rosstimothy (Contributor)

No description provided.

@rosstimothy rosstimothy added the rfd Request for Discussion label Apr 4, 2022
@github-actions github-actions bot requested review from jakule and timothyb89 April 4, 2022 20:29
Four resolved review threads on rfd/0065-distributed-tracing.md (one outdated).
@strideynet strideynet self-requested a review June 1, 2022 13:12
@strideynet (Contributor) left a comment

This is looking really good. I've found tracing to be an incredibly useful tool in working through performance issues at scale, and it starts moving us towards the Otel ecosystem, which I believe will only grow in popularity as it matures over the next year or two.

Whilst there will be plenty of refactoring needed to implement this (e.g. properly propagating context across Teleport's codebase), most of this work is really long overdue anyway.
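For a sense of what that refactoring enables, here is a minimal sketch using the OpenTelemetry Go API. The function and span names below are placeholders, not actual Teleport call sites: once a context is threaded through a call path, each instrumented function can start a child span from it, so client and server spans end up in the same trace.

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
)

// handleRequest stands in for any Teleport function that already receives a
// context. Once contexts are propagated end to end, each call site can start
// a child span that nests under whatever span the caller (or a remote client
// such as tsh) has already opened.
func handleRequest(ctx context.Context) error {
	ctx, span := otel.Tracer("teleport/example").Start(ctx, "example/handleRequest")
	defer span.End()

	return doWork(ctx) // passing ctx on keeps downstream spans in the same trace
}

func doWork(ctx context.Context) error {
	_, span := otel.Tracer("teleport/example").Start(ctx, "example/doWork")
	defer span.End()
	return nil
}
```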

### Non-Goals

* Adding tracing to the entire codebase all at once
* Replace existing logging, metrics

As a side note on this, I largely agree with avoiding changing anything re: metrics. My recent experiences with Otel Tracing have been extremely positive, but the Metrics API is still incredibly immature and making a lot of breaking changes, and I've also had a few issues/bugs with their Metrics SDK. We can probably safely revisit the metrics side of Otel in six to twelve months' time.

In order to propagate spans that originated from `tctl`, `tsh`, or `tbot` we have two options:

1) Export the spans directly into the telemetry backend
2) Make the Auth Server a span forwarder

I strongly agree with pursuing the second option here. Exporting spans directly to the telemetry backend raises a lot of issues around preventing abuse and authentication, especially if we want to use this within Teleport cloud.
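For a sense of what the span forwarder in option 2 involves, here is a rough sketch built on the OTLP gRPC TraceService types. The struct and its wiring are illustrative assumptions, not Teleport's actual implementation; only the proto packages (which the OTLP exporters speak) are real.

```go
package tracing

import (
	"context"

	coltracepb "go.opentelemetry.io/proto/otlp/collector/trace/v1"
	"google.golang.org/grpc"
)

// spanForwarder is an illustrative gRPC OTLP TraceService: clients such as
// tsh, tctl, and tbot export their spans to the Auth Server, which re-exports
// them to the cluster's configured telemetry backend.
type spanForwarder struct {
	coltracepb.UnimplementedTraceServiceServer
	upstream coltracepb.TraceServiceClient // connection toward the exporter_url backend
}

// Export forwards the client's spans verbatim. This is also the natural place
// to tag forwarded spans so a downstream pipeline can tell them apart from
// spans generated by the cluster itself.
func (f *spanForwarder) Export(ctx context.Context, req *coltracepb.ExportTraceServiceRequest) (*coltracepb.ExportTraceServiceResponse, error) {
	return f.upstream.Export(ctx, req)
}

// newSpanForwarder wires the forwarder to an upstream OTLP gRPC connection.
func newSpanForwarder(conn *grpc.ClientConn) *spanForwarder {
	return &spanForwarder{upstream: coltracepb.NewTraceServiceClient(conn)}
}
```

The appeal over option 1 is that clients keep talking only to the Auth Server over their existing authenticated channel, rather than needing credentials for (or network access to) the telemetry backend.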

Resolved review thread on rfd/0065-distributed-tracing.md.
@rosstimothy rosstimothy force-pushed the rfd/0065-distributed-tracing branch from 83c062a to 014708e Compare June 8, 2022 14:44
@espadolini espadolini self-requested a review June 14, 2022 12:07
@fspmarshall (Contributor) left a comment

LGTM. I don't know how feasible it would be, but I think it would be very nice to have some mechanism to export a trace locally without setting up a cluster-level export destination. I don't know if there is a commonly agreed upon file format for traces that could be exported to, but even just being able to run an export destination locally in docker when running something like tsh ls --trace would be nice. Would be especially useful if we're trying to assist people in debugging an issue when they don't currently have a cluster-level tracing endpoint enabled.

@strideynet (Contributor)

> I don't know how feasible it would be, but I think it would be very nice to have some mechanism to export a trace locally without setting up a cluster-level export destination. I don't know if there is a commonly agreed upon file format for traces that could be exported to, but even just being able to run an export destination locally in docker when running something like tsh ls --trace would be nice.

The underlying otel SDK we use does provide a so-called stdouttrace exporter, which rather than submitting the spans to a gRPC OTLP TraceService (as we currently use), writes the spans as JSON to an io.Writer. It would certainly be feasible to have a flag that would enable this behaviour and write to a file instead. This is probably the closest you could get to a commonly agreed format for this, but I'm unsure of what tooling actually exists for usefully interpreting these files.

The other option would be to let the user directly provide an address for a gRPC OTLP TraceService, and to use that instead of the auth server. This would give the most flexibility, as the user could then configure an OpenTelemetry Collector instance to export these spans onto some preferred tool (e.g. Jaeger).

Whilst neither of these are that complex to implement, I am somewhat dubious about whether it'll add much value for a user. In most cases, the value of tracing comes from looking at spans across the distributed system.
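To make both of those options concrete, here is a minimal sketch of what a hypothetical --trace-exporter value could map to in the Go SDK, choosing an exporter from the URL scheme. The flag, the file handling, and the helper name are assumptions, not an agreed design.

```go
package tracing

import (
	"context"
	"fmt"
	"net/url"
	"os"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newExporter builds a span exporter from a hypothetical --trace-exporter
// value, e.g. "file:///tmp/trace.json" or "grpc://collector.example.com:4317".
func newExporter(ctx context.Context, rawURL string) (sdktrace.SpanExporter, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	switch u.Scheme {
	case "file":
		// stdouttrace writes spans as JSON to any io.Writer, here a local file.
		f, err := os.Create(u.Path)
		if err != nil {
			return nil, err
		}
		return stdouttrace.New(stdouttrace.WithWriter(f), stdouttrace.WithPrettyPrint())
	case "grpc":
		// Export directly to a user-provided OTLP gRPC TraceService, e.g. an
		// OpenTelemetry Collector that fans out to Jaeger or another backend.
		return otlptracegrpc.New(ctx,
			otlptracegrpc.WithEndpoint(u.Host),
			otlptracegrpc.WithInsecure(),
		)
	default:
		return nil, fmt.Errorf("unsupported exporter scheme %q", u.Scheme)
	}
}
```

Whichever exporter is built would then be handed to the TracerProvider in place of the default forwarding behaviour.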

@rosstimothy (Contributor, Author)

> I don't know how feasible it would be, but I think it would be very nice to have some mechanism to export a trace locally without setting up a cluster-level export destination. I don't know if there is a commonly agreed upon file format for traces that could be exported to, but even just being able to run an export destination locally in docker when running something like tsh ls --trace would be nice.

> The underlying otel SDK we use does provide a so-called stdouttrace exporter, which rather than submitting the spans to a gRPC OTLP TraceService (as we currently use), writes the spans as JSON to an io.Writer. It would certainly be feasible to have a flag that would enable this behaviour and write to a file instead. This is probably the closest you could get to a commonly agreed format for this, but I'm unsure of what tooling actually exists for usefully interpreting these files.

> The other option would be to let the user directly provide an address for a gRPC OTLP TraceService, and to use that instead of the auth server. This would give the most flexibility, as the user could then configure an OpenTelemetry Collector instance to export these spans onto some preferred tool (e.g. Jaeger).

> Whilst neither of these are that complex to implement, I am somewhat dubious about whether it'll add much value for a user. In most cases, the value of tracing comes from looking at spans across the distributed system.

Writing traces to a file has come up a few times, and as @strideynet pointed out, it is definitely possible to write OTLP spans to a file in JSON format. However, the file would only contain the spans that are generated by tsh. All spans created by teleport would still be written to the exporter_url defined in its tracing_service config.

Jaeger actually has built-in functionality to allow you to upload such files, and it will display the traces as if they were exported there directly. I haven't tested Jaeger to see if it is smart enough to associate spans uploaded from files with spans exported directly to it. Even if that were the case, I can't say whether other telemetry backends have similar behavior. There is also a high probability that most users won't have access to the telemetry backend that teleport is shipping its traces to, so maybe that's a moot point.


Perhaps just capturing the tsh spans is enough to give an idea of where latency is coming from, but it certainly wouldn't paint the entire picture that forwarding spans does. That being said, I do think that there is some utility to writing traces to files. Allowing a file-based exporter_url might be the lowest barrier to entry for people looking to start gathering traces: they might not have an existing telemetry backend, and setting one up might be something they don't have the time or resources to do. Similarly for tsh --trace, I think giving folks the option to consume traces however it is most useful to them is best. The forwarding option described was chosen because it was the simplest way to get traces to the same telemetry backend used by teleport without users having to know anything about where the traces were going or how to get them there.

It would be easy enough to add flags to tsh that correspond directly to the tracing_service config to override the default forwarding behavior.

  • tsh --trace - forwards spans to the auth server, which forwards them to its exporter_url
  • tsh --trace --trace-exporter=file:///tmp/trace.json - writes all spans to /tmp/trace.json
  • tsh --trace --trace-exporter=grpc://collector.example.com:4317 - exports spans to a remote exporter url

@espadolini (Contributor) left a comment

A couple of things that were discussed with @gozer and @rosstimothy in light of gravitational/cloud#1765:

Resolved review thread on rfd/0065-distributed-tracing.md (outdated).
Comment on lines +184 to +185
# the url of the exporter to send spans to - can be http/https/grpc
exporter_url: "http://localhost:14268/api/traces"

Having a way for teleport itself to use the auth tracing forwarder might be useful.
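For context, the quoted exporter_url line sits inside the proposed tracing_service section of teleport.yaml. A rough illustration follows; everything beyond the quoted comment and the URL field is an assumption rather than the RFD's final shape.

```yaml
tracing_service:
  enabled: yes
  # the url of the exporter to send spans to - can be http/https/grpc
  exporter_url: "grpc://collector.example.com:4317"
```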

Resolved review thread on rfd/0065-distributed-tracing.md.
@rosstimothy rosstimothy force-pushed the rfd/0065-distributed-tracing branch from 014708e to 625df6d Compare July 21, 2022 14:56
@github-actions github-actions bot removed the request for review from timothyb89 July 21, 2022 14:56
@xinding33 (Contributor) left a comment

I'm a fan of using OpenTelemetry as it allows for easy export of spans for analysis. Do we want to have someone from the Cloud team take a look given that we'd need to instrument this on Teleport Cloud?

Edit: Cloud has already reviewed this and staging is already ingesting spans.

Resolved review thread on rfd/0065-distributed-tracing.md.
@rosstimothy (Contributor, Author)

> I'm a fan of using OpenTelemetry as it allows for easy export of spans for analysis. Do we want to have someone from the Cloud team take a look given that we'd need to instrument this on Teleport Cloud?

It doesn't look like @gozer has approved this RFD, but I know he has looked it over in the past. We have been working together to get Cloud to ingest spans, which is currently already happening in staging. In fact, it was his suggestion to tag the forwarded spans so Cloud could potentially omit some spans from their ingestion pipeline.

@gozer (Contributor) commented on Aug 4, 2022

> It doesn't look like @gozer has approved this RFD,

Didn't exactly feel appropriate for me to do so, but I certainly could look again and approve.

> but I know he has looked it over in the past.

I sure have

> We have been working together to get Cloud to ingest spans, which is currently already happening in staging.

Will also be running in production this week, in theory.

> In fact, it was his suggestion to tag the forwarded spans so Cloud could potentially omit some spans from their ingestion pipeline.

Yes, I remember discussing two distinct possibilities and factoring them into the cloud's design. One was allowing differentiation of client-initiated traces (tsh --trace), and the other was the possibility of customer-forwarded traces.

As an aside, if the plan is to support sending trace data to arbitrary third-party destinations, especially per tenant, considerations would have to be made; the current OpenTelemetry Cloud-RFD#0025 had that type of use case as an explicit non-goal.

Could certainly be done, but not right away with what we currently have in place.

@rosstimothy rosstimothy force-pushed the rfd/0065-distributed-tracing branch from 48154dd to bf6b894 Compare August 4, 2022 15:03
@rosstimothy rosstimothy enabled auto-merge (squash) August 4, 2022 15:05
@rosstimothy rosstimothy merged commit efea7bc into master Aug 4, 2022
@zmb3 zmb3 deleted the rfd/0065-distributed-tracing branch September 9, 2022 18:53