Distributed Tracing #12241

Closed
rosstimothy opened this issue Apr 26, 2022 · 1 comment
Labels
feature-request Used for new features in Teleport, improvements to current should be #enhancements

Comments

@rosstimothy
Contributor

What would you like Teleport to do?

Instrument and export distributed tracing spans to make Teleport easier to monitor, manage, and troubleshoot.

This is an umbrella issue for the distributed tracing work, including the RFD, backend, and CLI changes.

What problem does this solve?

Identifying sources of latency, getting a clearer picture of how components interact across service boundaries, and collecting debug information.

If a workaround exists, please include it.

Try to piece together related log statements.

@rosstimothy rosstimothy added the feature-request Used for new features in Teleport, improvements to current should be #enhancements label Apr 26, 2022
@rosstimothy rosstimothy self-assigned this Apr 26, 2022
@rosstimothy
Contributor Author

rosstimothy commented Apr 26, 2022

Tasks related to RFD 65. Once they are complete, Teleport can be configured to export spans to a telemetry backend, and any spans originating from tsh will be forwarded to the auth server and exported as well.

Server-side changes

  • Add auto-instrumentation for HTTP and gRPC clients and servers
  • Add instrumentation to SSH clients and servers
  • Add and consume tracing configuration per RFD 65
  • Configure the tracing provider and export spans to the configured exporter
  • Add a span exporter gRPC service to the auth server
  • Instrument cache
  • Instrument backend
  • Add other instrumentation

tsh changes

  • Forward spans to auth server
  • Add instrumentation to tsh client
  • Add new flag to force spans to be exported

Documentation

  • Write public-facing docs
  • Update RFD

rosstimothy added a commit that referenced this issue Jun 11, 2022
Create spans for all public-facing TeleportClient,
ProxyClient, and NodeClient methods. This makes
spans easier to correlate and reason about when
looking at `tsh` traces. Creating these spans also
requires some additional context propagation to
ensure that spans are linked properly.

This also removes the unused `quiet` argument from
`ConnectToCluster`. Its usage was inconsistent
among existing callers, and it was ignored anyway,
so it was removed to avoid confusion in future calls.

#12241
rosstimothy added a commit that referenced this issue Jun 11, 2022
Adds a `trace.Tracer` to the `backend.Reporter`
wrapper so that all `backend.Backend` implementations
can be traced. Further instrumentation of each specific
backend will be added later to show how long each SQL
query, or call to DynamoDB/etcd, took within each
backend operation.

#12241
rosstimothy added a commit that referenced this issue Jun 24, 2022
Add wrappers for ssh.Client, ssh.Session, ssh.Channel, ssh.ServerConn,
and ssh.NewCh that pass tracing context along with all SSH messages.
To maintain backwards compatibility, the ssh.Client wrapper
tries to open a TracingChannel when constructed. Any servers that
don't support tracing will reject the unknown channel. The client
will only provide tracing context to servers which do NOT reject
the TracingChannel request.

To pass tracing context along, all SSH payloads
are wrapped in an envelope that includes the original payload
AND any trace context. Servers now try to unmarshal all payloads
into an envelope when processing messages. If an envelope is
provided, a new span will be created and the original payload will
be passed along to handlers.

Part of #12241
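A sketch of the envelope round trip, assuming a JSON encoding and a hypothetical `Envelope` with a string map for the trace context (the real field names and encoding may differ). The key property is the fallback: anything that does not decode as an envelope is passed through untouched, so legacy clients keep working.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is a hypothetical sketch of the wrapper: the original SSH
// payload plus trace context (for example a W3C traceparent header).
type Envelope struct {
	PropagationContext map[string]string `json:"propagation_context"`
	Payload            []byte            `json:"payload"`
}

// wrapPayload is used by a tracing-aware client to attach trace context.
func wrapPayload(payload []byte, traceCtx map[string]string) ([]byte, error) {
	return json.Marshal(Envelope{PropagationContext: traceCtx, Payload: payload})
}

// unwrapPayload is used by the server. If the bytes decode as an Envelope
// carrying a payload, return the inner payload and its trace context;
// otherwise treat them as a raw payload from a client without tracing.
func unwrapPayload(raw []byte) ([]byte, map[string]string) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil || env.Payload == nil {
		return raw, nil
	}
	return env.Payload, env.PropagationContext
}

func main() {
	traceCtx := map[string]string{
		"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
	}
	wrapped, err := wrapPayload([]byte("exec ls"), traceCtx)
	if err != nil {
		panic(err)
	}
	payload, tc := unwrapPayload(wrapped)
	fmt.Printf("payload=%q traceparent=%s\n", payload, tc["traceparent"])

	// A legacy client sends a binary SSH payload as-is.
	legacy := []byte{0, 0, 0, 2, 'l', 's'}
	payload, tc = unwrapPayload(legacy)
	fmt.Printf("payload=%q tracing=%t\n", payload, tc != nil)
}
```

Creating the server-side span from the extracted context would happen between unwrapping and calling the handler; that step is omitted here.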
rosstimothy added a commit that referenced this issue Jul 5, 2022
rosstimothy added a commit that referenced this issue Jul 6, 2022
Add tracing support for SSH global requests and channels. Wrappers
for `ssh.Client`, `ssh.Channel`, and `ssh.NewChannel` provide a
mechanism for tracing context to be propagated via a `context.Context`.

To maintain backwards compatibility, the ssh.Client wrapper
tries to open a TracingChannel when constructed. Any servers that
don't support tracing will reject the unknown channel. The client
will only provide tracing context to servers which do NOT reject
the TracingChannel request.

To pass tracing context along, all SSH payloads
are wrapped in an Envelope that includes the original payload
AND any tracing context. Servers now try to unmarshal all payloads
into an Envelope when processing messages. If an Envelope is
provided, a new span will be created and the original payload will
be passed along to handlers.

Part of #12241
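The capability probe can be sketched as follows. `channelOpener`, the `tracing-channel` name, and the fake servers are hypothetical; the real wrapper works against `golang.org/x/crypto/ssh`, whose `OpenChannel` also takes extra data and returns the channel and its request stream. Treating any error as "tracing unsupported" is an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// channelOpener is a hypothetical abstraction over the part of an SSH
// connection that opens channels.
type channelOpener interface {
	OpenChannel(name string) error
}

// errUnknownChannelType stands in for the rejection an older server sends.
var errUnknownChannelType = errors.New("ssh: rejected: unknown channel type")

const tracingChannelName = "tracing-channel" // hypothetical channel name

// tracingClient remembers whether the server accepted the probe channel.
type tracingClient struct {
	conn             channelOpener
	tracingSupported bool
}

func newTracingClient(conn channelOpener) *tracingClient {
	c := &tracingClient{conn: conn}
	// Probe once at construction: servers without tracing support reject
	// the unknown channel type, and the client then never sends envelopes.
	c.tracingSupported = conn.OpenChannel(tracingChannelName) == nil
	return c
}

// legacyServer rejects the unknown channel, like a pre-tracing server.
type legacyServer struct{}

func (legacyServer) OpenChannel(name string) error {
	return fmt.Errorf("%w (%s)", errUnknownChannelType, name)
}

// tracingServer accepts the probe channel.
type tracingServer struct{}

func (tracingServer) OpenChannel(string) error { return nil }

func main() {
	fmt.Println(newTracingClient(tracingServer{}).tracingSupported) // true
	fmt.Println(newTracingClient(legacyServer{}).tracingSupported)  // false
}
```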
rosstimothy added a commit that referenced this issue Jul 6, 2022
rosstimothy added a commit that referenced this issue Jul 9, 2022
Tracing clients can detect if a server doesn't support tracing by
checking for a trace.NotImplemented error in response to an
UploadTraces request. Since the grpc.Conn used by the client is
likely to be bound to that server for the duration of its life,
it doesn't make sense to keep trying to forward traces. Instead,
the client now remembers that a server doesn't support tracing
and will drop any spans.

Part of #12241
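A minimal sketch of remembering that the server lacks tracing support. `errNotImplemented` stands in for the `trace.NotImplemented` error, `uploader` is a hypothetical view of the `UploadTraces` RPC, and the `sync/atomic` types used here assume Go 1.19 or later.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errNotImplemented stands in for the NotImplemented error a server
// returns when it has no UploadTraces handler.
var errNotImplemented = errors.New("not implemented")

// uploader is a hypothetical view of the server's UploadTraces RPC.
type uploader interface {
	UploadTraces(spans []string) error
}

// forwardingClient stops forwarding once the server reports that it does
// not support tracing, since the connection stays bound to that server.
type forwardingClient struct {
	server      uploader
	unsupported atomic.Bool
	dropped     atomic.Int64
}

func (c *forwardingClient) Export(spans []string) error {
	if c.unsupported.Load() {
		c.dropped.Add(int64(len(spans)))
		return nil // drop silently: retrying would fail the same way
	}
	err := c.server.UploadTraces(spans)
	if errors.Is(err, errNotImplemented) {
		c.unsupported.Store(true)
		c.dropped.Add(int64(len(spans)))
		return nil
	}
	return err
}

// oldServer counts calls and always reports that tracing is unsupported.
type oldServer struct{ calls int }

func (s *oldServer) UploadTraces([]string) error {
	s.calls++
	return fmt.Errorf("UploadTraces: %w", errNotImplemented)
}

func main() {
	srv := &oldServer{}
	c := &forwardingClient{server: srv}
	for i := 0; i < 3; i++ {
		_ = c.Export([]string{"span-a", "span-b"})
	}
	fmt.Println(srv.calls, c.dropped.Load()) // 1 6
}
```

Only the first export reaches the server; later batches are dropped locally.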
rosstimothy added a commit that referenced this issue Jul 28, 2022
Adds a wrapper around `ssh.Session` which injects tracing context
in a similar manner to the `ssh.Client` wrapper. All usages of
`ssh.Session` have now been replaced, and each now has the
appropriate `context.Context` passed along.

Part of #12241
rosstimothy added a commit that referenced this issue Jul 28, 2022
rosstimothy added a commit that referenced this issue Aug 1, 2022
Adds a `teleport.forwarded.for` attribute to all spans that are
forwarded to the auth server. This allows consumers of the spans
to identify where the spans are coming from and take action if
needed. In some scenarios it may be desirable to drop forwarded
spans during collection; tagging them gives those consumers a
way to identify them. It also makes it possible to identify a
malicious user who may be trying to spam the telemetry backend
with spans.

Part of #12241
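A sketch of both sides: the auth server tagging spans it receives from a client before re-exporting them, and a downstream consumer dropping them. The `Span` type is a simplified stand-in for an OTLP span, and using the caller's username (`alice`) as the attribute value is an assumption.

```go
package main

import "fmt"

// Span is a hypothetical, simplified span with string attributes.
type Span struct {
	Name       string
	Attributes map[string]string
}

const forwardedForKey = "teleport.forwarded.for"

// tagForwarded marks spans received from a client (for example tsh) with
// the identity of whoever forwarded them, before they are re-exported.
func tagForwarded(spans []Span, forwardedFor string) {
	for i := range spans {
		if spans[i].Attributes == nil {
			spans[i].Attributes = map[string]string{}
		}
		spans[i].Attributes[forwardedForKey] = forwardedFor
	}
}

// dropForwarded is what a downstream consumer might do if it only wants
// spans that Teleport services produced themselves.
func dropForwarded(spans []Span) []Span {
	var kept []Span
	for _, s := range spans {
		if _, ok := s.Attributes[forwardedForKey]; !ok {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	fromTsh := []Span{{Name: "TeleportClient/SSH"}}
	tagForwarded(fromTsh, "alice")
	all := append([]Span{{Name: "auth/GenerateUserCerts"}}, fromTsh...)
	fmt.Println(len(all), len(dropForwarded(all))) // 2 1
}
```

The same attribute lets an operator count forwarded spans per identity, which is one way to spot a client flooding the telemetry backend.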
rosstimothy added a commit that referenced this issue Aug 1, 2022
SSH request tracing (#14124)

rosstimothy added a commit that referenced this issue Aug 2, 2022
* Tag forwarded spans with custom attributes

Adds a `teleport.forwarded.for` attribute to the resource, or to
all spans, that are forwarded to the auth server. This allows
consumers of the spans to identify where the spans are coming
from and take action if needed. In some scenarios it may be
desirable to drop forwarded spans during collection; tagging
them gives those consumers a way to identify them. It also
makes it possible to identify a malicious user who may be
trying to spam the telemetry backend with spans.

Part of #12241
github-actions bot pushed a commit that referenced this issue Aug 4, 2022
reedloden pushed a commit that referenced this issue Aug 10, 2022
Trace ssh sessions

reedloden pushed a commit that referenced this issue Aug 10, 2022
Tag forwarded spans with custom attributes (#14706)

reedloden pushed a commit that referenced this issue Aug 15, 2022
SSH request tracing

reedloden pushed a commit that referenced this issue Aug 15, 2022
Manually instrument `backend.Backend` (#13268)

logand22 pushed a commit that referenced this issue Aug 19, 2022
Tag forwarded spans with custom attributes (#14706)

logand22 pushed a commit that referenced this issue Aug 19, 2022
…#15480)

Prevent forwarding traces to servers which don't support tracing (#14281)

rosstimothy added a commit that referenced this issue Aug 24, 2022
Trace ssh sessions (#14966)

hydridity pushed a commit to hydridity/teleport that referenced this issue Aug 26, 2022
SSH request tracing (gravitational#14124)

hydridity pushed a commit to hydridity/teleport that referenced this issue Aug 26, 2022
Manually instrument `backend.Backend` (gravitational#13268)

hydridity pushed a commit to hydridity/teleport that referenced this issue Aug 26, 2022
Allow traces to be exported to files (gravitational#14332)

* Allow traces to be exported to files

Adds support for exporting traces to a file. While not recommended
for production use, some folks may need to collect traces without
having any telemetry infrastructure in place to store them. To do
so, they can point the `exporter_url` of their `tracing_service` at
a directory, as seen in the following config snippet.

```yaml
tracing_service:
  exporter_url: "file:///var/lib/teleport/traces"
```

Each file contains one JSON-encoded OTLP trace per line.
Files written by the exporter follow the naming convention
`<unix_timestamp>-<random_number>.trace`.

To prevent a trace file from growing without bound, there is a
default size limit of 100MB; once it is reached, the exporter
rotates to a brand new file. Users can adjust the file size limit
by adding a query parameter to the exporter URL, such as `?limit=12345`.

Part of gravitational#12241
hydridity pushed a commit to hydridity/teleport that referenced this issue Aug 26, 2022
…gravitational#15479)

Prevent forwarding traces to servers which don't support tracing (gravitational#14281)

hydridity pushed a commit to hydridity/teleport that referenced this issue Aug 26, 2022
Tag forwarded spans with custom attributes (gravitational#14706)

hatched pushed a commit that referenced this issue Aug 30, 2022
Allow traces to be exported to files (#14332)

rosstimothy added a commit that referenced this issue Sep 14, 2022
The diagnostics docs were all in one page (`setup/reference/metrics.mdx`),
which made things hard to find and wasn't conducive to adding
new content.

Diagnostics information is now located at `setup/diagnostics` with each
section from the original `metrics.mdx` doc now having its own dedicated
page. The section will now show up in the navbar at
"Setup" > "Monitoring Your Cluster".

The content of the profiling page is expanded to include information from
https://github.com/gravitational/teleport/blob/740d184d1cfc69ae2e96c50ee738b13884fb232b/assets/monitoring/README.md#low-level-monitoring
to illustrate what the different profile types are, the information that
they capture, and how to retrieve them.

A new Distributed Tracing page is also added to instruct users on how to set up
the `tracing_service` to collect and export spans (the last open item
for #12241).
rosstimothy added a commit that referenced this issue Sep 19, 2022
rosstimothy added a commit that referenced this issue Sep 29, 2022
rosstimothy added a commit that referenced this issue Oct 10, 2022
rosstimothy added a commit that referenced this issue Oct 14, 2022
* Enhance diagnostics docs

github-actions bot pushed a commit that referenced this issue Oct 14, 2022