
New component: Service graphs processor #9232

Closed
mapno opened this issue Apr 12, 2022 · 10 comments
Labels
Accepted Component New component has been sponsored

Comments

@mapno
Contributor

mapno commented Apr 12, 2022

The purpose and use-cases of the new component

The service graphs processor is a traces processor that builds a map representing the interrelationships between various services in a system. The processor will analyse trace data and generate metrics describing the relationship between the services. These metrics can be used by data visualization apps (e.g. Grafana) to draw a service graph.

Service graphs are useful for a number of use-cases:

  • Infer the topology of a distributed system. As distributed systems grow, they become more complex. Service graphs can help you understand the structure of the system.
  • Provide a high-level overview of the health of your system. Service graphs show error rates and latencies, among other relevant data.
  • Provide a historical view of a system’s topology. Distributed systems change very frequently, and service graphs offer a way of seeing how these systems have evolved over time.

Note: This proposal is motivated by this issue: #8998.

How it works

This processor works by inspecting spans and looking for the tag span.kind. If it finds the span kind to be CLIENT or SERVER, it stores the request in a local in-memory store.

That request waits until its corresponding client or server pair span is processed or until the maximum waiting time has passed. When either of those conditions is reached, the request is processed and removed from the local store. If the request is complete by that time, it’ll be recorded as an edge in the graph.
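
As a rough illustration of the pairing logic described above, here is a minimal sketch in Go. It is not the actual implementation: the edge type, the key format, and the callback names are hypothetical, and a real store also needs expiry sweeps, eviction when max_items is reached, and worker handling.

package servicegraphprocessor

import (
    "sync"
    "time"
)

// edge holds one in-progress connection between a client span and a server
// span. Both halves of the pair map to the same key: trace ID plus the span
// ID of the client span (which is the parent span ID of the server span).
type edge struct {
    clientService string
    serverService string
    expiration    time.Time
}

// complete reports whether both sides of the edge have been seen.
func (e *edge) complete() bool {
    return e.clientService != "" && e.serverService != ""
}

// store keeps unpaired edges in memory until the counterpart span arrives
// or the maximum waiting time passes.
type store struct {
    mtx   sync.Mutex
    ttl   time.Duration
    edges map[string]*edge
}

func newStore(ttl time.Duration) *store {
    return &store{ttl: ttl, edges: make(map[string]*edge)}
}

// upsertEdge records one side of an edge. onComplete is invoked once both
// sides have been seen; expired edges are dropped by a periodic sweep
// (omitted here for brevity).
func (s *store) upsertEdge(key string, update func(*edge), onComplete func(*edge)) {
    s.mtx.Lock()
    defer s.mtx.Unlock()

    e, ok := s.edges[key]
    if !ok {
        e = &edge{expiration: time.Now().Add(s.ttl)}
        s.edges[key] = e
    }
    update(e)

    if e.complete() {
        onComplete(e) // record the completed edge as a metric
        delete(s.edges, key)
    }
}

A CLIENT span would call upsertEdge with its trace ID plus its own span ID as the key and fill in the client side; the matching SERVER span uses its trace ID plus its parent span ID, so both halves of the request land on the same entry.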

Edges are represented as metrics, while nodes in the graphs are recorded as client and server labels in the metric.

Using the Grafana Agent's implementation as an example: if service A (client) makes a request to service B (server), that request will be recorded as a time series in the metric traces_service_graph_request_total. In Prometheus representation:

traces_service_graph_request_total{client="A",server="B"} 1

Since the service graph processor has to process both sides of an edge, it needs to see all spans of a trace to function properly. If the spans of a trace are spread out over multiple pipelines, it will not be possible to pair up spans reliably.

TLDR: The processor will try to find spans belonging to requests as seen from the client and the server and will create a metric representing an edge in the graph.

Previous work

This proposal is based on an existing OTel-compatible processor originally built for the Grafana Agent, which has since been ported over to Grafana Tempo and improved further.

This processor was built very specifically for the Grafana Agent, Tempo, and Grafana, and can't be contributed as-is. However, most of the design and logic can be maintained, and porting the remaining bits to OTel is possible.

These are the main points that need to be addressed to fit the current implementation to OTel:

  • Primary change: Metrics are built as Prometheus metrics and collected in a Prometheus registry. The processor would have to be changed to create OTLP metrics, which would be pushed to a metrics pipeline (same architecture as the spanmetrics processor); see the sketch after this list.
  • The processor relies on receiving complete traces in the same pipeline. This is the same problem that the tailsampling processor and other components have. This would have to be left to the users to solve.
  • Metrics are generated following a specification to work closely with Grafana (you can see a table describing the metrics here). I presume that some configurability would be needed to allow customization of the emitted metrics.
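
For the first point, a minimal sketch of what building the traces_service_graph_request_total counter as OTLP metrics and handing it to a metrics pipeline could look like, using the collector's current pdata API. The function names and the counts map keyed by (client, server) pairs are illustrative assumptions, not the final design.

package servicegraphprocessor

import (
    "context"
    "time"

    "go.opentelemetry.io/collector/consumer"
    "go.opentelemetry.io/collector/pdata/pcommon"
    "go.opentelemetry.io/collector/pdata/pmetric"
)

// buildRequestTotal turns accumulated edge counts, keyed by (client, server),
// into a cumulative OTLP sum metric with one data point per pair.
func buildRequestTotal(counts map[[2]string]int64) pmetric.Metrics {
    md := pmetric.NewMetrics()
    sm := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()

    m := sm.Metrics().AppendEmpty()
    m.SetName("traces_service_graph_request_total")
    sum := m.SetEmptySum()
    sum.SetIsMonotonic(true)
    sum.SetAggregationTemporality(pmetric.AggregationTemporalityCumulative)

    now := pcommon.NewTimestampFromTime(time.Now())
    for pair, count := range counts {
        dp := sum.DataPoints().AppendEmpty()
        dp.SetTimestamp(now)
        dp.SetIntValue(count)
        dp.Attributes().PutStr("client", pair[0])
        dp.Attributes().PutStr("server", pair[1])
    }
    return md
}

// flushMetrics pushes the generated metrics to the next consumer in a
// metrics pipeline, the same pattern the spanmetrics processor uses.
func flushMetrics(ctx context.Context, next consumer.Metrics, counts map[[2]string]int64) error {
    return next.ConsumeMetrics(ctx, buildRequestTotal(counts))
}

This mirrors the spanmetrics processor's approach of generating metrics inside a traces component and forwarding them to the next metrics consumer.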

Example configuration for the component

processors:
  servicegraphs:
    wait: 2s # Time to wait for an edge to be completed
    max_items: 200 # Maximum number of edges that will be stored in the storeMap
    workers: 10 # Number of workers that will be used to process the edges
    histogram_buckets: [1, 2, 4, 8, 16, 32, 64] # Buckets for latency histogram in seconds
    dimensions: [cluster, namespace] # Additional dimensions (labels) to be added to the metric along with the default ones.
    success_codes: # Status codes that are considered successful
      http: [404]
      grpc: [1, 3, 6]
 

Telemetry data types supported

It supports traces only.

Sponsor (Optional)

@jpkrohling has offered to sponsor this new component (see #8998 (comment))

@jkowall
Contributor

jkowall commented Apr 12, 2022

Looks interesting, what will this export specifically? This is one of the challenges in the Jaeger project we solve with Spark jobs or Kafka Streams. Not ideal, and this seems like a better solution if we can get the data we need generated for the Jaeger UI.

@JaredTan95
Member

processors:
  servicegraphs:
    wait: 2s # Time to wait for an edge to be completed
    max_items: 200 # Maximum number of edges that will be stored in the storeMap
    workers: 10 # Number of workers that will be used to process the edges
    histogram_buckets: [1, 2, 4, 8, 16, 32, 64] # Buckets for latency histogram in seconds
    dimensions: [cluster, namespace] # Additional dimensions (labels) to be added to the metric along with the default ones.
    success_codes: # Status codes that are considered successful
      http: [404]
      grpc: [1, 3, 6]

👍 The dimensions functionality is great, and it is a very good fit for k8s environments.

@mapno
Contributor Author

mapno commented Apr 18, 2022

Looks interesting, what will this export specifically?

I noticed that the section How it works wasn't clear enough. I've rewritten it a bit to answer that question. In summary, the processor records metrics. These metrics represent edges in the graphs, while nodes in the graphs are recorded as client and server labels in the metric.

This is one of the challenges in the Jaeger project we solve with Spark jobs or Kafka Streams. Not ideal, and this seems like a better solution if we can get the data we need generated for the Jaeger UI.

The current processor generates metrics based on a specification so it works with Grafana. Since all the efforts have been internal so far, we haven't written a document on this specification, but you can see a table describing the metrics here.

We want to keep it compatible with Grafana's current visualization of service graphs, but opening the data generation to other specifications is an open question I guess.

@JaredTan95
Member

Hi @mapno, any update on this new processor? :-)

@sarahsporck

@mapno Are you already working on this? I was implementing something similar for my company when I found this issue, and I'd be interested in helping by implementing this component. :)

@mapno
Contributor Author

mapno commented May 11, 2022

Hey! Apologies for the delay. I've been busy the last couple of weeks. I want to open a PR by the end of this week or next week. My intention is to port the current implementation from Tempo to the collector. Reviews and new ideas will be very welcome :)

@mapno
Contributor Author

mapno commented May 30, 2022

Finally! Opened a PR - #10425. It has some things that need to be improved, but the main architecture and logic behind the processor are there. I think now it's a matter of reviewing the approach and polishing the implementation.

@JaredTan95
Member

Finally! Opened a PR - #10425. It has some things that need to be improved, but the main architecture and logic behind the processor are there. I think now it's a matter of reviewing the approach and polishing the implementation.

Great work!

@mapno
Contributor Author

mapno commented Sep 7, 2022

The component has been merged — #13746. Closing the issue.

@mapno mapno closed this as completed Sep 7, 2022
@devrimdemiroz

devrimdemiroz commented Sep 7, 2022

As of today, the OpenTelemetry Collector can feed any observability platform with a simple form of topology information. Thanks to logz and Grafana for leading the way from the spanmetrics processor up to the servicegraph processor. Topology was one big missing pillar among the five APM pillars defined in Gartner's APM conceptual framework (APM_Conceptual_Framework.jpg). This is a milestone, and architecturally it is an elegant, brave new placement. Thank you all from the heart.
