
New component: Service graphs processor #9232

Closed
mapno opened this issue Apr 12, 2022 · 10 comments
Labels
Accepted Component New component has been sponsored

Comments

@mapno
Contributor

mapno commented Apr 12, 2022

The purpose and use-cases of the new component

The service graphs processor is a traces processor that builds a map representing the interrelationships between various services in a system. The processor will analyse trace data and generate metrics describing the relationship between the services. These metrics can be used by data visualization apps (e.g. Grafana) to draw a service graph.

Service graphs are useful for a number of use-cases:

  • Infer the topology of a distributed system. As distributed systems grow, they become more complex. Service graphs can help you understand the structure of the system.
  • Provide a high-level overview of the health of your system. Service graphs show error rates and latencies, among other relevant data.
  • Provide a historical view of a system’s topology. Distributed systems change very frequently, and service graphs offer a way of seeing how these systems have evolved over time.

Note: This proposal is motivated by this issue: #8998.

How it works

This processor works by inspecting spans and looking for the tag span.kind. If it finds the span kind to be CLIENT or SERVER, it stores the request in a local in-memory store.

That request waits until its corresponding client or server pair span is processed or until the maximum waiting time has passed. When either of those conditions is reached, the request is processed and removed from the local store. If the request is complete by that time, it’ll be recorded as an edge in the graph.
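
As a rough illustration of the pairing logic described above, here is a minimal sketch in Go. It is not the actual implementation: the edge type, the key format, and the callback names are hypothetical, and a real store also needs expiry sweeps, eviction when max_items is reached, and worker handling.

package servicegraphprocessor

import (
    "sync"
    "time"
)

// edge holds one in-progress connection between a client span and a server
// span. Both halves of the pair map to the same key: trace ID plus the span
// ID of the client span (which is the parent span ID of the server span).
type edge struct {
    clientService string
    serverService string
    expiration    time.Time
}

// complete reports whether both sides of the edge have been seen.
func (e *edge) complete() bool {
    return e.clientService != "" && e.serverService != ""
}

// store keeps unpaired edges in memory until the counterpart span arrives
// or the maximum waiting time passes.
type store struct {
    mtx   sync.Mutex
    ttl   time.Duration
    edges map[string]*edge
}

func newStore(ttl time.Duration) *store {
    return &store{ttl: ttl, edges: make(map[string]*edge)}
}

// upsertEdge records one side of an edge. onComplete is invoked once both
// sides have been seen; expired edges are dropped by a periodic sweep
// (omitted here for brevity).
func (s *store) upsertEdge(key string, update func(*edge), onComplete func(*edge)) {
    s.mtx.Lock()
    defer s.mtx.Unlock()

    e, ok := s.edges[key]
    if !ok {
        e = &edge{expiration: time.Now().Add(s.ttl)}
        s.edges[key] = e
    }
    update(e)

    if e.complete() {
        onComplete(e) // record the completed edge as a metric
        delete(s.edges, key)
    }
}

A CLIENT span would call upsertEdge with its trace ID plus its own span ID as the key and fill in the client side; the matching SERVER span uses its trace ID plus its parent span ID, so both halves of the request land on the same entry.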

Edges are represented as metrics, while nodes in the graphs are recorded as client and server labels in the metric.

Using the Grafana Agent's implementation as an example: if service A (client) makes a request to service B (server), that request will be recorded as a time series in the metric traces_service_graph_request_total. In Prometheus representation:

traces_service_graph_request_total{client="A",server="B"} 1

Since the service graph processor has to process both sides of an edge, it needs to see all spans of a trace to function properly. If the spans of a trace are spread out over multiple pipelines, it will not be possible to pair up spans reliably.

TLDR: The processor will try to find spans belonging to requests as seen from the client and the server and will create a metric representing an edge in the graph.

Previous work

This proposal is based on an existing OTel-compatible processor originally built for the Grafana Agent, which has since been ported over to Grafana Tempo and improved further.

This processor was built very specifically for the Grafana Agent, Tempo, and Grafana, and can't be contributed as-is. However, most of the design and logic can be maintained, and porting the remaining bits to OTel is possible.

These are the main points that need to be addressed to fit the current implementation to OTel:

  • Primary change: Metrics are built as Prometheus metrics and collected in a Prometheus registry. The processor would have to be changed to create OTLP metrics, which would be pushed to a metrics pipeline (same architecture as the spanmetrics processor); see the sketch after this list.
  • The processor relies on receiving complete traces in the same pipeline. This is the same problem that the tailsampling processor and other components have. This would have to be left to the users to solve.
  • Metrics are generated following a specification to work closely with Grafana (you can see a table describing the metrics here). I presume that some configurability would be needed to allow customization of the emitted metrics.
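
For the first point, a minimal sketch of what building the traces_service_graph_request_total counter as OTLP metrics and handing it to a metrics pipeline could look like, using the collector's current pdata API. The function names and the counts map keyed by (client, server) pairs are illustrative assumptions, not the final design.

package servicegraphprocessor

import (
    "context"
    "time"

    "go.opentelemetry.io/collector/consumer"
    "go.opentelemetry.io/collector/pdata/pcommon"
    "go.opentelemetry.io/collector/pdata/pmetric"
)

// buildRequestTotal turns accumulated edge counts, keyed by (client, server),
// into a cumulative OTLP sum metric with one data point per pair.
func buildRequestTotal(counts map[[2]string]int64) pmetric.Metrics {
    md := pmetric.NewMetrics()
    sm := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()

    m := sm.Metrics().AppendEmpty()
    m.SetName("traces_service_graph_request_total")
    sum := m.SetEmptySum()
    sum.SetIsMonotonic(true)
    sum.SetAggregationTemporality(pmetric.AggregationTemporalityCumulative)

    now := pcommon.NewTimestampFromTime(time.Now())
    for pair, count := range counts {
        dp := sum.DataPoints().AppendEmpty()
        dp.SetTimestamp(now)
        dp.SetIntValue(count)
        dp.Attributes().PutStr("client", pair[0])
        dp.Attributes().PutStr("server", pair[1])
    }
    return md
}

// flushMetrics pushes the generated metrics to the next consumer in a
// metrics pipeline, the same pattern the spanmetrics processor uses.
func flushMetrics(ctx context.Context, next consumer.Metrics, counts map[[2]string]int64) error {
    return next.ConsumeMetrics(ctx, buildRequestTotal(counts))
}

This mirrors the spanmetrics processor's approach of generating metrics inside a traces component and forwarding them to the next metrics consumer.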

Example configuration for the component

processors:
  servicegraphs:
    wait: 2s # Time to wait for an edge to be completed
    max_items: 200 # Maximum number of edges that will be stored in the storeMap
    workers: 10 # Number of workers that will be used to process the edges
    histogram_buckets: [1, 2, 4, 8, 16, 32, 64] # Buckets for latency histogram in seconds
    dimensions: [cluster, namespace] # Additional dimensions (labels) to be added to the metric along with the default ones.
    success_codes: # Status codes that are considered successful
      http: [404]
      grpc: [1, 3, 6]
 

Telemetry data types supported

It supports traces only.

Sponsor (Optional)

@jpkrohling has offered to sponsor this new component (see #8998 (comment))

@jkowall
Contributor

jkowall commented Apr 12, 2022

Looks interesting, what will this export specifically? This is one of the challenges in the Jaeger project we solve with Spark jobs or Kafka Streams. Not ideal, and this seems like a better solution if we can get the data we need generated for the Jaeger UI.

@JaredTan95
Member

processors:
  servicegraphs:
    wait: 2s # Time to wait for an edge to be completed
    max_items: 200 # Maximum number of edges that will be stored in the storeMap
    workers: 10 # Number of workers that will be used to process the edges
    histogram_buckets: [1, 2, 4, 8, 16, 32, 64] # Buckets for latency histogram in seconds
    dimensions: [cluster, namespace] # Additional dimensions (labels) to be added to the metric along with the default ones.
    success_codes: # Status codes that are considered successful
      http: [404]
      grpc: [1, 3, 6]

👍 The dimensions functionality is great, and it is a very good fit for k8s environments.

@mapno
Contributor Author

mapno commented Apr 18, 2022

Looks interesting, what will this export specifically?

I noticed that the section How it works wasn't clear enough. I've rewritten it a bit to answer that question. In summary, the processor records metrics. These metrics represent edges in the graphs, while nodes in the graphs are recorded as client and server labels in the metric.

This is one of the challenges in the Jaeger project we solve with Spark jobs or Kafka Streams. Not ideal, and this seems like a better solution if we can get the data we need generated for the Jaeger UI.

The current processor generates metrics based on a specification so it works with Grafana. Since all the efforts have been internal so far, we haven't written a document on this specification, but you can see a table describing the metrics here.

We want to keep it compatible with Grafana's current visualization of service graphs, but opening the data generation to other specifications is an open question I guess.

@JaredTan95
Member

Hi @mapno, any update on this new processor? :-)

@sarahsporck

@mapno Are you already working on this? I was implementing something similar for my company when I found this issue, and I'd be interested in helping by implementing this component. :)

@mapno
Contributor Author

mapno commented May 11, 2022

Hey! Apologies for the delay. I've been busy the last couple of weeks. I want to open a PR by the end of this week or next week. My intention is to port the current implementation from Tempo to the collector. Reviews and new ideas will be very welcome :)

@mapno
Contributor Author

mapno commented May 30, 2022

Finally! Opened a PR - #10425. It has some things that need to be improved, but the main architecture and logic behind the processor are there. I think now it's a matter of reviewing the approach and polishing the implementation.

@JaredTan95
Member

Finally! Opened a PR - #10425. It has some things that need to be improved, but the main architecture and logic behind the processor are there. I think now it's a matter of reviewing the approach and polishing the implementation.

Great work!

@mapno
Contributor Author

mapno commented Sep 7, 2022

The component has been merged — #13746. Closing the issue.

@mapno mapno closed this as completed Sep 7, 2022
@devrimdemiroz

devrimdemiroz commented Sep 7, 2022

As of today, the OpenTelemetry Collector can feed any observability platform with a simple form of topology information. Thanks to logz and Grafana for leading the way from the spanmetrics processor up to the servicegraph processor. Topology was one big missing pillar among the five APM pillars defined in Gartner's APM conceptual framework (APM_Conceptual_Framework.jpg). This is a milestone, and architecturally it is an elegant, brave new placement. Thank you all from the heart.
