New component: Service graphs processor #9232
Looks interesting, what will this export specifically? This is one of the challenges in the Jaeger project that we solve with Spark jobs or Kafka Streams. Not ideal, and this seems like a better solution if we can get the data we need generated for the Jaeger UI.
```yaml
processors:
  servicegraphs:
    wait: 2s                                     # Time to wait for an edge to be completed
    max_items: 200                               # Number of edges stored in the store map
    workers: 10                                  # Number of workers used to process the edges
    histogram_buckets: [1, 2, 4, 8, 16, 32, 64]  # Buckets for the latency histogram, in seconds
    dimensions: [cluster, namespace]             # Additional dimensions (labels) added to the metric along with the default ones
    success_codes:                               # Status codes that are considered successful
      http: [404]
      grpc: [1, 3, 6]
```
I noticed that the section How it works wasn't clear enough, so I've rewritten it a bit to answer that question. In summary, the processor records metrics. These metrics represent edges in the graph, while nodes in the graph are recorded as `client` and `server` labels on the metric.
The current processor generates metrics based on a specification, so it works with Grafana. Since all the efforts have been internal so far, we haven't written a document on this specification, but you can see a table describing the metrics here. We want to keep it compatible with Grafana's current visualization of service graphs, but opening the data generation to other specifications is an open question, I guess.
Hi @mapno, any update on this new processor? :-)
@mapno Are you already working on this? I was already implementing something similar for my company when I found this issue, and I would be interested to help by implementing this component. :)
Hey! Apologies for the delay. I've been busy the last couple of weeks. I want to open a PR by the end of this week or next week. My intention is to port the current implementation from Tempo to the collector. Reviews and new ideas will be very welcome :)
Finally! Opened a PR - #10425. It has some things that need to be improved, but the main architecture and logic behind the processor is there. I think now it's a matter of reviewing the approach and polishing the implementation. |
Great work!
The component has been merged — #13746. Closing the issue. |
The OpenTelemetry Collector as of today can feed any observability platform with a simple form of topology information. Thanks to Logz and Grafana for leading the way from the spanmetrics processor up to the servicegraph processor. It was one big missing pillar among the five APM pillars as defined in APM_Conceptual_Framework.jpg by Gartner. It is a milestone. Architecturally, it is an elegant, brave new placement. Thank you all from the heart.
The purpose and use-cases of the new component
The service graphs processor is a traces processor that builds a map representing the interrelationships between the various services in a system. The processor analyses trace data and generates metrics describing the relationships between the services. These metrics can be used by data visualization apps (e.g. Grafana) to draw a service graph.
Service graphs are useful for a number of use-cases:
Note: This proposal is motivated from this issue: #8998.
How it works
This processor works by inspecting spans and looking for the tag `span.kind`. If it finds the span kind to be `CLIENT` or `SERVER`, it stores the request in a local in-memory store. That request waits until its corresponding client or server pair span is processed, or until the maximum waiting time has passed. When either of those conditions is met, the request is processed and removed from the local store. If the request is complete by that time, it will be recorded as an edge in the graph.
Edges are represented as metrics, while nodes in the graph are recorded as `client` and `server` labels on the metric. Using Grafana Agent's implementation as an example: if service A (client) makes a request to service B (server), that edge will be recorded as a time series in the metric `traces_service_graph_request_total`.

Since the service graph processor has to process both sides of an edge, it needs to process all spans of a trace to function properly. If the spans of a trace are spread out over multiple pipelines, it will not be possible to pair up spans reliably.
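In Prometheus exposition format, such a time series might look like the following (the service names and the sample value are illustrative):

```
traces_service_graph_request_total{client="app", server="db"} 20
```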
TLDR: The processor tries to find the spans belonging to a request as seen from both the client and the server, and creates a metric representing an edge in the graph.
Previous work
This proposal is based on an existing OTel-compatible processor originally built for the Grafana Agent, which has since been ported to Grafana Tempo and improved further.
This processor was built to work very specifically with the Grafana Agent, Tempo, and Grafana, and can't be contributed as-is. However, most of the design and logic can be maintained, and porting the remaining bits to OTel is possible.
These are the main points that need to be addressed to fit the current implementation to OTel:
Example configuration for the component
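The example below mirrors the configuration discussed in the comments above; the field names and values should be read as indicative rather than final.

```yaml
processors:
  servicegraphs:
    wait: 2s                                     # Time to wait for an edge to be completed
    max_items: 200                               # Number of edges stored in the store map
    workers: 10                                  # Number of workers used to process the edges
    histogram_buckets: [1, 2, 4, 8, 16, 32, 64]  # Buckets for the latency histogram, in seconds
    dimensions: [cluster, namespace]             # Additional dimensions (labels) added to the metric
    success_codes:                               # Status codes that are considered successful
      http: [404]
      grpc: [1, 3, 6]
```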
Telemetry data types supported
It supports traces only.
Sponsor (Optional)
@jpkrohling has offered to sponsor this new component (see #8998 (comment))