-
Notifications
You must be signed in to change notification settings - Fork 627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC: export profile metrics at compaction time #3718
base: main
Are you sure you want to change the base?
POC: export profile metrics at compaction time #3718
Conversation
functions := map[string]*queryv1.FunctionList{ | ||
// TODO: | ||
// This must be richer. First, it should be split by tenant. | ||
// Also, we could have functions associated to service_name | ||
// while others are just collected generally no matter what | ||
// service_name we handle | ||
"pyroscope": { | ||
Functions: []string{ | ||
"net/http.HandlerFunc.ServeHTTP", | ||
"runtime.gcBgMarkWorker", | ||
}, | ||
}, | ||
"ride-sharing-app": { | ||
Functions: []string{ | ||
"net/http.HandlerFunc.ServeHTTP", | ||
"runtime.gcBgMarkWorker", | ||
}, | ||
}, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics to export need to be from config.
This PoC export every dimension/profile-type of every tenant and service name, and the functions configured in this map. This should all be configurable.
reader := query_backend.NewBlockReader(w.logger, w.storage) | ||
var res, _ = reader.Invoke(ctx, | ||
&queryv1.InvokeRequest{ | ||
Tenant: []string{c.TenantId}, | ||
StartTime: c.MinTime, | ||
EndTime: c.MaxTime, | ||
Query: []*queryv1.Query{{ | ||
QueryType: queryv1.QueryType_QUERY_METRICS, | ||
Metrics: &queryv1.MetricsQuery{ | ||
FunctionsByServiceName: functions, | ||
}, | ||
}}, | ||
QueryPlan: &queryv1.QueryPlan{ | ||
Root: &queryv1.QueryNode{ | ||
Blocks: []*metastorev1.BlockMeta{c}, | ||
}, | ||
}, | ||
LabelSelector: "{}", | ||
}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New query type "Metrics" help with this task
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Barely the same as Anton's exporter. #2899
Username: "1741027", | ||
Password: "omitted", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was exporting to my grafana cloud instance. But this should come from config, either a single target datasource for OS or integrated with grafana cloud stacks
if len(by) == 0 { | ||
fp, err = reader.Series(postings.At(), &l, &chunks) | ||
} else { | ||
fp, err = reader.SeriesBy(postings.At(), &l, &chunks, by...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function was deleting my labels and was keeping me from fetching them, so I used Series
instead when by
is empty.
for _, c := range compacted { | ||
reader := query_backend.NewBlockReader(w.logger, w.storage) | ||
var res, _ = reader.Invoke(ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite an original solution. However, to my mind, the only reason we're implementing this in compactor is that we don't want to query data. I might be missing something but this is what we're actually doing here: querying our compacted blocks.
I propose an alternative approach:
- We inject our component that extracts data from samples to the block merge process here.
- For every label set we encounter while handling the stream (a unique
Series ID
after theindexRewriter.rewriteRow
call): we check if it matches any of the rules (we still want to support label filters) - If series labels (
ProfileEntry.Labels
) match any rule, handle grouping to get the labels we want to preserve in the output time series (metrics). A conjunction of thegroup by
label values and the rule name identifies the aggregator for the time series. - We add function/call-chain filtering later. For now we focus on the core functionality.
Basically, something very similar to our downsampler, but simpler, as we don't need to handle writes to aux tables.
If we want to generate metrics with "weighted" values, where the values are fractions of the profile total (and we do, I believe), we need to do some more work. We need to get the profile total, not the row total, which is a single series sample set, while a profile may include multiple series. To get a profile total, we need to aggregate row totals by the ProfileID
column: every profile has a unique ID, and as of v2, it's guaranteed, that all the profile samples will be present in the same table, regardless of their labels.
I suggest that we add it later, together with function/call-chain filtering.
@@ -83,6 +83,7 @@ message Query { | |||
TimeSeriesQuery time_series = 5; | |||
TreeQuery tree = 6; | |||
PprofQuery pprof = 7; | |||
MetricsQuery metrics = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can implement the feature without extending the query API.
|
||
labelsFromFp := make(map[uint64]phlaremodel.Labels) | ||
builders := make(map[uint64]map[string]*phlaremodel.TimeSeriesBuilder) | ||
for rows.Next() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid this loop will suffer from very bad performance. Let's skip function filtering for now and add it later – we already have stack trace filter that can handle that, we only need to integrate it.
Regarding the compaction lag: it's totally possible that L0 compaction is delayed because we only compact data once we accumulated enough blocks. However, in the PR you mentioned, we added a parameter that controls for how long a block might be staged. We also introduced an indicator – time to compaction. In our dev env, L0 compaction lag does not exceed 1m and p99 is around 15-20 seconds. I think that relying on the "current" time might be dangerous – we could explore an option where we get timestamp of the blocks (the time they are created). Also, I think that OOO ingestion is almost inevitable: jobs might be run concurrently, and their order is not guaranteed (we don't need it for compaction); usually, this is not an issue, but if the job fails and we retry (which we do), we will likely violate the order. Fortunately, both Mimit and Prometheus handle OOO (with some caveats) |
Prerequisites
Exporting profile metrics at compaction time
This PoC shows how could be export metrics from profiles at compaction time (in fact we do this right after compaction, not at compaction time).
Compaction is something that happens eventually in every block of our object storage. This approach offers some benefits over exporting at ingestion time, as described by tempo's members:
In theory, the first level of compaction (L0 blocks to L1 blocks) is done shortly after the data ingestion (~10s). But in practice, I've observed that L0 compaction happens every 30-120s. I don't know the reasons of such delay (maybe data ingestion is low and compaction happen less often? I only ingest data of 1 tenant with 2 services - every 15s aprox)
Generated metrics
Now that we have a prototype running, we can get a picture of how generated metrics look like.
Dimensions
Every profile type or dimension is exported as a metric with this format:
So for example, if a service writes profile data of 3 different
__profile_type__
, we will export 3 different metrics:process_cpu:cpu:nanoseconds:cpu:nanoseconds
pyroscope_exported_metrics_process_cpu_cpu_nanoseconds_cpu_nanoseconds
memory:alloc_objects:count:space:bytes
pyroscope_exported_metrics_memory_alloc_objects_count_space_bytes
memory:alloc_space:bytes:space:bytes
pyroscope_exported_metrics_memory_alloc_space_bytes_space_bytes
Labels are preserved, unrolling new series for each labelset. So we can query for CPU of a specific pod of a service and some other pprof label like this:
Dimensions metrics are exported for every tenant and every service_name, but this should be configurable by the user.
Functions
This prototype explores also the ability to export metrics on specific functions. We can chose an interesting function to export.
Now it's exporting data for every dimension of the given function under this format:
In this prototype I've hardcoded Garbage colector and HTTP functions to export, for every service_name. I haven't make distinction on tenant yet. The functions to export should come from config (UI is a must here).
In the future, we could specify a filter of LabelSets instead of exporting by service_name. So for example
"foo": "{}"
would export every profile offoo
function. And"foo": "{service_name=\"my-service\", vehicle=\"bike\"}"
would export only for that service_name and vehicle.Detected challenges
This naive solution is full of trade-offs and assumptions and it's far from being a final solution. I've detected some challenges:
DEMO
I have a pyroscope with the changes running in my machine while exporting metrics to my grafana cloud instance.
Go grant yourself privileges in the admin page:
You can take a look on exported metrics here:
https://albertosotogcp.grafana.net/explore/metrics/trail?from=now-1h&to=now&timezone=browser&var-ds=grafanacloud-prom&var-otel_resources=&var-filters=&var-deployment_environment=&metricSearch=pyroscope_&metricPrefix=all
You can see a demo dashboard, were I tried to simulate an alert of >20% of CPU of garbage collection or >60% of memory in HTTP requests:
https://albertosotogcp.grafana.net/goto/eFRA8E7NR?orgId=1