Appetite for a query language? #284

zecke · 2021-10-11T13:08:06Z

One use-case I would like to experiment with is to be able to answer questions across a larger set of deployments to drive optimization efforts. Some of the queries might be along the lines of:

Which binary is the most "expensive" (most CPU, most memory, highest rate of allocation) over the last month/day?
Which function is the most "expensive" (most CPU, most memory, highest rate of allocation) over the last month/day? Or narrow this down to a binary (e.g. most expensive function of this binary).
Who could benefit most from a Golang allocator change?
What function to optimize in my library?

The result will be a flat report and not a flamegraph. I wonder whether to approach this by introducing a query language. This requires more thought, but at a high level something like this:

topk(10, merge by (binary) (cpu_profile{binary="frx"}[28d]))
topk(10, merge by (binary, function) (allocations{job="abc"}[1d]))

Or something more advanced like finding the binaries that allocate most memory in a specific function?

topk(10, merge by (binary) (select(allocations{job="abc"}, {function=~".*runtime.malloc.*"})[28d]))
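
To make the intended semantics concrete, here is a minimal Python sketch of what `topk(k, merge by (...) (...))` could compute over already-symbolized samples. The `Sample` record, its fields, and the `merge_by`/`topk` helpers are hypothetical names for illustration only, not anything Parca provides.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    """Hypothetical flattened sample: the self cost of one function."""
    binary: str
    function: str
    value: int  # e.g. CPU nanoseconds or allocated bytes


def merge_by(samples, *keys):
    """Sum sample values grouped by the given label names."""
    totals = defaultdict(int)
    for s in samples:
        group = tuple(getattr(s, k) for k in keys)
        totals[group] += s.value
    return dict(totals)


def topk(k, totals):
    """Return the k most expensive groups, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Made-up data, standing in for everything matched by a selector and range.
samples = [
    Sample("frx", "runtime.mallocgc", 40),
    Sample("frx", "main.handle", 25),
    Sample("api", "runtime.mallocgc", 70),
    Sample("api", "json.Unmarshal", 10),
]

report = topk(1, merge_by(samples, "binary"))
per_function = topk(2, merge_by(samples, "binary", "function"))
```

The result is a flat, sorted list of (labels, value) pairs rather than a tree, which is the "flat report" shape described above.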


thorfour · 2021-10-11T13:39:28Z

Yes! This is definitely something we want to add. Thanks for opening up an issue about it as it's definitely something we need to track.

metalmatze · 2021-10-11T14:00:00Z

As there are many things we don't understand yet, we want to write an in-depth design doc discussing various details for a query language in the coming months.

For now, we are probably going to focus on persistent storage a bit more.
Still, we really want this!

brancz · 2021-10-11T14:58:44Z

Fun fact, we already have a language and a parser, it's just super small right now. It's how autocompletion works today. I always imagined there to be an "advanced" mode that was just a plain query input a la Prometheus, that doesn't use any of the guiding UI elements.

Let's use this issue as a place to collect use cases. My top use case that I would like to see, that I cannot do today: I already know the function name of the function that I want to optimize (for example through distributed tracing), so I want to see all data merged that includes traces that include that function, visualized as a flamegraph.

brancz · 2021-10-11T16:19:06Z

Raw thoughts and it's perfectly possible I'm completely wrong (thoughts still developing): I think function selection should be a secondary filter of some sort. My thinking is so we can do something like:

merge(cpu{job="abc", version="v0.1.0"}) - merge(cpu{job="abc", version="v0.1.1"}) | function="functionThatITriedToOptimize"

(the - would be a diff, because that's what it effectively is, though maybe it should be a function to distinguish absolute and relative diffs)

Not saying that I necessarily like this notation, but I think it demonstrates why I think it should be a "second step" filter.
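
A rough Python sketch of the "diff first, filter second" idea, assuming a profile is just a mapping from a stack (a tuple of function names, root first) to a value. The helpers `merge`, `diff`, and `filter_function` are hypothetical and only mirror the notation above; the diff here is the absolute variant.

```python
from collections import Counter


def merge(profiles):
    """Sum values per stack across several profiles."""
    total = Counter()
    for profile in profiles:
        total.update(profile)  # Counter.update adds values per key
    return dict(total)


def diff(left, right):
    """Absolute per-stack difference, left minus right, as in the notation above."""
    stacks = set(left) | set(right)
    return {s: left.get(s, 0) - right.get(s, 0) for s in stacks}


def filter_function(stacks, function):
    """Second-step filter: keep only stacks that include the function."""
    return {s: v for s, v in stacks.items() if function in s}


# Made-up profiles for two versions of the same job.
v010 = [
    {("main", "functionThatITriedToOptimize"): 100, ("main", "other"): 50},
    {("main", "functionThatITriedToOptimize"): 80},
]
v011 = [
    {("main", "functionThatITriedToOptimize"): 60, ("main", "other"): 55},
]

result = filter_function(
    diff(merge(v010), merge(v011)),
    "functionThatITriedToOptimize",
)
```

Applying the filter after the diff means both sides are aggregated under the same rules before anything is dropped, which is one reason a second-step filter may be easier to reason about.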

yeya24 · 2021-10-11T22:27:13Z

One thing I found hard to understand is the query data model in Parca. The query_range API returns some metric series, but the query API only returns one profile (or one merged profile).

In this case, what's the meaning of cpu_profile{binary="frx"}[28d]? Is this a range of metrics or a merged profile?

brancz · 2021-10-12T07:06:40Z

Yeah, I think it can be confusing because query_range and query don't have the same relationship as in Prometheus, but I do think the query_range possibilities will change quite a bit. I imagine being able to visualize the top_k stack traces over time so that the current query_range will actually become sum(<current-query-selector>).

zecke · 2021-10-24T03:09:15Z

Makes sense. One additional use-case might be release qualification/roll-out qualification. This might be a bit far-fetched, but in a canary judge I would like to know if the canary is (significantly) less efficient than before (or than the other running tasks).

Questions:
What is efficient? Number of samples? Can one weight it?
Ideally something like (averaged) cost per query (might need to combine Parca and Prometheus) over a period of time?

brancz · 2021-10-25T08:40:18Z

I'd like to think we can get quite far knowing the duration, period and samples and using that for relative comparisons, but I agree the moment where the canary is not an equal participant in the system it gets significantly harder to judge. I think the need for weighting is inevitable.
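
As a sketch of the kind of relative comparison meant here: assuming each CPU sample stands for one sampling period of CPU time, the sample count, period, and profile duration give an average number of busy cores that can be compared between a baseline and a canary. The function names and numbers below are made up, and this deliberately ignores weighting by traffic.

```python
def cpu_cores(sample_count: int, period_ns: int, duration_ns: int) -> float:
    """Average cores busy during the profile.

    Assumption: every sample represents period_ns of CPU time.
    """
    return sample_count * period_ns / duration_ns


def relative_change(baseline: float, canary: float) -> float:
    """Positive means the canary used more CPU than the baseline."""
    return (canary - baseline) / baseline


# 100 Hz sampling (10 ms period); the canary was profiled for a shorter window.
baseline = cpu_cores(sample_count=3000, period_ns=10_000_000, duration_ns=60_000_000_000)
canary = cpu_cores(sample_count=900, period_ns=10_000_000, duration_ns=15_000_000_000)
change = relative_change(baseline, canary)
```

Normalizing by duration makes profiles of different lengths comparable, but it only holds up when both sides serve similar load, which is where weighting would come in.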

brancz · 2021-12-06T13:36:35Z

I think some things are starting to crystallize for me. Primarily that the language should revolve around selection, aggregation, and manipulation of stack traces, as opposed to thinking of "profiles" as a unit (stack traces that have a selector attached to them are instead the unit).

If we think of it in that way, there is no longer a distinction between merging and not merging: everything becomes an aggregation of stack traces, and this can be either at a specific point in time or across time. Happy little accident that so far that's how the selectors happened to also work.

A couple of things in addition to what I think we need to be able to express (and some of these need to be changed in the general UX of querying, not just a query language but I think it goes hand in hand):

- Select stack traces regardless of time
  - I think this covers the previously mentioned canary use case super well: we don't care over which timeframe a comparison is made, we care across which versions (in the canary case)
- Last stack traces seen with selector
  - e.g. "Show me the last stack traces of heap before the process ended (for example through an out-of-memory error)"
- Select only a subset of stack traces within profiles originating from targets
  - Select by pprof labels
  - Select by function/mapping/location being included in the stack trace
  - Select by function/mapping/location being ordered in a specific way, e.g. main() calling expensiveLoop1()
- Cutting stack traces starting at a specific location and using the location that was cut at as the new root
  - Takes a location search as parameters
- Flipping stack traces, e.g. to find out which call chains caused a certain function to consume the most CPU cycles (or other measurements)

Any combination of these should be diff-able against each other.
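
A small Python sketch of a few of the stack trace operations listed above (select by inclusion, select by call order, cut at a location, flip), assuming stacks are tuples of function names ordered root first and mapped to a value. All helper names are hypothetical; none of this reflects how Parca stores or queries stack traces.

```python
from collections import defaultdict


def select_containing(stacks, function):
    """Keep stacks that include the function anywhere."""
    return {s: v for s, v in stacks.items() if function in s}


def select_ordered(stacks, caller, callee):
    """Keep stacks where caller appears above callee (not necessarily directly)."""
    def calls(stack):
        return caller in stack and callee in stack[stack.index(caller) + 1:]
    return {s: v for s, v in stacks.items() if calls(s)}


def cut_at(stacks, function):
    """Make the first occurrence of function the new root; merge equal results."""
    out = defaultdict(int)
    for s, v in stacks.items():
        if function in s:
            out[s[s.index(function):]] += v
    return dict(out)


def flip(stacks):
    """Reverse every stack so leaves become roots (callers view)."""
    out = defaultdict(int)
    for s, v in stacks.items():
        out[tuple(reversed(s))] += v
    return dict(out)


stacks = {
    ("main", "expensiveLoop1", "alloc"): 5,
    ("main", "helper", "alloc"): 3,
    ("main", "expensiveLoop1", "compute"): 7,
}

only_helper = select_containing(stacks, "helper")
main_calls_loop = select_ordered(stacks, "main", "expensiveLoop1")
rooted_at_loop = cut_at(stacks, "expensiveLoop1")
callers_view = flip(stacks)
```

Because every operation takes and returns the same stack-to-value mapping, the results compose and any two of them can be diffed against each other.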

sudeep-ib · 2021-12-20T04:07:00Z

Agree with all of the above ^^

I would also love to see how the Parca query language can be used to:
[1] provide the ability to write rules that can be used for generating alerts as well.
[2] provide the ability to use the query language via a Grafana plugin, for comparing existing stats (like CPU, memory, latencies) together with selected profile information as time series.

let me know if that does not make sense :)

brancz · 2021-12-20T08:27:19Z

Could you explain what kind of alerting would make sense to you?
Makes perfect sense to me, I think we're just still trying to figure out the query patterns and UX before we start integrations into other systems which just makes maintenance harder.

sudeep-ib · 2021-12-20T09:33:09Z

> I think we're just still trying to figure out the query patterns and UX before we start integrations into other systems which just makes maintenance harder.

Yes @brancz - that makes sense! I was suggesting this as something we can consider in the mid-term as the project matures. There may also be a case here to see how other tools like Grafana might be open to extending in this direction as well to complement Parca's ability to be a great datastore. (profiling can be a great add-on there from their POV too).

I have just started to use Parca here - so take my suggestions with a grain of salt :)

On [1], my thought was that we could look at the ability to measure things like time spent in mutex contention or locks, or ttot (in Python) spent on a function over some cycles. We could use this together with alerts to highlight regressions or some bad state that the code led into. We will have more concrete ideas here as we start using this more!

metalmatze · 2022-11-24T11:11:26Z

@javierhonduco and I just had a conversation about the use case for parca-dev/parca-agent#1001
Essentially it boils down to querying the time percentage spent in a specific function over time.
In the end, a time series showing the percentage over 2 or 4 weeks would show the continued effort in performance improvements.

  sum(parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca", function="debug/elf.Open"}) 
/
  sum(parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca"})


  sum(parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca", function=~"debug/elf.*"}) 
/
  sum(parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca"})


  sum by(rollout) (parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca", function=~"debug/elf.*"}) 
/
  sum by(rollout) (parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca"})
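
For illustration, a Python sketch of the ratio these queries describe: the share of CPU time spent in stacks that include a matching function, computed per point in time. It assumes the `function` matcher means "the stack includes a function whose full name matches the regex", which is an interpretation, and the data layout and helper name are made up.

```python
import re


def fraction_over_time(stacks_by_time, function_regex):
    """Map each timestamp to matched value / total value."""
    pattern = re.compile(function_regex)
    result = {}
    for ts, stacks in sorted(stacks_by_time.items()):
        total = sum(stacks.values())
        matched = sum(
            value
            for stack, value in stacks.items()
            # Fully anchored match, like Prometheus regex matchers.
            if any(pattern.fullmatch(fn) for fn in stack)
        )
        result[ts] = matched / total if total else 0.0
    return result


# Made-up CPU nanoseconds per stack for two points in time.
stacks_by_time = {
    "week1": {("main", "debug/elf.Open"): 30, ("main", "other"): 70},
    "week2": {
        ("main", "debug/elf.Open"): 10,
        ("main", "debug/elf.NewFile"): 10,
        ("main", "other"): 80,
    },
}

series = fraction_over_time(stacks_by_time, r"debug/elf.*")
```

Plotting such a series over 2 or 4 weeks would show whether the share of time spent in the selected functions keeps going down.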

brancz mentioned this issue Dec 1, 2021

Track number of unique stack traces/locations/mappings/functions seen #477

Open

metalmatze pinned this issue Jul 4, 2022

parca-dev deleted a comment from zongruxie73 Aug 11, 2022
