feat: Introduce shardable probabilistic topk for instant queries. #14243

jeschkies · 2024-09-24T11:18:08Z

What this PR does / why we need it:
This change introduces a very simplified shardable topk approximation through the new vector aggregation approx_topk.

We use a count-min sketch and track all labels, not just the top k. Since this list can grow quite large, the feature is only supported for instant queries. Grouping is also not supported and should be handled by an inner sum by or sum without, even though this might not behave the same as topk by.

The sharding works by turning the approx_topk(k, inner) query into the following expression:

topk(
  k,
  eval_cms(
    __count_min_sketch__(inner, shard=1) ++ __count_min_sketch__(inner, shard=2)...
  )
)

__count_min_sketch__ is calculated for each shard and merged on the frontend. eval_cms iterates through the list of labels and determines the count for each. topk then selects the top items.
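The merge-then-evaluate flow described above can be illustrated with a toy count-min sketch. This is a minimal sketch and not the code in this PR: the type `sketch`, its methods, the FNV-based row hashing, and the helper `topK` are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// sketch is a toy count-min sketch (hypothetical, not the PR's type).
type sketch struct {
	depth, width int
	counters     [][]float64
}

func newSketch(depth, width int) *sketch {
	c := make([][]float64, depth)
	for i := range c {
		c[i] = make([]float64, width)
	}
	return &sketch{depth: depth, width: width, counters: c}
}

// index hashes key together with the row number so each row uses a different hash.
func (s *sketch) index(row int, key string) int {
	h := fnv.New64a()
	h.Write([]byte{byte(row)})
	h.Write([]byte(key))
	return int(h.Sum64() % uint64(s.width))
}

// add increments one counter per row for key.
func (s *sketch) add(key string, v float64) {
	for r := 0; r < s.depth; r++ {
		s.counters[r][s.index(r, key)] += v
	}
}

// estimate returns the minimum counter across rows.
// For non-negative values it never undercounts.
func (s *sketch) estimate(key string) float64 {
	est := s.counters[0][s.index(0, key)]
	for r := 1; r < s.depth; r++ {
		if c := s.counters[r][s.index(r, key)]; c < est {
			est = c
		}
	}
	return est
}

// merge adds other's counters cell by cell.
// Both sketches must share dimensions and hash functions.
func (s *sketch) merge(other *sketch) {
	for r := range s.counters {
		for c := range s.counters[r] {
			s.counters[r][c] += other.counters[r][c]
		}
	}
}

// topK looks up every tracked label in the merged sketch and keeps the k largest.
func topK(s *sketch, labels []string, k int) []string {
	sorted := append([]string(nil), labels...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return s.estimate(sorted[i]) > s.estimate(sorted[j])
	})
	if k < len(sorted) {
		sorted = sorted[:k]
	}
	return sorted
}

func main() {
	// Two per-shard sketches, as if produced by two sharded sub-queries.
	shard1, shard2 := newSketch(4, 64), newSketch(4, 64)
	shard1.add("a", 5)
	shard1.add("b", 1)
	shard2.add("a", 3)
	shard2.add("c", 2)

	shard1.merge(shard2) // the "++" step on the frontend
	fmt.Println(topK(shard1, []string{"a", "b", "c"}, 2))
}
```

Because merging is a plain cell-wise addition, the per-shard results can be combined in any order, which is what makes the aggregation shardable.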

The number of labels tracked on the querier side when evaluating __count_min_sketch__ is limited by the heap. The sketch still counts all values, but the key-value pairs for some of them might not be known.

Special notes for your reviewer:

Checklist

Reviewed the CONTRIBUTING.md guide (required)
Documentation added
Tests updated
Title matches the required conventional commits format, see here
- Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

jeschkies · 2024-09-24T13:51:29Z

pkg/logql/sketch/cms.go

+	width := math.Ceil(math.E / epsilon)
+	depth := math.Ceil(math.Log(1.0 / delta))


See https://cs.stackexchange.com/questions/166937/what-is-the-proper-way-to-calculate-dimensions-of-count-min-sketch

this is the calculation I used when doing all of our experiments, but I can't remember if I got good results using it

I figured we have to stay here because this makes the matrix roughly the size of our max 10k vector.
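For reference, a small worked example of the dimension formula from the diff. The function name `cmsDimensions` and the values chosen for epsilon and delta are assumptions for illustration only; they are not taken from this PR.

```go
package main

import (
	"fmt"
	"math"
)

// cmsDimensions applies the formula from the diff: width = ceil(e / epsilon),
// depth = ceil(ln(1 / delta)). With these dimensions the standard count-min
// guarantee is an overestimate of at most epsilon times the total count,
// with probability at least 1 - delta.
func cmsDimensions(epsilon, delta float64) (width, depth int) {
	width = int(math.Ceil(math.E / epsilon))
	depth = int(math.Ceil(math.Log(1.0 / delta)))
	return width, depth
}

func main() {
	// Assumed example parameters, not the PR's configuration.
	w, d := cmsDimensions(0.001, 0.01)
	fmt.Printf("width=%d depth=%d cells=%d\n", w, d, w*d)
}
```

With these assumed parameters the matrix has 2719 columns and 5 rows, which lands in the same order of magnitude as a 10k-element vector.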

cstyan

@jeschkies what's the point of using the sketch if we're just going to send all the labels of everything we've observed in each sharded query over the wire?

If we're doing that then we might as well just have a list of tuples, labelset + count, and remove the sketch. Or use the map approach. We'd be trading space for not having to search into the slice with the map 🤷‍♂️
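The exact alternative mentioned here, a map of label set to count merged across shards, could look roughly like the following. This is a sketch under assumptions: `entry`, `mergeExact`, and `topKExact` are hypothetical names, and label sets are represented as plain strings.

```go
package main

import (
	"fmt"
	"sort"
)

// entry pairs a label set (as a string) with its exact count.
type entry struct {
	labels string
	count  float64
}

// mergeExact sums the per-shard counts for each label set.
func mergeExact(shards ...map[string]float64) map[string]float64 {
	merged := make(map[string]float64)
	for _, shard := range shards {
		for labels, count := range shard {
			merged[labels] += count
		}
	}
	return merged
}

// topKExact returns the k entries with the largest counts,
// breaking ties by label string so the result is deterministic.
func topKExact(counts map[string]float64, k int) []entry {
	entries := make([]entry, 0, len(counts))
	for labels, count := range counts {
		entries = append(entries, entry{labels: labels, count: count})
	}
	sort.Slice(entries, func(i, j int) bool {
		if entries[i].count != entries[j].count {
			return entries[i].count > entries[j].count
		}
		return entries[i].labels < entries[j].labels
	})
	if k < len(entries) {
		entries = entries[:k]
	}
	return entries
}

func main() {
	merged := mergeExact(
		map[string]float64{"a": 5, "b": 1},
		map[string]float64{"a": 3, "c": 2},
	)
	fmt.Println(topKExact(merged, 2))
}
```

The trade-off raised in the comment is visible here: the map gives exact counts and constant-time lookups, but its size grows with the number of distinct label sets on every shard.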

cstyan · 2024-09-25T22:38:42Z

pkg/logql/sketch/cms.go

+	width := math.Ceil(math.E / epsilon)
+	depth := math.Ceil(math.Log(1.0 / delta))


this is the calculation I used when doing all of our experiments, but I can't remember if I got good results using it

jeschkies · 2024-09-26T08:26:57Z

If we're doing that then we might as well just have a list of tuples, labelset + count, and remove the sketch.

@cstyan you are right. My rationale was that we could first validate whether the team and our users accept instant queries with a special probabilistic keyword and see what the implementation looks like, then worry about the algorithm. As for the labels, I'd keep it super simple and use a min-heap with max 10k labels. As Ed explained, the most prominent use case is finding the top outliers in a high-cardinality set.

cstyan

This is a bit hard to review with all the other unintended commits, but I think we can proceed with this implementation. I'm not entirely following how the labels tracking and heap are wired up ATM

Also, we should either clean up and reuse or completely remove the sketch/topk.go file as part of this PR.

cstyan · 2024-10-01T22:37:34Z

pkg/logql/count_min_sketch.go

+
+	// Add our metric if we haven't seen it
+	if _, ok := v.observed[metricString]; !ok {
+		heap.Push(v, metric)


where does the heap actually exist?

HeapCountMinSketchVector embeds a CountMinSketchVector and treats the metrics array as a heap. We should look into https://grafana.com/blog/2024/04/23/the-loser-tree-data-structure-how-to-optimize-merges-and-make-your-programs-run-faster/ actually.
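As an illustration of treating a slice as a heap and capping the number of tracked labels, here is a minimal sketch built on the standard library's container/heap. It is not the HeapCountMinSketchVector from this PR: `item`, `minHeap`, `bounded`, and `offer` are hypothetical names, and the eviction policy (drop the smallest count when full) is an assumption.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is a tracked label set with the count seen when it was added.
type item struct {
	labels string
	count  float64
}

// minHeap implements heap.Interface on a slice, ordered by ascending count.
type minHeap []item

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].count < h[j].count }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(item)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	it := old[n-1]
	*h = old[:n-1]
	return it
}

// bounded tracks at most max label sets and remembers which ones it holds.
type bounded struct {
	max      int
	h        minHeap
	observed map[string]struct{}
}

func newBounded(max int) *bounded {
	return &bounded{max: max, observed: make(map[string]struct{})}
}

// offer adds labels if they are not tracked yet. When the heap is full,
// the entry with the smallest count is evicted, but only if the new
// count is larger. Counts of already tracked labels are not updated here.
func (b *bounded) offer(labels string, count float64) {
	if _, ok := b.observed[labels]; ok {
		return
	}
	if b.h.Len() >= b.max {
		if count <= b.h[0].count {
			return
		}
		evicted := heap.Pop(&b.h).(item)
		delete(b.observed, evicted.labels)
	}
	heap.Push(&b.h, item{labels: labels, count: count})
	b.observed[labels] = struct{}{}
}

func main() {
	b := newBounded(2)
	b.offer("a", 5)
	b.offer("b", 1)
	b.offer("c", 3) // evicts "b", the current minimum
	fmt.Println(b.h.Len(), b.h[0].labels)
}
```

The root of the min-heap is always the cheapest entry to evict, so the label list stays bounded without sorting on every insert.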

cstyan · 2024-10-07T20:34:14Z

pkg/logql/downstream_test.go

+		{
+			labelShards:   10, // increasing this will make the test too slow
+			totalStreams:  1_000_000,
+			shardedQuery:  `approx_topk(100, sum by (a) (sum_over_time ({a=~".+"} | logfmt | unwrap value [1s])))`,
+			regularQuery:  `topk(100, sum by (a) (sum_over_time ({a=~".+"} | logfmt | unwrap value [1s])))`,
+			realtiveError: 0.0015,
+		},


I think this is one of the things that is making some of these test results seem better than they would be in reality. Here you've got labelShards: 10, which means there will only be 10 unique values for the label a, but the query is approx_topk(100, ...).

You are right. So this test is actually no good. What we need are 1 million unique values...
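One way such an accuracy check could be phrased is sketched below, assuming the comparison is made per item between the approximate and the exact result. The helpers `relativeError` and `overlap` are hypothetical and are not the ones used in downstream_test.go.

```go
package main

import (
	"fmt"
	"math"
)

// relativeError returns |approx - exact| / |exact|.
// For exact == 0 it falls back to the absolute error.
func relativeError(approx, exact float64) float64 {
	if exact == 0 {
		return math.Abs(approx)
	}
	return math.Abs(approx-exact) / math.Abs(exact)
}

// overlap returns the fraction of exact top-k labels that also
// appear in the approximate top-k result.
func overlap(approx, exact []string) float64 {
	if len(exact) == 0 {
		return 1
	}
	seen := make(map[string]struct{}, len(approx))
	for _, l := range approx {
		seen[l] = struct{}{}
	}
	hits := 0
	for _, l := range exact {
		if _, ok := seen[l]; ok {
			hits++
		}
	}
	return float64(hits) / float64(len(exact))
}

func main() {
	fmt.Println(relativeError(101, 100))
	fmt.Println(overlap([]string{"a", "b"}, []string{"a", "c"}))
}
```

A check like this is only meaningful when the number of distinct label values is much larger than k, which is the point raised in the review comment above.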

…4243) Signed-off-by: Callum Styan <[email protected]> Co-authored-by: Callum Styan <[email protected]> (cherry picked from commit 7b53f20)

jeschkies requested review from cstyan and slim-bean September 24, 2024 11:18

pull-request-size bot added the size/XL label Sep 24, 2024

jeschkies commented Sep 24, 2024

cstyan reviewed Sep 25, 2024

jeschkies requested a review from cstyan September 30, 2024 14:34

github-actions bot added sig/operator area/helm type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories labels Oct 1, 2024

pull-request-size bot added size/XXL and removed size/XL labels Oct 1, 2024

cstyan reviewed Oct 1, 2024

jeschkies added 14 commits October 2, 2024 07:51

feat: shard instant topk queries with count min sketch.

c917201

Make it compile

66dc3f0

Update range

215754c

Fix a few testing bugs

436302f

Calculate width and depth

6e53380

Format code

91133e7

Remove redundant return

9a07276

Keep a limited number of labels

6dab358

Save one metric to string

7849984

Convert limited label vector

c9726cc

Rename vector

947a924

Remove todo

33acd3b

Checkin heap test

9ac2030

Fix vector name

acb7754

jeschkies force-pushed the karsten/ptopk branch from 0a04574 to acb7754 Compare October 2, 2024 05:51

pull-request-size bot added size/XL and removed size/XXL labels Oct 2, 2024

jeschkies requested a review from cstyan October 3, 2024 12:21

jeschkies added 3 commits October 3, 2024 16:08

Satisfy linter

ac494bd

Define separate approx topk equivalence test

43cf741

Correct error string

ecc9050

periklis removed the sig/operator label Oct 4, 2024

cstyan and others added 5 commits October 4, 2024 15:01

Add more test cases to show how accurate the result for each item in the topk actually is. Signed-off-by: Callum Styan <[email protected]>

24edc7b

Add another test case

681cd2e

Remove deadline

de7d83c

No timeout limit

2da4270

Test intersectiion

00a7e1f

trevorwhitney mentioned this pull request Oct 7, 2024

feat: aggregated metric volume queries #14412

Closed

6 tasks

cstyan reviewed Oct 7, 2024

jeschkies and others added 10 commits October 8, 2024 11:04

Remove test

524a10f

add hll cardinality estimate to CMS sketch and proto

026b573

Signed-off-by: Callum Styan <[email protected]>

Format imports

b122237

Merge branch 'main' into karsten/ptopk

782a148

include cardinality estimate in metrics.go logging and change CMS counters to float64 so we can get float results in queries Signed-off-by: Callum Styan <[email protected]>

0b015bf

Merge branch 'main' into karsten/ptopk

149f990

Fix types in tests

09c3fe8

Fix heap test

b8d1cbb

Merge branch 'main' into karsten/ptopk

a8ccd9f

Update docs

76cf279

jeschkies requested a review from cstyan November 4, 2024 14:58

cstyan approved these changes Nov 4, 2024

cstyan merged commit 7b53f20 into grafana:main Nov 4, 2024
60 checks passed

cstyan added the backport k227 label Nov 4, 2024

loki-gh-app bot mentioned this pull request Nov 4, 2024

feat: Introduce shardable probabilistic topk for instant queries. (backport k227) #14765

Merged

7 tasks

jeschkies deleted the karsten/ptopk branch November 5, 2024 10:43

jeschkies mentioned this pull request Nov 28, 2024

chore: Document approx_topk keyword. #15179

Open

6 tasks
