
Support caching for series API #2202

Closed
adityacs wants to merge 2 commits into master from series_api_cache

Conversation

@adityacs (Contributor) commented Jun 9, 2020

What this PR does / why we need it:
Support caching for the series API. This PR also adds a custom cache middleware for Loki that supports caching for the series and label APIs.

Which issue(s) this PR fixes:
Fixes #2168

Checklist

  • Tests updated

@adityacs requested a review from owen-d June 9, 2020 18:00
@adityacs marked this pull request as draft June 9, 2020 18:03
@adityacs (Contributor, Author) commented Jun 9, 2020

@owen-d A small nit. Will fix and ping you for the review

@adityacs force-pushed the series_api_cache branch 2 times, most recently from 11ccda6 to 0cea93d on June 9, 2020 18:24
@adityacs marked this pull request as ready for review June 9, 2020 18:24
Comment on lines 87 to 102
func (r *LokiSeriesRequest) GetStep() int64 {
-	return 0
+	return int64(time.Duration(int(math.Max(math.Floor(r.EndTs.Sub(r.StartTs).Seconds()/250), 1))) * time.Millisecond)
}
@adityacs (Contributor, Author), Jun 9, 2020:

This is required for the cache; returning just 0 would make it fail with a division by zero here: https://github.com/cortexproject/cortex/blob/1c764096b1ea0956005da64c17271ce58b1c9ce4/pkg/querier/queryrange/results_cache.go#L424

I hope this default is fine

@owen-d (Member):

Hrm, I don't really love the idea of setting this here, but it should be fine.

Contributor:

Can you do 1 instead?

@adityacs (Contributor, Author) commented Jun 9, 2020

@owen-d It can be reviewed now

@adityacs adityacs force-pushed the series_api_cache branch 2 times, most recently from b760266 to 8d2ca6c Compare June 10, 2020 03:15
Comment on lines 251 to 259
queryCacheMiddleware, cache, err := queryrange.NewResultsCacheMiddleware(
	log,
	cfg.ResultsCacheConfig,
	cacheKeyLimits{limits},
	limits,
	codec,
	extractor,
	nil,
)
@adityacs (Contributor, Author):

Should we create a separate CacheMiddleware for series, or use a common CacheMiddleware for both series and metrics?

If we create a separate CacheMiddleware, I am not sure how to handle stopping here:

t.stopper.Stop()

@owen-d (Member):

I covered this above, but you can make an adapter type that stops multiple caches.

Contributor:

I'm actually not sure if it will work with the current cacheMiddleware; it's possible.

I would write some tests isolating only the middleware, with a couple of funky use cases, and see what happens.

@adityacs (Contributor, Author), Jun 17, 2020:

Sure, will try to add a few tests.

Contributor:

Well, I have given it some thought, and I don't think you can use the current cache system for series. This is because, within a given response, you can't extract a subset of series, since you don't know which series appears at what time.

This means you need to build your own cache system, but don't be afraid: it's actually much simpler than the cortex query-range one. You can only use a cache entry if the requested time range is wider than what you have in the cache; again, you cannot extract a smaller subset from a cache entry.

Here is how I think it should work (a rough sketch follows the list):

  • for a given time range, check if you have cache entries (you can have multiple).
  • if you don't, just query the API and cache the result, storing the time range with it.
  • if you do, verify that the requested time range is bigger and that the cache entry lies entirely within it (no partial overlap is allowed, because again you can't extract a subset of series).
  • use any usable entries and query only the missing ranges.
  • save all the new and old entries.
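A minimal Go sketch of this lookup flow, assuming cached extents are stored per cache key, sorted by start time, and non-overlapping; every type and function name here is illustrative and not code from this PR:

package seriescache

import "time"

// Extent is a hypothetical cached series response together with the time
// range it was originally fetched for.
type Extent struct {
	Start, End time.Time
	Series     []string // serialized label sets, e.g. `{app="foo"}`
}

// Plan splits a requested [start, end) range into cached extents that can be
// reused and gaps that still have to be queried. An extent is reusable only
// when it lies entirely inside the requested range; a partially overlapping
// extent is skipped, because a subset of series cannot be extracted from a
// wider cached response. Assumes `cached` is sorted by Start and its extents
// do not overlap each other.
func Plan(start, end time.Time, cached []Extent) (reuse []Extent, gaps [][2]time.Time) {
	cursor := start
	for _, e := range cached {
		if e.Start.Before(start) || e.End.After(end) {
			continue // not fully contained in the request, so it cannot be used
		}
		if e.Start.After(cursor) {
			gaps = append(gaps, [2]time.Time{cursor, e.Start}) // uncovered range before this extent
		}
		reuse = append(reuse, e)
		if e.End.After(cursor) {
			cursor = e.End
		}
	}
	if cursor.Before(end) {
		gaps = append(gaps, [2]time.Time{cursor, end}) // trailing uncovered range
	}
	return reuse, gaps
}

The caller would then query the gaps through the downstream handler, union and deduplicate the series from the reused extents and the fresh responses, and store both the old and the new extents back under the same key.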

Contributor:

I should add that this new cache can be used for labels too.

@adityacs (Contributor, Author):

I was trying to add a few tests to verify the series API cache results. It fails in this test:
https://github.com/grafana/loki/blob/438bf456b0d1e4e527f523a22a6f20f0a55711f9/pkg/querier/queryrange/roundtrip_test.go#L325-L327

Build failure: https://cloud.drone.io/grafana/loki/3324/1/2

So, as you and @owen-d mentioned, it won't return correct results for subset time ranges.

First, I thought of just changing GenerateCacheKey to consider request.End as well, with everything else remaining the same. However, this would simply do a full request instead of a subset request for large time ranges, even when subset data is available in the cache.

Anyhow, the steps you mentioned are very clear. Will implement the cache code accordingly.

@owen-d (Member) left a comment:

I'm ambivalent about the series cache extractor disregarding the start and end values. This has some correctness implications which may not be acceptable. For instance, the series endpoint does not include time ranges for when each label set (stream) was seen, only that said stream appeared during this time range. This means that cached results for superset ranges may return streams that don't exist when queried for a subset time range. Without this information, I'm not sure if we can correctly cache it.

@cyriltovena WDYT? Are the performance benefits worth the correctness concerns in your opinion?

)

// SeriesExtractor implements Extractor interface
type SeriesExtractor struct{}
@owen-d (Member):

nit: I'd probably rename this DefaultExtractor as it's reusable and doesn't have any logic tied to series.

@@ -64,7 +64,7 @@ func NewTripperware(
return nil, nil, err
}

-	seriesTripperware, err := NewSeriesTripperware(cfg, log, limits, lokiCodec, instrumentMetrics, retryMetrics, splitByMetrics)
+	seriesTripperware, cache, err := NewSeriesTripperware(cfg, log, limits, lokiCodec, SeriesExtractor{}, instrumentMetrics, retryMetrics, splitByMetrics)
@owen-d (Member):

This is overwriting the previous metrics cache. You'll need to make a type MultiStopper []Stopper type or similar in order to coalesce the multiple caches that are returned.

@adityacs (Contributor, Author):

Sure. Will make the change
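A sketch of the adapter type suggested above, assuming a minimal Stopper-style interface like the one exposed by the caches returned from the tripperware constructors; the names are illustrative and not code from this PR:

package queryrange

// Stopper is the minimal interface assumed here for anything that owns a
// cache and has to be torn down on shutdown (illustrative stand-in).
type Stopper interface {
	Stop()
}

// MultiStopper coalesces several stoppers, e.g. the metrics results cache and
// the series cache, so that a single t.stopper.Stop() call stops all of them.
type MultiStopper []Stopper

// Stop stops every non-nil stopper in the slice.
func (m MultiStopper) Stop() {
	for _, s := range m {
		if s != nil {
			s.Stop()
		}
	}
}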


@pull-request-size bot added size/L and removed size/M labels Jun 13, 2020
pkg/loki/modules.go: outdated review thread (resolved)
@adityacs (Contributor, Author):

Associating a series with a time range would be a good idea. I am not sure how easy it is, because a particular series can be found across multiple time ranges.

@adityacs force-pushed the series_api_cache branch 2 times, most recently from 014a47e to 0b8584d on June 13, 2020 03:49
@owen-d (Member) commented Jun 19, 2020

Going to need to hear @cyriltovena's opinions on this. I'm hesitant to add this as it hampers correctness.

@adityacs force-pushed the series_api_cache branch 2 times, most recently from 438bf45 to aa22fd0 on June 25, 2020 09:42
@adityacs force-pushed the series_api_cache branch 4 times, most recently from cc49d17 to bd59fa0 on June 25, 2020 10:29
@adityacs requested review from cyriltovena and owen-d June 25, 2020 10:39
@codecov-commenter:

Codecov Report

Merging #2202 into master will increase coverage by 0.22%.
The diff coverage is 69.81%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2202      +/-   ##
==========================================
+ Coverage   62.14%   62.36%   +0.22%     
==========================================
  Files         154      155       +1     
  Lines       12457    12669     +212     
==========================================
+ Hits         7741     7901     +160     
- Misses       4108     4137      +29     
- Partials      608      631      +23     
Impacted Files Coverage Δ
pkg/loki/loki.go 0.00% <ø> (ø)
pkg/loki/modules.go 8.87% <0.00%> (-0.13%) ⬇️
pkg/querier/queryrange/split_by_interval.go 87.59% <ø> (ø)
pkg/querier/queryrange/cache.go 69.47% <69.47%> (ø)
pkg/querier/queryrange/roundtrip.go 71.50% <91.66%> (+1.88%) ⬆️
pkg/querier/queryrange/codec.go 93.20% <100.00%> (+2.44%) ⬆️
pkg/querier/queryrange/downstreamer.go 95.87% <0.00%> (-2.07%) ⬇️
pkg/promtail/targets/tailer.go 78.40% <0.00%> (+4.54%) ⬆️

// check if cache freshness value is provided in legacy config
maxCacheFreshness := s.cfg.LegacyMaxCacheFreshness
if maxCacheFreshness == time.Duration(0) {
maxCacheFreshness = s.limits.MaxCacheFreshness(userID)
@owen-d (Member):

I think we want the user limits to override the configuration parameter. I think you can extend our WithDefaultLimits implementation to support LegacyMaxCacheFreshness as well:
https://github.com/grafana/loki/blob/master/pkg/querier/queryrange/limits.go#L30-L45
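A hedged sketch of the precedence being suggested, where the per-tenant limit wins and the legacy config value acts only as a default; the wrapper pattern and names below are illustrative, not the actual WithDefaultLimits code:

package queryrange

import "time"

// Limits is the subset of the per-tenant limits interface relevant here
// (illustrative; the real interface is larger).
type Limits interface {
	MaxCacheFreshness(userID string) time.Duration
}

// limitsWithDefaults wraps per-tenant limits and falls back to the legacy
// config value only when no per-tenant override is set, so that user limits
// take precedence over the configuration parameter.
type limitsWithDefaults struct {
	Limits
	legacyMaxCacheFreshness time.Duration
}

func (l limitsWithDefaults) MaxCacheFreshness(userID string) time.Duration {
	if v := l.Limits.MaxCacheFreshness(userID); v != 0 {
		return v
	}
	return l.legacyMaxCacheFreshness
}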

"github.com/weaveworks/common/user"
)

type lokiResultsCache struct {
Contributor:

We might add a log results cache in the future, so I'd say we should name this one metadata cache?

@adityacs (Contributor, Author):

👍

}

// Refreshes the cache if it contains an old proto schema.
for _, e := range resp.Extents {
Contributor:

You don't need this.

@owen-d (Member) left a comment:

I'm not sure we can merge this due to the correctness issues, and I think it will require a lot of additional work: we'll need to make the metadata endpoints time-aware, in the sense that they return the time ranges that each series/label is "alive" for. After adding that, we'll need to thread those timestamps through to the cache, as well as try to keep our implementation Prometheus-compatible, even though the Prometheus API doesn't include these timestamps. I think one could make an argument for this correctness issue being "close enough", but that's not where I stand, and I haven't heard anyone voicing this opinion.

Also, please don't force push unless necessary. It ends up masking what has changed since the last version of the PR, making it difficult to give incremental reviews.

cfg queryrange.ResultsCacheConfig
next queryrange.Handler
cache cache.Cache
limits Limits
@owen-d (Member):

This should probably be the queryrange.Limits from cortex -- it doesn't look like you're using any of the superset methods from the loki Limits interface.

@adityacs (Contributor, Author):

The GenerateCacheKey method of the Limits interface is used:

func (l cacheKeyLimits) GenerateCacheKey(userID string, r queryrange.Request) string {

// The main idea behind implementing a custom cache middleware is that, with cortex's cache
// middleware code, we can't correctly cache "series" and "label" responses, since they don't
// have a time range associated with them. This custom code addresses that limitation.
func NewLokiResultsCacheMiddleware(
@owen-d (Member):

I'm having difficulty finding where this differs from the cortex version -- can you help me out? Also, would it be worth exposing the relevant parts in cortex rather than copy/pasting them here?

@adityacs (Contributor, Author):

Most of the logic is just copy/paste; the difference is in the partition logic. In this implementation of the cache, we consider an extent only if it is entirely within the request range; we don't consider any overlap. For any other request time range that doesn't match this condition, we make a new request.

if start <= extent.GetStart() && req.GetEnd() >= extent.GetEnd() {

Whereas cortex just ignores the extents that don't overlap:
https://github.com/cortexproject/cortex/blob/d16b68152befaabe7c65f501371f20470d0a7bb4/pkg/querier/queryrange/results_cache.go#L397

Agreed that we can expose the required methods from Cortex.
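A small sketch contrasting the two extent-selection rules described in this thread; the types below are illustrative stand-ins for the cortex Extent and Request values, not code from this PR:

package queryrange

// extent is a cached piece of a response covering [start, end] in Unix
// milliseconds (illustrative stand-in for the cortex Extent type).
type extent struct {
	start, end int64
}

// usableForSeries mirrors the condition quoted above: a cached extent can
// serve a series/label request only if it lies entirely within the requested
// range, because a smaller subset of series cannot be extracted from a wider
// cached response.
func usableForSeries(reqStart, reqEnd int64, e extent) bool {
	return reqStart <= e.start && reqEnd >= e.end
}

// usableForMetrics reflects what the cortex results cache does for range
// queries: any overlapping extent is usable, since the relevant samples can be
// re-extracted by timestamp.
func usableForMetrics(reqStart, reqEnd int64, e extent) bool {
	return e.end > reqStart && e.start < reqEnd
}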

@adityacs (Contributor, Author) commented Jul 3, 2020

@cyriltovena @owen-d What's the take on this? Should we make changes in Cortex and use them in Loki, or should we have our own implementation in Loki?

@cyriltovena (Contributor):

> @cyriltovena @owen-d What's the take on this? Should we make changes in Cortex and use them in Loki, or should we have our own implementation in Loki?

I'm OK with starting with a copy of the code and then attempting to retrofit it into Cortex. I just haven't had the time yet to give it a good review.

@owen-d (Member) commented Jul 29, 2020

I'm unsure about what to do yet. I think we may want to defer this decision for a while and see how much benefit we can gain just from parallelization (without caching). We're stuck between two unfortunate alternatives: breaking prometheus api compatibility and not having caching here.

@adityacs (Contributor, Author):

@owen-d I will mark this as keepalive for now. Later, when we get a chance, we can revisit this.

@adityacs added the keepalive label (an issue or PR that will be kept alive and never marked as stale) Jul 31, 2020
@owen-d (Member) commented Jul 31, 2020

Thank you :)


@CLAassistant commented Apr 20, 2021

CLA assistant check
All committers have signed the CLA.

@kavirajk (Contributor):

Closing this as there has been no activity for a long time. Feel free to send a new PR if anyone wants to revive the work.

@kavirajk closed this Mar 18, 2022
Labels: keepalive (an issue or PR that will be kept alive and never marked as stale), size/XL
Projects: none yet
Successfully merging this pull request may close these issues: Series API should be cacheable
6 participants