
Support caching for series API #2202

Closed
adityacs wants to merge 2 commits into master from series_api_cache

Conversation

@adityacs (Contributor) commented Jun 9, 2020

What this PR does / why we need it:
Support caching for the series API. This PR also adds a custom cache middleware for Loki that supports caching for the series and label APIs.

Which issue(s) this PR fixes:
Fixes #2168

Checklist

  • Tests updated

@adityacs requested a review from owen-d June 9, 2020 18:00
@adityacs marked this pull request as draft June 9, 2020 18:03
@adityacs (Contributor, Author) commented Jun 9, 2020

@owen-d A small nit. Will fix and ping you for the review

@adityacs force-pushed the series_api_cache branch 2 times, most recently from 11ccda6 to 0cea93d on June 9, 2020 18:24
@adityacs marked this pull request as ready for review June 9, 2020 18:24
Comment on lines 87 to 102
func (r *LokiSeriesRequest) GetStep() int64 {
-	return 0
+	return int64(time.Duration(int(math.Max(math.Floor(r.EndTs.Sub(r.StartTs).Seconds()/250), 1))) * time.Millisecond)
}
@adityacs (Contributor, Author), Jun 9, 2020:

This is required for the cache; returning just 0 would make it fail with a division by zero here: https://github.com/cortexproject/cortex/blob/1c764096b1ea0956005da64c17271ce58b1c9ce4/pkg/querier/queryrange/results_cache.go#L424

I hope this default is fine

@owen-d (Member):

Hrm, I don't really love the idea of setting this here, but it should be fine.

Contributor:

Can you do 1 instead?

@adityacs (Contributor, Author) commented Jun 9, 2020

@owen-d It can be reviewed now

@adityacs adityacs force-pushed the series_api_cache branch 2 times, most recently from b760266 to 8d2ca6c Compare June 10, 2020 03:15
Comment on lines 251 to 259
queryCacheMiddleware, cache, err := queryrange.NewResultsCacheMiddleware(
	log,
	cfg.ResultsCacheConfig,
	cacheKeyLimits{limits},
	limits,
	codec,
	extractor,
	nil,
)
@adityacs (Contributor, Author):

Should we create a separate CacheMiddleware for series, or use a common CacheMiddleware for both series and metrics?

If we create a separate CacheMiddleware, I am not sure how to handle stopping here:

t.stopper.Stop()

@owen-d (Member):

I covered this above, but you can make an adapter type that stops multiple caches.

Contributor:

I'm actually not sure if it will work with the current cacheMiddleware; it's possible.

I would write some tests isolating only the middleware, with a couple of funky use cases, and see what happens.

@adityacs (Contributor, Author), Jun 17, 2020:

Sure, will try to add a few tests.

Contributor:

Well, I have given it some thought, and I don't think you can use the current cache system for series. This is because, within a given response, you can't extract a subset of series, since you don't know which series appears at what time.

This means you need to build your own cache system, but don't be afraid: it's actually much simpler than the cortex query-range one. You can only use a cache entry if the requested time range is wider than what you have in the cache; again, you cannot extract a smaller subset from a cache entry.

Here is how I think it should work (a rough sketch follows the list):

  • for a given time range, check if you have cache entries (you can have multiple).
  • if you don't, just query the API and cache the result, storing the time range with it.
  • if you do, verify that the requested time range is bigger and that the cache entry lies entirely within it (no partial overlap is allowed, because again you can't extract a subset of series).
  • use any usable entries and query only the missing ranges.
  • save all the new and old entries.
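A minimal Go sketch of this lookup flow, assuming cached extents are stored per cache key, sorted by start time, and non-overlapping; every type and function name here is illustrative and not code from this PR:

package seriescache

import "time"

// Extent is a hypothetical cached series response together with the time
// range it was originally fetched for.
type Extent struct {
	Start, End time.Time
	Series     []string // serialized label sets, e.g. `{app="foo"}`
}

// Plan splits a requested [start, end) range into cached extents that can be
// reused and gaps that still have to be queried. An extent is reusable only
// when it lies entirely inside the requested range; a partially overlapping
// extent is skipped, because a subset of series cannot be extracted from a
// wider cached response. Assumes `cached` is sorted by Start and its extents
// do not overlap each other.
func Plan(start, end time.Time, cached []Extent) (reuse []Extent, gaps [][2]time.Time) {
	cursor := start
	for _, e := range cached {
		if e.Start.Before(start) || e.End.After(end) {
			continue // not fully contained in the request, so it cannot be used
		}
		if e.Start.After(cursor) {
			gaps = append(gaps, [2]time.Time{cursor, e.Start}) // uncovered range before this extent
		}
		reuse = append(reuse, e)
		if e.End.After(cursor) {
			cursor = e.End
		}
	}
	if cursor.Before(end) {
		gaps = append(gaps, [2]time.Time{cursor, end}) // trailing uncovered range
	}
	return reuse, gaps
}

The caller would then query the gaps through the downstream handler, union and deduplicate the series from the reused extents and the fresh responses, and store both the old and the new extents back under the same key.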

Contributor:

I should add that this new cache can be used for labels too.

@adityacs (Contributor, Author):

I was trying to add a few tests to verify the series API cache results. It fails in this test:
https://github.com/grafana/loki/blob/438bf456b0d1e4e527f523a22a6f20f0a55711f9/pkg/querier/queryrange/roundtrip_test.go#L325-L327

Build failure: https://cloud.drone.io/grafana/loki/3324/1/2

So, as you and @owen-d mentioned, it won't return correct results for subset time ranges.

First, I thought of just changing GenerateCacheKey to consider request.End as well, with everything else remaining the same. However, this would simply do a full request instead of a subset request for large time ranges, even when subset data is available in the cache.

Anyhow, the steps you mentioned are very clear. Will implement the cache code accordingly.

@owen-d (Member) left a comment:

I'm ambivalent about the series cache extractor disregarding the start and end values. This has some correctness implications which may not be acceptable. For instance, the series endpoint does not include time ranges for when each label set (stream) was seen, only that said stream appeared during this time range. This means that cached results for superset ranges may return streams that don't exist when queried for a subset time range. Without this information, I'm not sure if we can correctly cache it.

@cyriltovena WDYT? Are the performance benefits worth the correctness concerns in your opinion?

)

// SeriesExtractor implements Extractor interface
type SeriesExtractor struct{}
@owen-d (Member):

nit: I'd probably rename this DefaultExtractor as it's reusable and doesn't have any logic tied to series.

@@ -64,7 +64,7 @@ func NewTripperware(
return nil, nil, err
}

-	seriesTripperware, err := NewSeriesTripperware(cfg, log, limits, lokiCodec, instrumentMetrics, retryMetrics, splitByMetrics)
+	seriesTripperware, cache, err := NewSeriesTripperware(cfg, log, limits, lokiCodec, SeriesExtractor{}, instrumentMetrics, retryMetrics, splitByMetrics)
@owen-d (Member):

This is overwriting the previous metrics cache. You'll need to make a type MultiStopper []Stopper type or similar in order to coalesce the multiple caches that are returned.

@adityacs (Contributor, Author):

Sure. Will make the change
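A sketch of the adapter type suggested above, assuming a minimal Stopper-style interface like the one exposed by the caches returned from the tripperware constructors; the names are illustrative and not code from this PR:

package queryrange

// Stopper is the minimal interface assumed here for anything that owns a
// cache and has to be torn down on shutdown (illustrative stand-in).
type Stopper interface {
	Stop()
}

// MultiStopper coalesces several stoppers, e.g. the metrics results cache and
// the series cache, so that a single t.stopper.Stop() call stops all of them.
type MultiStopper []Stopper

// Stop stops every non-nil stopper in the slice.
func (m MultiStopper) Stop() {
	for _, s := range m {
		if s != nil {
			s.Stop()
		}
	}
}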


@pull-request-size bot added size/L and removed size/M labels Jun 13, 2020
pkg/loki/modules.go: outdated review thread (resolved)
@adityacs (Contributor, Author):

Associating a series with a time range would be a good idea. I am not sure how easy it is, because a particular series can be found across multiple time ranges.

@adityacs force-pushed the series_api_cache branch 2 times, most recently from 014a47e to 0b8584d on June 13, 2020 03:49
@owen-d (Member) commented Jun 19, 2020

Going to need to hear @cyriltovena's opinions on this. I'm hesitant to add this as it hampers correctness.

@adityacs force-pushed the series_api_cache branch 2 times, most recently from 438bf45 to aa22fd0 on June 25, 2020 09:42
@adityacs force-pushed the series_api_cache branch 4 times, most recently from cc49d17 to bd59fa0 on June 25, 2020 10:29
@adityacs requested review from cyriltovena and owen-d June 25, 2020 10:39
@codecov-commenter:

Codecov Report

Merging #2202 into master will increase coverage by 0.22%.
The diff coverage is 69.81%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2202      +/-   ##
==========================================
+ Coverage   62.14%   62.36%   +0.22%     
==========================================
  Files         154      155       +1     
  Lines       12457    12669     +212     
==========================================
+ Hits         7741     7901     +160     
- Misses       4108     4137      +29     
- Partials      608      631      +23     
Impacted Files Coverage Δ
pkg/loki/loki.go 0.00% <ø> (ø)
pkg/loki/modules.go 8.87% <0.00%> (-0.13%) ⬇️
pkg/querier/queryrange/split_by_interval.go 87.59% <ø> (ø)
pkg/querier/queryrange/cache.go 69.47% <69.47%> (ø)
pkg/querier/queryrange/roundtrip.go 71.50% <91.66%> (+1.88%) ⬆️
pkg/querier/queryrange/codec.go 93.20% <100.00%> (+2.44%) ⬆️
pkg/querier/queryrange/downstreamer.go 95.87% <0.00%> (-2.07%) ⬇️
pkg/promtail/targets/tailer.go 78.40% <0.00%> (+4.54%) ⬆️

// check if cache freshness value is provided in legacy config
maxCacheFreshness := s.cfg.LegacyMaxCacheFreshness
if maxCacheFreshness == time.Duration(0) {
maxCacheFreshness = s.limits.MaxCacheFreshness(userID)
@owen-d (Member):

I think we want the user limits to override the configuration parameter. I think you can extend our WithDefaultLimits implementation to support LegacyMaxCacheFreshness as well:
https://github.com/grafana/loki/blob/master/pkg/querier/queryrange/limits.go#L30-L45
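A hedged sketch of the precedence being suggested, where the per-tenant limit wins and the legacy config value acts only as a default; the wrapper pattern and names below are illustrative, not the actual WithDefaultLimits code:

package queryrange

import "time"

// Limits is the subset of the per-tenant limits interface relevant here
// (illustrative; the real interface is larger).
type Limits interface {
	MaxCacheFreshness(userID string) time.Duration
}

// limitsWithDefaults wraps per-tenant limits and falls back to the legacy
// config value only when no per-tenant override is set, so that user limits
// take precedence over the configuration parameter.
type limitsWithDefaults struct {
	Limits
	legacyMaxCacheFreshness time.Duration
}

func (l limitsWithDefaults) MaxCacheFreshness(userID string) time.Duration {
	if v := l.Limits.MaxCacheFreshness(userID); v != 0 {
		return v
	}
	return l.legacyMaxCacheFreshness
}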

"github.com/weaveworks/common/user"
)

type lokiResultsCache struct {
Contributor:

We might add a log results cache in the future, so I'd say we should name this one metadata cache?

@adityacs (Contributor, Author):

👍

}

// Refreshes the cache if it contains an old proto schema.
for _, e := range resp.Extents {
Contributor:

You don't need this.

@owen-d (Member) left a comment:

I'm not sure we can merge this due to the correctness issues, and I think it will require a lot of additional work: we'll need to make the metadata endpoints time-aware, in the sense that they return the time ranges that each series/label is "alive" for. After adding that, we'll need to thread those timestamps through to the cache, as well as try to keep our implementation Prometheus-compatible, even though the Prometheus API doesn't include these timestamps. I think one could make an argument for this correctness issue being "close enough", but that's not where I stand, and I haven't heard anyone voicing this opinion.

Also, please don't force push unless necessary. It ends up masking what has changed since the last version of the PR, making it difficult to give incremental reviews.

cfg queryrange.ResultsCacheConfig
next queryrange.Handler
cache cache.Cache
limits Limits
@owen-d (Member):

This should probably be the queryrange.Limits from cortex -- it doesn't look like you're using any of the superset methods from the loki Limits interface.

@adityacs (Contributor, Author):

The GenerateCacheKey method of the Limits interface is used:

func (l cacheKeyLimits) GenerateCacheKey(userID string, r queryrange.Request) string {

// The main idea behind implementing a custom cache middleware is that, with cortex's cache
// middleware code, we can't correctly cache "series" and "label" responses, since they don't
// have a time range associated with them. This custom code addresses that limitation.
func NewLokiResultsCacheMiddleware(
@owen-d (Member):

I'm having difficulty finding where this differs from the cortex version -- can you help me out? Also, would it be worth exposing the relevant parts in cortex rather than copy/pasting them here?

@adityacs (Contributor, Author):

Most of the logic is just copy/paste; the difference is in the partition logic. In this implementation of the cache, we consider an extent only if it is entirely within the request range; we don't consider any overlap. For any other request time range that doesn't match this condition, we make a new request.

if start <= extent.GetStart() && req.GetEnd() >= extent.GetEnd() {

Whereas cortex just ignores the extents that don't overlap:
https://github.com/cortexproject/cortex/blob/d16b68152befaabe7c65f501371f20470d0a7bb4/pkg/querier/queryrange/results_cache.go#L397

Agreed that we can expose the required methods from Cortex.
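A small sketch contrasting the two extent-selection rules described in this thread; the types below are illustrative stand-ins for the cortex Extent and Request values, not code from this PR:

package queryrange

// extent is a cached piece of a response covering [start, end] in Unix
// milliseconds (illustrative stand-in for the cortex Extent type).
type extent struct {
	start, end int64
}

// usableForSeries mirrors the condition quoted above: a cached extent can
// serve a series/label request only if it lies entirely within the requested
// range, because a smaller subset of series cannot be extracted from a wider
// cached response.
func usableForSeries(reqStart, reqEnd int64, e extent) bool {
	return reqStart <= e.start && reqEnd >= e.end
}

// usableForMetrics reflects what the cortex results cache does for range
// queries: any overlapping extent is usable, since the relevant samples can be
// re-extracted by timestamp.
func usableForMetrics(reqStart, reqEnd int64, e extent) bool {
	return e.end > reqStart && e.start < reqEnd
}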

@adityacs (Contributor, Author) commented Jul 3, 2020

@cyriltovena @owen-d What's the take on this? Should we make changes in Cortex and use them in Loki, or should we have our own implementation in Loki?

@cyriltovena (Contributor):

> @cyriltovena @owen-d What's the take on this? Should we make changes in Cortex and use them in Loki, or should we have our own implementation in Loki?

I'm OK with starting with a copy of the code and then attempting to retrofit it into Cortex. I just haven't had the time yet to give it a good review.

@owen-d (Member) commented Jul 29, 2020

I'm unsure about what to do yet. I think we may want to defer this decision for a while and see how much benefit we can gain just from parallelization (without caching). We're stuck between two unfortunate alternatives: breaking prometheus api compatibility and not having caching here.

@adityacs (Contributor, Author):

@owen-d I will mark this as keepalive for now. Later, when we get a chance, we can revisit this.

@adityacs added the keepalive label (an issue or PR that will be kept alive and never marked as stale) Jul 31, 2020
@owen-d (Member) commented Jul 31, 2020

Thank you :)


@CLAassistant commented Apr 20, 2021

CLA assistant check
All committers have signed the CLA.

@kavirajk (Contributor):

Closing this as there has been no activity for a long time. Feel free to send a new PR if anyone wants to revive the work.

@kavirajk closed this Mar 18, 2022
Labels: keepalive (an issue or PR that will be kept alive and never marked as stale), size/XL
Projects: none yet
Successfully merging this pull request may close these issues: Series API should be cacheable
6 participants