Support caching for series API #2202
Conversation
@owen-d A small nit. Will fix and ping you for the review.
Force-pushed from 11ccda6 to 0cea93d
pkg/querier/queryrange/codec.go
Outdated
func (r *LokiSeriesRequest) GetStep() int64 {
-	return 0
+	return int64(time.Duration(int(math.Max(math.Floor(r.EndTs.Sub(r.StartTs).Seconds()/250), 1))) * time.Millisecond)
This is required for the cache; returning just 0 would make it fail here with a division by zero:
https://github.com/cortexproject/cortex/blob/1c764096b1ea0956005da64c17271ce58b1c9ce4/pkg/querier/queryrange/results_cache.go#L424
I hope this default is fine.
Hrm, I don't really love the idea of setting this here, but it should be fine.
Can you just do 1 instead?
@owen-d It can be reviewed now.
Force-pushed from b760266 to 8d2ca6c
Force-pushed from 8d2ca6c to fedbe65
pkg/querier/queryrange/roundtrip.go
Outdated
queryCacheMiddleware, cache, err := queryrange.NewResultsCacheMiddleware(
	log,
	cfg.ResultsCacheConfig,
	cacheKeyLimits{limits},
	limits,
	codec,
	extractor,
	nil,
)
Should we create a separate CacheMiddleware for series, or use a common CacheMiddleware for both series and metrics? If we create a separate CacheMiddleware, I am not sure how to handle stop here:
Line 316 in 268b7a3
t.stopper.Stop()
I covered this above, but you can make an adapter type that stops multiple caches.
I'm actually not sure if it will work with the current cacheMiddleware; it's possible. I would write some tests isolating only the middleware with a couple of funky use cases and see what happens.
Sure, will try to add a few tests.
Well, I have given it some thought and I don't think you can use the current cache system for series. This is because, within a given response, you can't extract a subset of series, since you don't know which series appears at what time.
This means you need to build your own cache system, but don't be afraid: it's actually much simpler than the cortex query-range one. You can only use a cache entry if the requested time range is wider than what you have in cache. Again, you cannot extract a smaller subset from a cache entry.
Here is how I think it should work (a rough sketch follows the list):
- For a given time range, check if you have cache entries (you can have multiple).
- If you don't, just query the API and cache the result, storing the time range with it.
- If you do, verify that the requested time range is bigger and that the current cache entry is within that time range (no overlap is allowed; an entry can't overlap another time range because, again, you can't extract a subset of series).
- Use any usable entries and query the missing ranges.
- Save all new and old entries.
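A minimal sketch of that flow in Go, assuming extents are kept sorted by start time; the types and function names here are illustrative stand-ins, not the actual Loki or Cortex code:

```go
// extent is a cached series response together with the time range it covers.
type extent struct {
	start, end int64    // Unix millis covered by this cache entry
	series     []string // cached series identifiers
}

// timeRange marks a sub-range of the request that still has to be queried.
type timeRange struct{ start, end int64 }

// partition returns the cached extents that are fully contained in the
// requested range, plus the gaps that must still be fetched from the querier.
// An extent that merely overlaps the request is ignored, because a subset of
// series cannot be extracted from a wider extent.
func partition(reqStart, reqEnd int64, extents []extent) (hits []extent, gaps []timeRange) {
	cursor := reqStart
	for _, e := range extents {
		// Skip extents that are not fully inside the still-uncovered range.
		if e.start < cursor || e.end > reqEnd {
			continue
		}
		if e.start > cursor {
			gaps = append(gaps, timeRange{cursor, e.start})
		}
		hits = append(hits, e)
		cursor = e.end
	}
	if cursor < reqEnd {
		gaps = append(gaps, timeRange{cursor, reqEnd})
	}
	return hits, gaps
}
```

The gaps are then queried as usual, and both the old extents and the newly fetched ones are written back, which covers the last two steps above.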
I should add that this new cache can be used for labels too.
I was trying to add a few tests to verify series API cache results. It fails in this test:
https://github.com/grafana/loki/blob/438bf456b0d1e4e527f523a22a6f20f0a55711f9/pkg/querier/queryrange/roundtrip_test.go#L325-L327
Build failure: https://cloud.drone.io/grafana/loki/3324/1/2
So, as you and @owen-d mentioned, it won't return correct results for subset time ranges.
First, I thought of just changing GenerateCacheKey to consider request.End as well, with everything else remaining the same. However, that would simply do a full request instead of a subset request for large time ranges, even when subset data is available in the cache.
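For illustration only, that rejected variant would have looked roughly like this; the function name and key format are hypothetical, not Loki's actual cacheKeyLimits code:

```go
package queryrange

import "fmt"

// seriesCacheKey folds both the start and the end of the request into the key,
// so requests with different ranges never share a cache entry. The downside
// noted above: a large range always misses, even when a smaller cached subset
// of the data exists.
func seriesCacheKey(userID, matchers string, start, end int64) string {
	return fmt.Sprintf("series:%s:%s:%d:%d", userID, matchers, start, end)
}
```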
Anyhow, the steps you mentioned are very clear. Will implement the cache code accordingly.
I'm ambivalent about the series cache extractor disregarding the start and end values. This has some correctness implications which may not be acceptable. For instance, the series endpoint does not include time ranges for when each label set (stream) was seen, only that said stream appeared during this time range. This means that cached results for superset ranges may return streams that don't exist when queried for a subset time range. Without this information, I'm not sure if we can correctly cache it.
@cyriltovena WDYT? Are the performance benefits worth the correctness concerns in your opinion?
)

// SeriesExtractor implements Extractor interface
type SeriesExtractor struct{}
nit: I'd probably rename this DefaultExtractor as it's reusable and doesn't have any logic tied to series.
pkg/querier/queryrange/roundtrip.go
Outdated
@@ -64,7 +64,7 @@ func NewTripperware(
 	return nil, nil, err
 }

-	seriesTripperware, err := NewSeriesTripperware(cfg, log, limits, lokiCodec, instrumentMetrics, retryMetrics, splitByMetrics)
+	seriesTripperware, cache, err := NewSeriesTripperware(cfg, log, limits, lokiCodec, SeriesExtractor{}, instrumentMetrics, retryMetrics, splitByMetrics)
This is overwriting the previous metrics cache. You'll need to make a MultiStopper []Stopper type or similar in order to coalesce the multiple caches that are returned.
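A minimal sketch of such an adapter, assuming the caches expose a parameterless Stop() method (the real Stopper interface used by the tripperware may differ slightly):

```go
// Stopper is the minimal interface the returned caches are assumed to satisfy.
type Stopper interface {
	Stop()
}

// MultiStopper fans a single Stop() call out to every underlying cache, so the
// module shutdown path (t.stopper.Stop()) still only keeps one stopper around.
type MultiStopper []Stopper

func (m MultiStopper) Stop() {
	for _, s := range m {
		if s != nil {
			s.Stop()
		}
	}
}
```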
Sure. Will make the change
Force-pushed from 014a47e to 0b8584d
Going to need to hear @cyriltovena's opinions on this. I'm hesitant to add this as it hampers correctness.
Force-pushed from 438bf45 to aa22fd0
Force-pushed from cc49d17 to bd59fa0
Force-pushed from bd59fa0 to 0c85a4c
Codecov Report
@@            Coverage Diff             @@
##           master    #2202      +/-   ##
==========================================
+ Coverage   62.14%   62.36%   +0.22%
==========================================
  Files         154      155       +1
  Lines       12457    12669     +212
==========================================
+ Hits         7741     7901     +160
- Misses       4108     4137      +29
- Partials      608      631      +23
// check if cache freshness value is provided in legacy config
maxCacheFreshness := s.cfg.LegacyMaxCacheFreshness
if maxCacheFreshness == time.Duration(0) {
	maxCacheFreshness = s.limits.MaxCacheFreshness(userID)
I think we want the user limits to override the configuration parameter. I think you can extend our WithDefaultLimits implementation to support LegacyMaxCacheFreshness as well:
https://github.com/grafana/loki/blob/master/pkg/querier/queryrange/limits.go#L30-L45
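A rough sketch of that override order, assuming a Limits interface that exposes MaxCacheFreshness(userID); the wrapper type is hypothetical, not the actual WithDefaultLimits code:

```go
package queryrange

import "time"

// Limits is the assumed per-tenant limits interface.
type Limits interface {
	MaxCacheFreshness(userID string) time.Duration
}

// limitsWithLegacyFreshness prefers the per-tenant limit and only falls back
// to the legacy config value when no per-tenant value is set.
type limitsWithLegacyFreshness struct {
	Limits                                // embedded per-tenant limits
	legacyMaxCacheFreshness time.Duration // value from the legacy config block
}

func (l limitsWithLegacyFreshness) MaxCacheFreshness(userID string) time.Duration {
	if v := l.Limits.MaxCacheFreshness(userID); v != 0 {
		return v
	}
	return l.legacyMaxCacheFreshness
}
```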
"github.com/weaveworks/common/user" | ||
) | ||
|
||
type lokiResultsCache struct { |
We might add a log results cache in the future, so I'd say we should name this one metadata cache?
👍
}

// Refreshes the cache if it contains an old proto schema.
for _, e := range resp.Extents {
You don't need this.
I'm not sure we can merge this due to the correctness issues, and I think it will require a lot of additional work: we'll need to make the metadata endpoints time-aware, in the sense that they return the time ranges that each series/label is "alive" for. After adding that, we'll need to thread those timestamps through to the cache, as well as try to keep our implementation Prometheus compatible, whose API doesn't include these timestamps. I think one could make an argument for this correctness issue being "close enough", but that's not where I stand and I haven't heard anyone voicing this opinion.
Also, please don't force push unless necessary. It ends up masking what has changed since the last version of the PR, making it difficult to give incremental reviews.
cfg    queryrange.ResultsCacheConfig
next   queryrange.Handler
cache  cache.Cache
limits Limits
This should probably be the queryrange.Limits from cortex -- it doesn't look like you're using any of the superset methods from the loki Limits interface.
The Limits interface's GenerateCacheKey is used:
loki/pkg/querier/queryrange/limits.go
Line 54 in 0c85a4c
func (l cacheKeyLimits) GenerateCacheKey(userID string, r queryrange.Request) string {
// The main idea behind implementing a custom cache middleware is that with cortex's cache
// middleware code, we can't correctly cache "series" and "label" responses since they don't
// have a time range associated with them. This custom code addresses that limitation.
func NewLokiResultsCacheMiddleware(
I'm having difficulty finding where this differs from the cortex version -- can you help me out? Also, would it be worth exposing the relevant parts in cortex rather than copy/pasting them here?
Most of the logic is just copy/paste. The difference is in the partition logic. In this implementation of the cache, we consider an extent only if it is within the request range; we don't consider any overlap. For any other request time range which doesn't match this condition, we make a new request.
loki/pkg/querier/queryrange/cache.go
Line 187 in 0c85a4c
if start <= extent.GetStart() && req.GetEnd() >= extent.GetEnd() {
Whereas cortex just ignores the extents which don't overlap:
https://github.com/cortexproject/cortex/blob/d16b68152befaabe7c65f501371f20470d0a7bb4/pkg/querier/queryrange/results_cache.go#L397
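An illustrative side-by-side of the two rules (simplified stand-ins, not the exact code from either repository):

```go
// Loki's metadata cache only reuses an extent that sits entirely inside the
// request, because a subset of series cannot be extracted from a wider extent.
func usableByMetadataCache(reqStart, reqEnd, extStart, extEnd int64) bool {
	return reqStart <= extStart && reqEnd >= extEnd
}

// Cortex's results cache accepts any overlapping extent, since a metrics
// response can later be sliced down to just the overlapping sub-range.
func usableByResultsCache(reqStart, reqEnd, extStart, extEnd int64) bool {
	return extStart <= reqEnd && extEnd >= reqStart
}
```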
Agreed that we can expose the required methods from Cortex.
@cyriltovena @owen-d What's the take on this? Should we make changes in Cortex and use them in Loki, or should we have our own implementation in Loki?
I'm OK to start with a copy of the code and then attempt to retrofit it into Cortex. I just haven't had the time yet to give a good review.
I'm unsure about what to do yet. I think we may want to defer this decision for a while and see how much benefit we can gain just from parallelization (without caching). We're stuck between two unfortunate alternatives: breaking Prometheus API compatibility and not having caching here.
@owen-d I will
Thank you :)
Closing this as there has been no activity for a long time. Feel free to send a new PR if anyone wants to revive the work.
What this PR does / why we need it:
Support caching for the series API. This PR also adds a custom cache middleware for Loki which supports caching for the series and label APIs.

Which issue(s) this PR fixes:
Fixes #2168

Checklist