Move GlobalRefMap into service instead of global var. #5489
Conversation
Ugh, fixing some things when I blindly hit "update branch" from the GitHub UI.
I'm surprised that something like this is global. Isn't there going to be interference between components in certain edge cases? For example, if we have two different scrape -> remote_write pipelines, where we scrape different endpoints but the metric names and labels are the same? In that case, staleness markers won't be applied properly if one series goes stale but the other doesn't, and both series have the same hash but actually come from different endpoints.
```go
if err != nil {
	return nil, err
}
ls := service.(labelstore.LabelStore)
```
Is this label store still valid even if the config is reloaded? Should we use a new label store instead, and update it in the component's `Update` function?
Services require that items returned from `service.Data` are consistent throughout the lifetime of the process.
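A rough illustration of that contract (a sketch only, not necessarily the actual implementation): the service can satisfy it by handing out the same instance for the whole process lifetime.

```go
package labelstore

// service holds the ref mappings for the whole process lifetime.
// Sketch only: because Data must be consistent for the lifetime of the
// process, the service can simply return itself, so every component that
// asks for the labelstore shares one instance, even across config reloads.
type service struct {
	// mappings live here for the whole process lifetime
}

func (s *service) Data() any {
	return s
}
```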
I wonder if there are situations when this would fail. For example, this scenario might be an issue:
- An Agent does scrape -> remote_write.
- The config is reloaded to scrape a different endpoint, but this other endpoint has the same labels and metrics (so the series look the same).
- At that point, the map would get confused and think that it's still tracking the old series.
I'm not sure if this has any adverse effects, but it seems like a weird side-effect. There might be similar side-effects if we reload the config and point it to a different remote_write endpoint, too.
That would be fine, since in the current iteration they are the same to it and should be. At its heart it's an int-to-string mapping and an int-to-local-remote-write-int mapping.
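For readers following along, a rough sketch of that shape (type and field names are illustrative, not the actual implementation):

```go
package labelstore

import "sync"

// refMap sketches the two mappings described above: a hash of a series'
// labels to a process-global ref ID, and per-component links between that
// global ID and each component's local ref ID.
type refMap struct {
	mut                sync.Mutex
	labelsHashToGlobal map[uint64]uint64            // labels hash -> global ref ID
	localToGlobal      map[string]map[uint64]uint64 // component ID -> local ref ID -> global ref ID
	globalToLocal      map[string]map[uint64]uint64 // component ID -> global ref ID -> local ref ID
}
```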
> That would be fine since in the current iteration they are the same to it and should be.

I don't follow. Why is using a stale LocalRefID from the previous configuration of a remote_write not a problem?
In the above scenario it's the scraper that is changing, not the remote_write. In the case of the remote_write changing, I don't see anywhere we reset the walStorage in the remote_write component on update, so the local ID cache should stay the same.
```go
labelHash := lbls.Hash()
globalID, found := s.labelsHashToGlobal[labelHash]
```
Does this handle hash collisions?
It does not, but it's what we have been running for a year or so.
There should be no change to any underlying behavior. Beforehand the globalrefmap was a global var; this moves it into a service instead. It's possible in the above scenario with two scrape loops for there to be overlap, though it's not an issue today because we haven't wired in staleness markers. It's possible it would be an issue when we do wire them in.
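To make the collision caveat concrete, here is a hedged sketch of the lookup pattern from the snippet above (names and details are illustrative, not the actual implementation). Two distinct label sets that hash to the same value would silently share a global ID:

```go
package labelstore

import (
	"sync"

	"github.com/prometheus/prometheus/model/labels"
)

type service struct {
	mut                sync.Mutex
	currentGlobalID    uint64
	labelsHashToGlobal map[uint64]uint64
}

// GetOrAddGlobalRefID keys purely on the hash of the label set, so a hash
// collision between two different series would return the same global ID.
func (s *service) GetOrAddGlobalRefID(lbls labels.Labels) uint64 {
	s.mut.Lock()
	defer s.mut.Unlock()

	labelHash := lbls.Hash()
	if globalID, found := s.labelsHashToGlobal[labelHash]; found {
		return globalID
	}
	s.currentGlobalID++
	s.labelsHashToGlobal[labelHash] = s.currentGlobalID
	return s.currentGlobalID
}
```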
service/labelstore/data.go
```go
type LabelStore interface {
	GetOrAddLink(componentID string, localRefID uint64, lbls labels.Labels) uint64
	GetOrAddGlobalRefID(l labels.Labels) uint64
	GetGlobalRefID(componentID string, localRefID uint64) uint64
	GetLocalRefID(componentID string, globalRefID uint64) uint64
	AddStaleMarker(globalRefID uint64, l labels.Labels)
	RemoveStaleMarker(globalRefID uint64)
	CheckStaleMarkers()
}
```
Could we add some comments on this interface to clarify what it is for and when to use the methods? I think some implementation methods already have comments that could be a starting point.
Sounds good, will do.
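Something along these lines, perhaps (the comment wording below is illustrative, not the final documentation):

```go
package labelstore

import "github.com/prometheus/prometheus/model/labels"

// LabelStore maps component-local series ref IDs to process-global ones so
// series identity can be tracked across a pipeline.
type LabelStore interface {
	// GetOrAddLink links a component's local ref ID for a series to the
	// global ref ID for its labels, creating the global ID if needed.
	GetOrAddLink(componentID string, localRefID uint64, lbls labels.Labels) uint64

	// GetOrAddGlobalRefID returns the global ref ID for a label set,
	// allocating a new one if the labels have not been seen before.
	GetOrAddGlobalRefID(l labels.Labels) uint64

	// GetGlobalRefID returns the global ref ID linked to a component's
	// local ref ID, or 0 if no link exists.
	GetGlobalRefID(componentID string, localRefID uint64) uint64

	// GetLocalRefID returns the component's local ref ID linked to a
	// global ref ID, or 0 if no link exists.
	GetLocalRefID(componentID string, globalRefID uint64) uint64

	// AddStaleMarker records that a series has gone stale so it can be
	// garbage collected once staleDuration has elapsed.
	AddStaleMarker(globalRefID uint64, l labels.Labels)

	// RemoveStaleMarker clears a stale marker, e.g. when the series
	// receives a new sample.
	RemoveStaleMarker(globalRefID uint64)

	// CheckStaleMarkers garbage collects entries whose stale markers are
	// older than staleDuration.
	CheckStaleMarkers()
}
```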
```go
// staleDuration determines how long we should wait after a stale value is received before GCing that value.
var staleDuration = time.Minute * 10
```
Should we take this opportunity and make this configurable?
Since the code path is unused and I'm trying to avoid changing behavior, I'd like to push this to another PR.
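If it does get made configurable later, one hypothetical shape it could take (the struct, field name, and river tag here are made up for illustration, not part of this PR):

```go
package labelstore

import "time"

// Arguments is a hypothetical options struct for the service; today
// staleDuration is a fixed package-level var.
type Arguments struct {
	StaleDuration time.Duration `river:"stale_duration,attr,optional"`
}
```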
service/labelstore/service.go
```go
var staleDuration = time.Minute * 10

// CheckStaleMarkers is called to garbage collect any items that have grown stale over staleDuration (10m).
func (s *service) CheckStaleMarkers() {
```
`CheckStaleMarkers` doesn't sound like a good name IMO - it doesn't reflect that this will delete old markers and effectively garbage-collect. We could call it just `GarbageCollection()`, as it may serve multiple purposes.
How do you feel about `CheckAndRemoveStaleMarkers`? I feel `GarbageCollection` is a bit too general at the moment.
Sounds good. Or, shorter: `CleanUpStaleMarkers`.
service/labelstore/service.go
```go
var staleDuration = time.Minute * 10

// CheckStaleMarkers is called to garbage collect any items that have grown stale over staleDuration (10m).
func (s *service) CheckStaleMarkers() {
```
This seems to only be called from a test. Shouldn't we have something calling this?
We should, but I'm avoiding it for this PR since this is forklifting functionality with ideally zero change.
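For reference, the follow-up wiring might look something like this hypothetical loop in the service's run path (signature simplified and illustrative; `staleDuration` and `CheckStaleMarkers` are the names from this PR, the rest is not part of it):

```go
// runStaleCheck periodically invokes CheckStaleMarkers until the context
// is cancelled. Hypothetical wiring only; deferred to a follow-up PR.
func (s *service) runStaleCheck(ctx context.Context) {
	ticker := time.NewTicker(staleDuration)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s.CheckStaleMarkers()
		}
	}
}
```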
```diff
@@ -103,10 +112,10 @@ func New(o component.Options, c Arguments) (*Component, error) {
 		return 0, fmt.Errorf("%s has exited", o.ID)
 	}
 
-	localID := prometheus.GlobalRefMapping.GetLocalRefID(res.opts.ID, uint64(globalRef))
+	localID := ls.GetLocalRefID(res.opts.ID, uint64(globalRef))
```
This `localID` may be for a previous configuration of the component and may no longer be valid. Is that okay?
There might be an issue with the queue manager and seriesLabels, but I think fixing that would require some Prometheus changes. Worst case, a series would drop until the WAL watcher reread a segment containing it. This would also be a bug in our current incarnation.
True, that would already be a bug :) Maybe we could prevent it by adding a component 'incarnation' ID to the current component ID, so they don't reuse labels when the config changes. Let's park this discussion for the future.
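A tiny sketch of that idea (the helper and key format are made up for illustration, not part of this PR):

```go
package labelstore

import "fmt"

// incarnationKey derives a per-incarnation component key so a reloaded
// component would not inherit links from its previous configuration.
// e.g. incarnationKey("prometheus.remote_write.default", 2)
// -> "prometheus.remote_write.default#2"
func incarnationKey(componentID string, incarnation uint64) string {
	return fmt.Sprintf("%s#%d", componentID, incarnation)
}
```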
Co-authored-by: Piotr <[email protected]>
LGTM!
FYI created #5585 to handle wiring up garbage collection.
PR Description
This is mostly a shift from a global var to a service, with ideally no change in behavior.
PR Checklist