
Move GlobalRefMap into service instead of global var. #5489

Merged: 17 commits merged into main from globalref_service on Oct 24, 2023

Conversation

mattdurham (Collaborator):

PR Description

This is mostly a shift from a global var to a service, with ideally no change in behavior.

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@mattdurham mattdurham changed the title WIP: Move GlobalRefMap into service instead of global var. Move GlobalRefMap into service instead of global var. Oct 16, 2023
@mattdurham mattdurham marked this pull request as ready for review October 16, 2023 14:17
@mattdurham mattdurham requested review from a team as code owners October 16, 2023 14:17
mattdurham (Collaborator, author):

Ugh, fixing some things after I blindly hit "Update branch" in the GitHub UI.

@ptodev (Contributor) left a comment:

I'm surprised that something like this is global. Isn't there going to be interference between components in certain edge cases? For example, if we have two different scrape -> remote_write pipelines where we scrape different endpoints but the metric names and labels are the same? In that case, staleness markers won't be applied properly if one series goes stale but the other doesn't, since both series have the same hash but actually come from different endpoints.

if err != nil {
	return nil, err
}
ls := service.(labelstore.LabelStore)
Contributor:

Is this label store still valid even if the config is reloaded? Should we use a new label store instead, and update it in the component's Update function?

mattdurham (Collaborator, author):

Services require that items returned from service.data are consistent throughout the lifetime of the process.
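As an illustration of that pattern, a component fetches the label store once in its constructor and keeps it for its whole lifetime rather than re-fetching it on every Update. A minimal sketch, assuming the options expose GetServiceData and the service registers under a labelstore.ServiceName constant; the Component and Arguments types here are placeholders, not the PR's code:

package example

import (
	"github.com/grafana/agent/component"
	"github.com/grafana/agent/service/labelstore"
)

// Arguments and Component are placeholder types for this sketch.
type Arguments struct{}

type Component struct {
	opts component.Options
	ls   labelstore.LabelStore
}

// New fetches the label store once and stores it on the component; the
// service guarantees the returned data stays valid for the process lifetime,
// so there is no need to re-fetch it in Update.
func New(o component.Options, _ Arguments) (*Component, error) {
	data, err := o.GetServiceData(labelstore.ServiceName)
	if err != nil {
		return nil, err
	}
	return &Component{
		opts: o,
		ls:   data.(labelstore.LabelStore),
	}, nil
}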

Contributor:

I wonder if there are situations when this would fail. For example, this scenario might be an issue:

  1. An Agent does scrape -> remote_write.
  2. The config is reloaded to scrape a different endpoint, but this other endpoint has the same labels and metrics (so the series look the same).
  3. At that point, the map would get confused and think that it's still tracking the old series.

I'm not sure if this has any adverse effects, but it seems like a weird side-effect. There might be similar side-effects if we reload the config and point it at a different remote_write endpoint, too.

mattdurham (Collaborator, author):

That would be fine since in the current iteration they are the same to it, and should be. At its heart it's an int-to-string mapping and an int-to-local-remote-write-int mapping.
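For context, a rough sketch of the shape of that data: the labelsHashToGlobal map is visible in the diff further down, but the other field and type names here are illustrative assumptions, not necessarily the PR's exact fields.

package labelstore

import "sync"

// Illustrative only: apart from labelsHashToGlobal (seen in the diff below),
// the field and type names are assumptions.
type service struct {
	mut sync.Mutex

	// Hash of a label set -> global reference ID.
	labelsHashToGlobal map[uint64]uint64

	// Per-component mapping between local (WAL) ref IDs and global ref IDs,
	// keyed by component ID, e.g. "prometheus.remote_write.default".
	mappings map[string]*remoteWriteMapping
}

type remoteWriteMapping struct {
	localToGlobal map[uint64]uint64
	globalToLocal map[uint64]uint64
}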

Contributor:

> That would be fine since in the current iteration they are the same to it, and should be.

I don't follow. Why is using a stale LocalRefID from the previous configuration of a remote_write not a problem?

mattdurham (Collaborator, author):

In the above scenario it's the scraper that is changing, not the remote_write. In the case of the remote_write changing, I don't see anywhere that we reset the walStorage in the remote_write component on update, so the local ID cache should stay the same.

}

labelHash := lbls.Hash()
globalID, found := s.labelsHashToGlobal[labelHash]
Contributor:

Does this handle hash collisions?

mattdurham (Collaborator, author):

It does not, but it's what we have been running for a year or so.
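For reference only (not part of this PR): a collision-aware variant would have to keep the full label sets keyed by hash and compare them on lookup, roughly like this; labelEntry and getOrAddGlobalID are hypothetical names.

package labelstore

import "github.com/prometheus/prometheus/model/labels"

// labelEntry pairs a label set with its assigned global ID.
type labelEntry struct {
	lbls     labels.Labels
	globalID uint64
}

// getOrAddGlobalID resolves the global ID for lbls, falling back to a full
// label comparison so two different series that happen to hash to the same
// value do not share an ID.
func getOrAddGlobalID(entries map[uint64][]labelEntry, lbls labels.Labels, nextID func() uint64) uint64 {
	hash := lbls.Hash()
	for _, e := range entries[hash] {
		if labels.Equal(e.lbls, lbls) {
			return e.globalID
		}
	}
	id := nextID()
	entries[hash] = append(entries[hash], labelEntry{lbls: lbls, globalID: id})
	return id
}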

mattdurham (Collaborator, author):

There should be no change to any underlying behavior. Beforehand the globalrefmap was a global var; this moves it into a service instead. It's possible in the above scenario with two scrape loops for there to be overlap, though it's not an issue today because we haven't wired in staleness markers. It's possible it would become an issue when we do wire them in.

pkg/flow/componenttest/componenttest.go (outdated; resolved)
Comment on lines 5 to 13
type LabelStore interface {
	GetOrAddLink(componentID string, localRefID uint64, lbls labels.Labels) uint64
	GetOrAddGlobalRefID(l labels.Labels) uint64
	GetGlobalRefID(componentID string, localRefID uint64) uint64
	GetLocalRefID(componentID string, globalRefID uint64) uint64
	AddStaleMarker(globalRefID uint64, l labels.Labels)
	RemoveStaleMarker(globalRefID uint64)
	CheckStaleMarkers()
}
Contributor:

Could we add some comments on this interface to clarify what it is for and when to use the methods? I think some implementation methods already have comments that could be a starting point.

mattdurham (Collaborator, author):

Sounds good, will do.
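As a starting point, the doc comments could look roughly like this; the wording is illustrative, not the comments that actually landed in the PR.

package labelstore

import "github.com/prometheus/prometheus/model/labels"

// LabelStore maps component-local series references to process-wide global
// references so different components can refer to the same series.
type LabelStore interface {
	// GetOrAddLink links a component's local ref ID to the global ref ID for
	// lbls, allocating a global ID if one does not exist yet.
	GetOrAddLink(componentID string, localRefID uint64, lbls labels.Labels) uint64

	// GetOrAddGlobalRefID returns the global ref ID for the given labels,
	// allocating one if the labels have not been seen before.
	GetOrAddGlobalRefID(l labels.Labels) uint64

	// GetGlobalRefID returns the global ref ID linked to a component's local
	// ref ID, or 0 if no link exists.
	GetGlobalRefID(componentID string, localRefID uint64) uint64

	// GetLocalRefID returns the component-local ref ID linked to a global ref
	// ID, or 0 if no link exists.
	GetLocalRefID(componentID string, globalRefID uint64) uint64

	// AddStaleMarker records that the series identified by globalRefID has
	// gone stale.
	AddStaleMarker(globalRefID uint64, l labels.Labels)

	// RemoveStaleMarker clears a previously recorded stale marker.
	RemoveStaleMarker(globalRefID uint64)

	// CheckStaleMarkers garbage-collects links whose stale markers are older
	// than the configured stale duration.
	CheckStaleMarkers()
}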

service/labelstore/service.go (outdated; resolved)
}

// staleDuration determines how long we wait after a stale value is received before GCing that value
var staleDuration = time.Minute * 10
Contributor:

Should we take this opportunity and make this configurable?

mattdurham (Collaborator, author):

Since the code path is unused and I'm trying to avoid changing behavior, I would like to push this to another PR.
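For whenever that follow-up happens, the change is small; a possible shape, where the attribute name and river tag are assumptions rather than an agreed design:

package labelstore

import "time"

// Arguments is a hypothetical config block for the labelstore service.
type Arguments struct {
	// StaleDuration controls how long a stale marker is kept before the
	// associated entries are garbage collected.
	StaleDuration time.Duration `river:"stale_duration,attr,optional"`
}

// DefaultArguments preserves today's hard-coded 10 minute behavior.
var DefaultArguments = Arguments{StaleDuration: 10 * time.Minute}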

var staleDuration = time.Minute * 10

// CheckStaleMarkers is called to garbage collect any items that have grown stale over the stale duration (10m)
func (s *service) CheckStaleMarkers() {
Contributor:

CheckStaleMarkers doesn't sound like a good name IMO - it doesn't reflect that this will delete old markers and effectively garbage-collect. We could call it just GarbageCollection(), as it may serve multiple purposes.

mattdurham (Collaborator, author):

How do you feel about CheckAndRemoveStaleMarkers? I feel GarbageCollection is a bit too general at the moment.

Contributor:

Sounds good. Or shorter CleanUpStaleMarkers.

var staleDuration = time.Minute * 10

// CheckStaleMarkers is called to garbage collect any items that have grown stale over the stale duration (10m)
func (s *service) CheckStaleMarkers() {
Contributor:

This seems to only be called from a test. Shouldn't we have something calling this?

mattdurham (Collaborator, author):

We should, but I'm avoiding it for this PR since this is forklifting existing functionality with ideally zero change.
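For reference, the wiring a follow-up would need is essentially a periodic call from the service's run loop; a minimal sketch, where the function name and interval parameter are assumptions:

package labelstore

import (
	"context"
	"time"
)

// runStaleCleanup calls CheckStaleMarkers on a fixed interval until ctx is
// cancelled. Illustrative only; the real wiring is deferred to a follow-up.
func runStaleCleanup(ctx context.Context, ls LabelStore, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			ls.CheckStaleMarkers()
		}
	}
}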


@@ -103,10 +112,10 @@ func New(o component.Options, c Arguments) (*Component, error) {
		return 0, fmt.Errorf("%s has exited", o.ID)
	}

-	localID := prometheus.GlobalRefMapping.GetLocalRefID(res.opts.ID, uint64(globalRef))
+	localID := ls.GetLocalRefID(res.opts.ID, uint64(globalRef))
Contributor:

This localID may be for a previous configuration of the component and may no longer be valid. Is that okay?

mattdurham (Collaborator, author):

There might be an issue with the queue manager and seriesLabels, but I think fixing that would require some Prometheus changes. Worst case, a series would drop until the WAL watcher rereads a segment containing it. This would also be a bug in our current incarnation.

Contributor:

True, that would already be a bug :) Maybe we could prevent it by adding a component 'incarnation' ID to the current component ID, so they don't reuse labels when the config changes. Let's park this discussion for the future.

@thampiotr (Contributor) left a comment:

LGTM!

mattdurham (Collaborator, author):

FYI, I created #5585 to handle wiring up garbage collection.

@mattdurham mattdurham enabled auto-merge (squash) October 24, 2023 14:50
@mattdurham mattdurham disabled auto-merge October 24, 2023 14:51
@mattdurham mattdurham merged commit b4b267d into main Oct 24, 2023
@mattdurham mattdurham deleted the globalref_service branch October 24, 2023 15:12
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024