Introduce experimental cluster-aware metrics subsystem #1261
Conversation
Force-pushed from e640b4c to 4a3c107.
I did this for fun over some PTO, so I wanted to give it a quick scan for some self-feedback. A few high-level thoughts so far:
- I don't think we should keep this in a dev branch forever, but I'm not sure this is "ready" for main yet, even behind a feature flag.
- I've talked a few times about using gRPC as a transport instead of gossip. I'm not sure whether that's a blocker, but it would simplify at least some of the setup code for the cluster.
- It feels a little awkward how much code was written for pkg/cluster. Should ckit be handling more of this?
- I like the idea of calling the metrics API endpoints and having them aggregate results from every node in the cluster, but maybe the wrong approach is being taken here. Should API endpoints be aggregated by default? Should aggregated API endpoints have their own dedicated path?
- Maybe the k3d stuff should be a separate PR, since this is big enough as it is.
There are a lot of TODOs here; not all of them are blockers, but I need to go through everything again a bit later.
- job_name: demo.robustperception.io
  static_configs:
    - targets: ['demo.robustperception.io:3000']
      labels:
        app: grafana
    - targets: ['demo.robustperception.io:9090']
      labels:
        app: prometheus
    - targets: ['demo.robustperception.io:9100']
      labels:
        app: node_exporter
    - targets: ['demo.robustperception.io:9093']
      labels:
        app: alertmanager
    - targets: ['demo.robustperception.io:9091']
      labels:
        app: pushgateway
This was used for quick testing but we probably shouldn't hammer the robustperception demo server by making it an official example config :)
@@ -0,0 +1,99 @@
local default = import 'default/main.libsonnet';
Maybe any k3d code should be moved to a follow-up PR, with the initial PR deferring to docker-compose?
// ApplyDefaults will validate and apply defaults to the config.
func (c *Config) ApplyDefaults(grpcPort int) error {
	if !c.Enable {
I'm not sure if there should be an explicit flag for enabling this 🤔
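As a rough sketch, the rest of ApplyDefaults might do defaulting along these lines. The Config fields here are inferred from the flags registered below, and the validation details are assumptions rather than the PR's actual code:

```go
package cluster

import (
	"fmt"
	"os"
)

// Config stand-in with only the fields this sketch needs; the real struct in
// the PR has more.
type Config struct {
	Enable        bool
	JoinPeers     []string
	DiscoverPeers string
	Discoverer    struct{ Name string }
}

// ApplyDefaults validates the config and fills in defaults: the node name
// falls back to the hostname, and mutually exclusive peer options are
// rejected. grpcPort is presumably used for intra-cluster APIs in the real
// implementation.
func (c *Config) ApplyDefaults(grpcPort int) error {
	if !c.Enable {
		// Clustering is disabled; nothing to default or validate.
		return nil
	}
	if len(c.JoinPeers) > 0 && c.DiscoverPeers != "" {
		return fmt.Errorf("at most one of cluster.join-peers and cluster.discover-peers may be set")
	}
	if c.Discoverer.Name == "" {
		hostname, err := os.Hostname()
		if err != nil {
			return fmt.Errorf("failed to look up hostname for default node name: %w", err)
		}
		c.Discoverer.Name = hostname
	}
	_ = grpcPort
	return nil
}
```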
fs.BoolVar(&c.Enable, "cluster.enable", DefaultConfig.Enable, "Enables clustering.")

fs.StringVar(&c.Discoverer.Name, "cluster.node-name", DefaultConfig.Discoverer.Name, "Name to identify node in cluster. If empty, defaults to hostname.")
fs.StringVar(&c.Discoverer.ListenAddr, "cluster.listen-addr", DefaultConfig.Discoverer.ListenAddr, "IP address to listen for gossip traffic on. If not set, defaults to the first IP found from cluster.advertise-interfaces.")
fs.StringVar(&c.Discoverer.AdvertiseAddr, "cluster.advertise-addr", DefaultConfig.Discoverer.AdvertiseAddr, "IP address to advertise to peers. If not set, defaults to the first IP found from cluster.advertise-interfaces.")
fs.IntVar(&c.Discoverer.ListenPort, "cluster.listen-port", DefaultConfig.Discoverer.ListenPort, "Port to listen for TCP and UDP gossip traffic on.")

fs.Var(&c.JoinPeers, "cluster.join-peers", "List of peers to join when starting up. Peers must have a port number to connect to. Mutually exclusive with cluster.discover-peers.")
fs.StringVar(&c.DiscoverPeers, "cluster.discover-peers", "", "Discover peers when starting up using Hashicorp cloud discover. If discovered peers do not have a port number, cluster.listen-port will be appended to each peer. Mutually exclusive with cluster.join-peers.")
This is talked about briefly in #1140, but it'd be nice if gRPC was the transport for gossip given how we're only using gossip for membership awareness. Making that change would really slim down the set of flags here.
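Separately, the port-appending behavior described in the cluster.discover-peers help text (and fixed by the appendDefaultPort commit later in this PR) could look roughly like the sketch below; this is an illustration, not the PR's actual helper:

```go
package cluster

import (
	"net"
	"strconv"
)

// appendDefaultPort returns addr unchanged if it already contains a port, and
// otherwise appends defaultPort while keeping the hostname intact.
func appendDefaultPort(addr string, defaultPort int) string {
	if _, _, err := net.SplitHostPort(addr); err == nil {
		// addr already has a port; leave it alone.
		return addr
	}
	return net.JoinHostPort(addr, strconv.Itoa(defaultPort))
}
```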
pkg/cluster/addr.go
Outdated
@@ -0,0 +1,75 @@
package cluster
These functions should probably be utility methods in ckit instead
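For context, addr.go presumably contains helpers for resolving listen/advertise addresses from cluster.advertise-interfaces. The sketch below shows the general shape such a utility could take if it moved into ckit; the name and exact behavior are assumptions:

```go
package cluster

import (
	"fmt"
	"net"
)

// firstAdvertiseAddr returns the first non-loopback IPv4 address found on the
// named interfaces.
func firstAdvertiseAddr(interfaceNames []string) (string, error) {
	for _, name := range interfaceNames {
		ifc, err := net.InterfaceByName(name)
		if err != nil {
			continue // interface doesn't exist on this machine; try the next one
		}
		addrs, err := ifc.Addrs()
		if err != nil {
			continue
		}
		for _, addr := range addrs {
			ipNet, ok := addr.(*net.IPNet)
			if !ok || ipNet.IP.IsLoopback() || ipNet.IP.To4() == nil {
				continue
			}
			return ipNet.IP.String(), nil
		}
	}
	return "", fmt.Errorf("no usable address found on interfaces %v", interfaceNames)
}
```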
type hashReader struct {
	hf chash.Hash
	ps map[string]*ckit.Peer
}

// Peers returns all known peers. The resulting map must not be modified
// directly; make a copy if the data needs to be modified.
func (hr *hashReader) Peers() map[string]*ckit.Peer { return hr.ps }

// Get returns the owner of a key. An error will be returned if the hash fails
// or if there aren't enough peers.
func (hr *hashReader) Get(key string) (*ckit.Peer, error) {
	owners, err := hr.hf.Get(key, 1)
	if err != nil {
		return nil, fmt.Errorf("failed to get owner: %w", err)
	} else if len(owners) == 0 {
		return nil, fmt.Errorf("failed to get owner: no peers")
	}
	owner := hr.ps[owners[0]]
	if owner == nil {
		return nil, fmt.Errorf("failed to get owner: %q does not exist in cluster", owners[0])
	}
	return owner, nil
}
Having to define this here makes it feel like there's a wrong abstraction somewhere.
It's probably not uncommon for use cases to want a copy of a hasher; maybe this can be exposed from ckit instead?
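Purely as a hypothetical, the kind of API ckit could expose to avoid this local type might look like the following; none of these names are real ckit identifiers, and the import path is assumed from rfratto/ckit:

```go
package cluster

import "github.com/rfratto/ckit" // import path assumed

// HashSnapshot is a hypothetical point-in-time view that ckit could hand out:
// a consistent hash plus the peers it was built from.
type HashSnapshot interface {
	// Peers returns the peers the snapshot was built from. The map must be
	// treated as read-only.
	Peers() map[string]*ckit.Peer

	// Owner returns the peer that owns key, or an error if the hash fails or
	// the cluster has no peers.
	Owner(key string) (*ckit.Peer, error)
}
```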
func (m *Metrics) listDiscoveryTargetsHandler(w http.ResponseWriter, r *http.Request) {
	var (
		peers     = m.node.Peers()
		localOnly = r.URL.Query().Get("remote") == "0"
The aggregated API is interesting but maybe should be exposed a different way? Or default to not being aggregated?
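One possible alternative to the remote=0 query parameter, sketched below, is to register the aggregated view on its own path so the default endpoint stays local-only. The paths, the use of gorilla/mux, and the handler names are all hypothetical:

```go
package metrics

import (
	"net/http"

	"github.com/gorilla/mux"
)

// wireAPI registers a local-only endpoint alongside a dedicated cluster-wide
// endpoint, instead of switching behavior via a query parameter.
func wireAPI(r *mux.Router, local, aggregated http.HandlerFunc) {
	// Local view: only this node's discovery targets.
	r.HandleFunc("/agent/api/v1/metrics/targets", local)
	// Cluster view: fan the request out to every peer and merge the results.
	r.HandleFunc("/agent/api/v1/metrics/cluster/targets", aggregated)
}
```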
})) +
container.withEnvMixin([
  k.core.v1.envVar.fromFieldPath('HOSTNAME', 'spec.nodeName'),
]),
This was fixed in #1206, we can take this out.
// bindScraper binds the scraper to the storage. This is used for sending
// metadata.
bindScraper(sm *scrape.Manager)
This was actually a pretty big issue I ran into. If we don't attach the scrape manager to remote write, then we don't get metadata for any of our scraped metrics (help/text info).
Unfortunately, this also means that v2 integrations aren't generating any metadata because they're being scraped by an external scraper. I'm not sure how we should handle that yet, but I'll open up an issue for tracking.
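For context, a sketch of the plumbing bindScraper enables is below: the storage side keeps a handle to the scrape manager so remote write can look up metadata (HELP/TYPE/UNIT) for scraped series. The instance type and the Get method shape are assumptions, not the PR's code:

```go
package metrics

import (
	"fmt"
	"sync"

	"github.com/prometheus/prometheus/scrape"
)

// instance is a hypothetical stand-in for whatever implements bindScraper.
type instance struct {
	mut sync.Mutex
	sm  *scrape.Manager
}

// bindScraper binds the scraper to the storage so metadata for scraped series
// can be forwarded over remote write.
func (i *instance) bindScraper(sm *scrape.Manager) {
	i.mut.Lock()
	defer i.mut.Unlock()
	i.sm = sm
}

// Get exposes the bound scrape manager so the remote-write side can pull
// metadata from it once it is ready.
func (i *instance) Get() (*scrape.Manager, error) {
	i.mut.Lock()
	defer i.mut.Unlock()
	if i.sm == nil {
		return nil, fmt.Errorf("scrape manager not bound yet")
	}
	return i.sm, nil
}
```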
#1262
Second quick review:
I definitely think this should be split up across a few PRs to a dev branch.
Before that happens, I think it makes sense to make some changes to ckit:
- Support gRPC transport
- Support httpgrpc and creating an HTTP client to a cluster peer
- Find an easier way to get a copy of a hasher given the current set of peers instead of what's being done here
Those few changes should help clean up a lot of the helper code that was written. After that's done, this seems like a logical split for the PRs on the agent side:
- Add the agent-wide cluster, but don't hook it up to anything.
- Introduce the metrics-next feature flag and basic code for the subsystem.
- Add the API for the metrics subsystem.
- Add a k3d/docker-compose example for the new metrics subsystem.
// ApplyConfig will update the managed set of scrapers. ApplyConfig should not
// be called until the storageSet has been reconfigured.
func (sm *scraperManager) ApplyConfig(cfg *Config) error {
This should detect if a scrape job goes away, since the Discoverer won't send an empty target set for scrape jobs it doesn't know about.
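One way that detection could work, sketched with hypothetical fields and helpers: remember the job names from the previous ApplyConfig call and explicitly tear down any job that disappears from the new config.

```go
package metrics

// scraperManager stand-in with only the state this sketch needs.
type scraperManager struct {
	knownJobs map[string]struct{}
}

// syncJobs stops scrape jobs that are no longer present in the new config.
// The Discoverer never sends an empty target set for jobs it has forgotten,
// so the scraper has to do this diff itself.
func (sm *scraperManager) syncJobs(newJobs map[string]struct{}) {
	for job := range sm.knownJobs {
		if _, stillPresent := newJobs[job]; !stillPresent {
			sm.stopJob(job)
		}
	}
	sm.knownJobs = newJobs
}

// stopJob is a placeholder for tearing down the scrape loops and target set
// of a removed job.
func (sm *scraperManager) stopJob(job string) {
	_ = job
}
```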
appendDefaultPort was not defaulting the port correctly; it removed the hostname instead.
Force-pushed from 86a6d76 to 7da7a52.
I'm going to close this and split it up across a couple of PRs.
This PR provides an initial implementation for #1140.
It does the following:
- Introduces an agent-wide clustering mechanism based on rfratto/ckit, allowing any subsystem to perform hash ring lookups for node ownership checks (see the sketch after this list).
- Introduces a revamped, cluster-aware metrics subsystem as outlined in [RFC] New Metrics subsystem #1140, hidden behind a feature flag.
- The revamped metrics subsystem has API endpoints for examining its state. These endpoints support aggregation, allowing you to request the API status of one agent and get the result for the whole cluster.
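For illustration, an ownership check built on the hashReader shown earlier in the diff might look like the sketch below; the method name and the idea of comparing against the local node name are assumptions:

```go
// isLocal reports whether the node named localName owns key according to the
// consistent hash (for example, to decide whether this agent should scrape a
// given target).
func (hr *hashReader) isLocal(localName, key string) (bool, error) {
	owners, err := hr.hf.Get(key, 1)
	if err != nil {
		return false, fmt.Errorf("failed to get owner: %w", err)
	} else if len(owners) == 0 {
		return false, fmt.Errorf("failed to get owner: no peers")
	}
	return owners[0] == localName, nil
}
```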
We may want to close this PR and split it up across multiple smaller PRs instead.