Introduce experimental cluster-aware metrics subsystem #1261

Closed · 15 commits

Conversation

@rfratto (Member) commented on Jan 12, 2022

This PR provides an initial implementation for #1140.

It does the following:

  • Introduces an agent-wide clustering mechanism based on rfratto/ckit, allowing any subsystem to perform hash ring lookups for node ownership checking.

  • Introduces a revamped cluster-aware metrics subsystem outlined by [RFC] New Metrics subsystem #1140, hidden behind a feature flag.

  • The revamped metrics subsystem has API endpoints to examine its state. These endpoints support aggregation, allowing you to request the API status of one agent and get the result for the whole cluster.

We may want to close this PR and split it up across multiple smaller PRs instead.
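To make the ownership check in the first bullet concrete, here is a minimal sketch of how a subsystem might ask the cluster whether it owns a key. The names below are illustrative only; they mirror the shape of the hashReader.Get helper that appears later in this diff, not ckit's actual API.

package example

import "fmt"

// ownerLookup mirrors the shape of the hashReader.Get helper shown later in
// this diff; the interface itself is illustrative, not ckit's API.
type ownerLookup interface {
	GetOwner(key string) (string, error)
}

// shouldRun reports whether the local node owns the given key (for example,
// a scrape job name). This is the basic check the agent-wide clustering
// mechanism lets any subsystem perform.
func shouldRun(lookup ownerLookup, localName, key string) (bool, error) {
	owner, err := lookup.GetOwner(key)
	if err != nil {
		return false, fmt.Errorf("ownership lookup failed: %w", err)
	}
	return owner == localName, nil
}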

@rfratto (Member Author) left a comment

I did this for fun over some PTO, so I wanted to give it a quick scan for some self-feedback. A few high-level thoughts so far:

  1. I don't think we should have this in a dev branch forever, but I'm not sure if this is "ready" for main yet, even as a feature flag.

  2. I've talked a few times about trying to use gRPC as a transport instead of gossip. I'm not sure if that's a blocker, but it would simplify at least some of the setup code for the cluster.

  3. It feels a little awkward how much code was written for pkg/cluster. Should ckit be handling more?

  4. I like the idea of calling metrics API endpoints and having them aggregate for every endpoint in the cluster, but I think maybe the wrong approach is being taken here. Should API endpoints default to being aggregated? Should aggregated API endpoints have their own dedicated path?

  5. Maybe the k3d stuff should be a different PR since this is big enough as it is.

There are a lot of TODOs here; not all of them are blockers, but I need to go through everything again a bit later.

Comment on lines +10 to +26
- job_name: demo.robustperception.io
  static_configs:
    - targets: ['demo.robustperception.io:3000']
      labels:
        app: grafana
    - targets: ['demo.robustperception.io:9090']
      labels:
        app: prometheus
    - targets: ['demo.robustperception.io:9100']
      labels:
        app: node_exporter
    - targets: ['demo.robustperception.io:9093']
      labels:
        app: alertmanager
    - targets: ['demo.robustperception.io:9091']
      labels:
        app: pushgateway
@rfratto (Member Author):

This was used for quick testing but we probably shouldn't hammer the robustperception demo server by making it an official example config :)

@@ -0,0 +1,99 @@
local default = import 'default/main.libsonnet';
@rfratto (Member Author):

Maybe any k3d code should be moved to another PR, with the initial PR deferring to docker-compose?


// ApplyDefaults will validate and apply defaults to the config.
func (c *Config) ApplyDefaults(grpcPort int) error {
	if !c.Enable {
@rfratto (Member Author):

I'm not sure if there should be an explicit flag for enabling this 🤔
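One alternative, purely as a sketch: infer clustering from the presence of peer configuration rather than a dedicated -cluster.enable flag. The struct below is an invented reduction of the cluster Config in this diff, keeping only the fields needed to show the idea.

package example

// clusterFlags is an illustrative reduction of the cluster Config in this
// diff; only the fields relevant to the sketch are shown.
type clusterFlags struct {
	JoinPeers     []string // -cluster.join-peers
	DiscoverPeers string   // -cluster.discover-peers
}

// enabled treats clustering as on whenever peers are configured, one way to
// avoid a dedicated -cluster.enable flag.
func (c clusterFlags) enabled() bool {
	return len(c.JoinPeers) > 0 || c.DiscoverPeers != ""
}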

Comment on lines +53 to +65
fs.BoolVar(&c.Enable, "cluster.enable", DefaultConfig.Enable, "Enables clustering.")

fs.StringVar(&c.Discoverer.Name, "cluster.node-name", DefaultConfig.Discoverer.Name, "Name to identify node in cluster. If empty, defaults to hostname.")
fs.StringVar(&c.Discoverer.ListenAddr, "cluster.listen-addr", DefaultConfig.Discoverer.ListenAddr, "IP address to listen for gossip traffic on. If not set, defaults to the first IP found from cluster.advertise-interfaces.")
fs.StringVar(&c.Discoverer.AdvertiseAddr, "cluster.advertise-addr", DefaultConfig.Discoverer.AdvertiseAddr, "IP address to advertise to peers. If not set, defaults to the first IP found from cluster.advertise-interfaces.")
fs.IntVar(&c.Discoverer.ListenPort, "cluster.listen-port", DefaultConfig.Discoverer.ListenPort, "Port to listen for TCP and UDP gossip traffic on.")

fs.Var(&c.JoinPeers, "cluster.join-peers", "List of peers to join when starting up. Peers must have port number to connect to. Mutually exclusive with cluster.discover-peers.")
fs.StringVar(&c.DiscoverPeers, "cluster.discover-peers", "", "Discover peers when starting up using Hashicorp cloud discover. If discovered peers do not have a port number, cluster.listen-port will be appended to each peer. Mutually exclusive with cluster.join-peers.")
@rfratto (Member Author):

This is talked about briefly in #1140, but it'd be nice if gRPC were the transport for gossip, given that we're only using gossip for membership awareness. Making that change would really slim down the set of flags here.

@@ -0,0 +1,75 @@
package cluster
@rfratto (Member Author):

These functions should probably be utility methods in ckit instead

Comment on lines +85 to +108
type hashReader struct {
	hf chash.Hash
	ps map[string]*ckit.Peer
}

// Peers returns all known peers. The resulting map must not be modified
// directly; make a copy if the data needs to be modified.
func (hr *hashReader) Peers() map[string]*ckit.Peer { return hr.ps }

// Get returns the owner of a key. An error will be returned if the hash fails
// or if there aren't enough peers.
func (hr *hashReader) Get(key string) (*ckit.Peer, error) {
	owners, err := hr.hf.Get(key, 1)
	if err != nil {
		return nil, fmt.Errorf("failed to get owner: %w", err)
	} else if len(owners) == 0 {
		return nil, fmt.Errorf("failed to get owner: no peers")
	}
	owner := hr.ps[owners[0]]
	if owner == nil {
		return nil, fmt.Errorf("failed to get owner: %q does not exist in cluster", owners[0])
	}
	return owner, nil
}
@rfratto (Member Author):

Having to define this here makes it feel like the abstractions are a bit off.

It's probably not uncommon for use cases to want a copy of a hasher; maybe this could be exposed from ckit instead?
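As a sketch of the kind of helper that comment is asking ckit to expose (entirely hypothetical; not an existing ckit API), a point-in-time ring snapshot could bundle the hash lookup with the peers it was built from:

package example

import "fmt"

// ringSnapshot pairs a hash lookup with the peer set it was built from, so
// callers wouldn't have to hand-roll the hashReader above. Entirely
// hypothetical; ckit does not necessarily expose anything like this.
type ringSnapshot struct {
	lookup func(key string) (string, error) // key -> owning peer name
	peers  map[string]struct{}
}

// Owner returns the owning peer for key, validating that the peer is still
// part of this snapshot's view of the cluster.
func (s ringSnapshot) Owner(key string) (string, error) {
	name, err := s.lookup(key)
	if err != nil {
		return "", fmt.Errorf("failed to get owner: %w", err)
	}
	if _, ok := s.peers[name]; !ok {
		return "", fmt.Errorf("failed to get owner: %q does not exist in cluster", name)
	}
	return name, nil
}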

func (m *Metrics) listDiscoveryTargetsHandler(w http.ResponseWriter, r *http.Request) {
	var (
		peers     = m.node.Peers()
		localOnly = r.URL.Query().Get("remote") == "0"
@rfratto (Member Author):

The aggregated API is interesting, but maybe it should be exposed in a different way? Or default to not being aggregated?
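One way to read the "different way" suggestion, as a sketch only: keep the local endpoint as the default and give the aggregated view its own path instead of a ?remote= query parameter. The paths and handler names below are invented for illustration, not the PR's actual routes.

package example

import (
	"fmt"
	"net/http"
)

// listLocalTargets serves only this agent's view of discovery targets.
// Stub body for illustration.
func listLocalTargets(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "local targets only")
}

// listClusterTargets would fan out to peers and merge their responses.
// Stub body for illustration.
func listClusterTargets(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "aggregated targets from every peer")
}

// registerRoutes makes local-only the default and puts the aggregated view
// on its own dedicated path instead of behind a query parameter.
func registerRoutes(mux *http.ServeMux) {
	mux.HandleFunc("/agent/api/v1/metrics/targets", listLocalTargets)
	mux.HandleFunc("/agent/api/v1/cluster/metrics/targets", listClusterTargets)
}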

Comment on lines +47 to +50
})) +
container.withEnvMixin([
  k.core.v1.envVar.fromFieldPath('HOSTNAME', 'spec.nodeName'),
]),
@rfratto (Member Author):

This was fixed in #1206; we can take this out.

Comment on lines +61 to +63
// bindScraper binds the scraper to the storage. This is used for sending
// metadata.
bindScraper(sm *scrape.Manager)
@rfratto (Member Author):

This was actually a pretty big issue I ran into: if we don't attach the scrape manager to remote write, we don't get metadata (help/type info) for any of our scraped metrics.

Unfortunately, this also means that v2 integrations aren't generating any metadata because they're being scraped by an external scraper. I'm not sure how we should handle that yet, but I'll open up an issue for tracking.
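For context, a minimal sketch of the wiring the bindScraper hook above implies: the storage keeps a reference to the scrape manager so remote write has somewhere to read metric metadata from. The instance type here is an illustrative stand-in, not the PR's actual storage wrapper; only scrape.Manager is a real Prometheus type.

package example

import (
	"sync"

	"github.com/prometheus/prometheus/scrape"
)

// instance is an illustrative stand-in for the storage wrapper in this diff.
type instance struct {
	mut     sync.Mutex
	scraper *scrape.Manager
}

// bindScraper attaches the scrape manager to the storage. Without this link,
// remote write has no source of metadata for the series being scraped.
func (i *instance) bindScraper(sm *scrape.Manager) {
	i.mut.Lock()
	defer i.mut.Unlock()
	i.scraper = sm
}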

@rfratto (Member Author):

#1262

@rfratto changed the title from dev.metrics-v2 to Introduce experimental cluster-aware metrics subsystem on Jan 13, 2022
@rfratto (Member Author) left a comment

Second quick review:

I definitely think this should be split up across a few PRs to a dev branch.

Before that happens, I think it makes sense to make some changes to ckit:

  • Support gRPC transport
  • Support httpgrpc and creating an HTTP client to a cluster peer
  • Find an easier way to get a copy of a hasher given the current set of peers instead of what's being done here

Those few changes should help clean up a lot of the helper code that was written. After that's done, this seems like a logical split for the PRs on the agent side:

  1. Add the agent-wide cluster, but don't hook it into anything.
  2. Introduce the metrics-next feature flag and basic code for the subsystem.
  3. Add the API for the metrics subsystem.
  4. Add a k3d/docker-compose example for the new metrics subsystem.


// ApplyConfig will update the managed set of scrapers. ApplyConfig should not
// be called until the storageSet is reconfigured first.
func (sm *scraperManager) ApplyConfig(cfg *Config) error {
@rfratto (Member Author):

This should detect if a scrape job goes away, since the Discoverer won't send an empty target set for scrape jobs it doesn't know about.
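A sketch of the missing cleanup being described: compare the job names from the previous ApplyConfig call against the new config and tear down anything that disappeared. The struct and fields below are a hypothetical reduction, not the PR's actual scraperManager.

package example

// scraperManager here is a hypothetical reduction of the one in this diff,
// keeping just enough state to show the diffing step.
type scraperManager struct {
	activeJobs map[string]func() // scrape job name -> stop function
}

// removeStaleJobs stops and forgets scrape jobs that no longer appear in the
// new config, since the Discoverer won't send an empty target set for jobs
// it no longer knows about.
func (sm *scraperManager) removeStaleJobs(newJobs map[string]struct{}) {
	for job, stop := range sm.activeJobs {
		if _, ok := newJobs[job]; !ok {
			stop()
			delete(sm.activeJobs, job)
		}
	}
}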

@rfratto (Member Author) commented on Jan 29, 2022

I'm going to close this and split it up across a couple of PRs.

@rfratto closed this on Jan 29, 2022
@github-actions bot added the frozen-due-to-age label on Apr 4, 2024
@github-actions bot locked this as resolved and limited conversation to collaborators on Apr 4, 2024