Introduce experimental cluster-aware metrics subsystem #1261

Closed · 15 commits

Conversation

@rfratto (Member) commented on Jan 12, 2022

This PR provides an initial implementation for #1140.

It does the following:

  • Introduces an agent-wide clustering mechanism based on rfratto/ckit, allowing any subsystem to perform hash ring lookups for node ownership checking.

  • Introduces a revamped cluster-aware metrics subsystem outlined by [RFC] New Metrics subsystem #1140, hidden behind a feature flag.

  • The revamped metrics subsystem has API endpoints to examine its state. These endpoints support aggregation, allowing you to request the API status of one agent and get the result for the whole cluster.

We may want to close this PR and split it up across multiple smaller PRs instead.
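To make the ownership check in the first bullet concrete, here is a minimal sketch of how a subsystem might ask the cluster whether it owns a key. The names below are illustrative only; they mirror the shape of the hashReader.Get helper that appears later in this diff, not ckit's actual API.

package example

import "fmt"

// ownerLookup mirrors the shape of the hashReader.Get helper shown later in
// this diff; the interface itself is illustrative, not ckit's API.
type ownerLookup interface {
	GetOwner(key string) (string, error)
}

// shouldRun reports whether the local node owns the given key (for example,
// a scrape job name). This is the basic check the agent-wide clustering
// mechanism lets any subsystem perform.
func shouldRun(lookup ownerLookup, localName, key string) (bool, error) {
	owner, err := lookup.GetOwner(key)
	if err != nil {
		return false, fmt.Errorf("ownership lookup failed: %w", err)
	}
	return owner == localName, nil
}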

@rfratto (Member Author) left a comment

I did this for fun over some PTO, so I wanted to give it a quick scan for some self-feedback. A few high-level thoughts so far:

  1. I don't think we should have this in a dev branch forever, but I'm not sure if this is "ready" for main yet, even as a feature flag.

  2. I've talked a few times about trying to use gRPC as a transport instead of gossip. I'm not sure if that's a blocker, but it would simplify at least some of the setup code for the cluster.

  3. It feels a little awkward how much code was written for pkg/cluster. Should ckit be handling more?

  4. I like the idea of calling metrics API endpoints and having them aggregate for every endpoint in the cluster, but I think maybe the wrong approach is being taken here. Should API endpoints default to being aggregated? Should aggregated API endpoints have their own dedicated path?

  5. Maybe the k3d stuff should be a different PR since this is big enough as it is.

There are a lot of TODOs here; not all of them are blockers, but I need to go through everything again a bit later.

Comment on lines +10 to +26
- job_name: demo.robustperception.io
  static_configs:
    - targets: ['demo.robustperception.io:3000']
      labels:
        app: grafana
    - targets: ['demo.robustperception.io:9090']
      labels:
        app: prometheus
    - targets: ['demo.robustperception.io:9100']
      labels:
        app: node_exporter
    - targets: ['demo.robustperception.io:9093']
      labels:
        app: alertmanager
    - targets: ['demo.robustperception.io:9091']
      labels:
        app: pushgateway
@rfratto (Member Author):

This was used for quick testing but we probably shouldn't hammer the robustperception demo server by making it an official example config :)

@@ -0,0 +1,99 @@
local default = import 'default/main.libsonnet';
@rfratto (Member Author):

Maybe any k3d code should be moved to another PR, with the initial PR deferring to docker-compose?


// ApplyDefaults will validate and apply defaults to the config.
func (c *Config) ApplyDefaults(grpcPort int) error {
	if !c.Enable {
@rfratto (Member Author):

I'm not sure if there should be an explicit flag for enabling this 🤔
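One alternative, purely as a sketch: infer clustering from the presence of peer configuration rather than a dedicated -cluster.enable flag. The struct below is an invented reduction of the cluster Config in this diff, keeping only the fields needed to show the idea.

package example

// clusterFlags is an illustrative reduction of the cluster Config in this
// diff; only the fields relevant to the sketch are shown.
type clusterFlags struct {
	JoinPeers     []string // -cluster.join-peers
	DiscoverPeers string   // -cluster.discover-peers
}

// enabled treats clustering as on whenever peers are configured, one way to
// avoid a dedicated -cluster.enable flag.
func (c clusterFlags) enabled() bool {
	return len(c.JoinPeers) > 0 || c.DiscoverPeers != ""
}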

Comment on lines +53 to +65
fs.BoolVar(&c.Enable, "cluster.enable", DefaultConfig.Enable, "Enables clustering.")

fs.StringVar(&c.Discoverer.Name, "cluster.node-name", DefaultConfig.Discoverer.Name, "Name to identify node in cluster. If empty, defaults to hostname.")
fs.StringVar(&c.Discoverer.ListenAddr, "cluster.listen-addr", DefaultConfig.Discoverer.ListenAddr, "IP address to listen for gossip traffic on. If not set, defaults to the first IP found from cluster.advertise-interfaces.")
fs.StringVar(&c.Discoverer.AdvertiseAddr, "cluster.advertise-addr", DefaultConfig.Discoverer.AdvertiseAddr, "IP address to advertise to peers. If not set, defaults to the first IP found from cluster.advertise-interfaces.")
fs.IntVar(&c.Discoverer.ListenPort, "cluster.listen-port", DefaultConfig.Discoverer.ListenPort, "Port to listen for TCP and UDP gossip traffic on.")

fs.Var(&c.JoinPeers, "cluster.join-peers", "List of peers to join when starting up. Peers must have port number to connect to. Mutually exclusive with cluster.discover-peers.")
fs.StringVar(&c.DiscoverPeers, "cluster.discover-peers", "", "Discover peers when starting up using Hashicorp cloud discover. If discovered peers do not have a port number, cluster.listen-port will be appended to each peer. Mutually exclusive with cluster.join-peers.")
@rfratto (Member Author):

This is talked about briefly in #1140, but it'd be nice if gRPC were the transport for gossip, given that we're only using gossip for membership awareness. Making that change would really slim down the set of flags here.

@@ -0,0 +1,75 @@
package cluster
@rfratto (Member Author):

These functions should probably be utility methods in ckit instead

Comment on lines +85 to +108
type hashReader struct {
	hf chash.Hash
	ps map[string]*ckit.Peer
}

// Peers returns all known peers. The resulting map must not be modified
// directly; make a copy if the data needs to be modified.
func (hr *hashReader) Peers() map[string]*ckit.Peer { return hr.ps }

// Get returns the owner of a key. An error will be returned if the hash fails
// or if there aren't enough peers.
func (hr *hashReader) Get(key string) (*ckit.Peer, error) {
	owners, err := hr.hf.Get(key, 1)
	if err != nil {
		return nil, fmt.Errorf("failed to get owner: %w", err)
	} else if len(owners) == 0 {
		return nil, fmt.Errorf("failed to get owner: no peers")
	}
	owner := hr.ps[owners[0]]
	if owner == nil {
		return nil, fmt.Errorf("failed to get owner: %q does not exist in cluster", owners[0])
	}
	return owner, nil
}
@rfratto (Member Author):

Having to define this here makes it feel like the abstractions are a bit off.

It's probably not uncommon for use cases to want a copy of a hasher; maybe this could be exposed from ckit instead?
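As a sketch of the kind of helper that comment is asking ckit to expose (entirely hypothetical; not an existing ckit API), a point-in-time ring snapshot could bundle the hash lookup with the peers it was built from:

package example

import "fmt"

// ringSnapshot pairs a hash lookup with the peer set it was built from, so
// callers wouldn't have to hand-roll the hashReader above. Entirely
// hypothetical; ckit does not necessarily expose anything like this.
type ringSnapshot struct {
	lookup func(key string) (string, error) // key -> owning peer name
	peers  map[string]struct{}
}

// Owner returns the owning peer for key, validating that the peer is still
// part of this snapshot's view of the cluster.
func (s ringSnapshot) Owner(key string) (string, error) {
	name, err := s.lookup(key)
	if err != nil {
		return "", fmt.Errorf("failed to get owner: %w", err)
	}
	if _, ok := s.peers[name]; !ok {
		return "", fmt.Errorf("failed to get owner: %q does not exist in cluster", name)
	}
	return name, nil
}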

func (m *Metrics) listDiscoveryTargetsHandler(w http.ResponseWriter, r *http.Request) {
	var (
		peers     = m.node.Peers()
		localOnly = r.URL.Query().Get("remote") == "0"
@rfratto (Member Author):

The aggregated API is interesting, but maybe it should be exposed in a different way? Or default to not being aggregated?
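One way to read the "different way" suggestion, as a sketch only: keep the local endpoint as the default and give the aggregated view its own path instead of a ?remote= query parameter. The paths and handler names below are invented for illustration, not the PR's actual routes.

package example

import (
	"fmt"
	"net/http"
)

// listLocalTargets serves only this agent's view of discovery targets.
// Stub body for illustration.
func listLocalTargets(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "local targets only")
}

// listClusterTargets would fan out to peers and merge their responses.
// Stub body for illustration.
func listClusterTargets(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "aggregated targets from every peer")
}

// registerRoutes makes local-only the default and puts the aggregated view
// on its own dedicated path instead of behind a query parameter.
func registerRoutes(mux *http.ServeMux) {
	mux.HandleFunc("/agent/api/v1/metrics/targets", listLocalTargets)
	mux.HandleFunc("/agent/api/v1/cluster/metrics/targets", listClusterTargets)
}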

Comment on lines +47 to +50
})) +
container.withEnvMixin([
  k.core.v1.envVar.fromFieldPath('HOSTNAME', 'spec.nodeName'),
]),
@rfratto (Member Author):

This was fixed in #1206; we can take this out.

Comment on lines +61 to +63
// bindScraper binds the scraper to the storage. This is used for sending
// metadata.
bindScraper(sm *scrape.Manager)
@rfratto (Member Author):

This was actually a pretty big issue I ran into: if we don't attach the scrape manager to remote write, we don't get metadata (help/type info) for any of our scraped metrics.

Unfortunately, this also means that v2 integrations aren't generating any metadata because they're being scraped by an external scraper. I'm not sure how we should handle that yet, but I'll open up an issue for tracking.
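For context, a minimal sketch of the wiring the bindScraper hook above implies: the storage keeps a reference to the scrape manager so remote write has somewhere to read metric metadata from. The instance type here is an illustrative stand-in, not the PR's actual storage wrapper; only scrape.Manager is a real Prometheus type.

package example

import (
	"sync"

	"github.com/prometheus/prometheus/scrape"
)

// instance is an illustrative stand-in for the storage wrapper in this diff.
type instance struct {
	mut     sync.Mutex
	scraper *scrape.Manager
}

// bindScraper attaches the scrape manager to the storage. Without this link,
// remote write has no source of metadata for the series being scraped.
func (i *instance) bindScraper(sm *scrape.Manager) {
	i.mut.Lock()
	defer i.mut.Unlock()
	i.scraper = sm
}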

@rfratto (Member Author):

#1262

@rfratto changed the title from dev.metrics-v2 to Introduce experimental cluster-aware metrics subsystem on Jan 13, 2022
@rfratto (Member Author) left a comment

Second quick review:

I definitely think this should be split up across a few PRs to a dev branch.

Before that happens, I think it makes sense to make some changes to ckit:

  • Support gRPC transport
  • Support httpgrpc and creating an HTTP client to a cluster peer
  • Find an easier way to get a copy of a hasher given the current set of peers instead of what's being done here

Those few changes should help clean up a lot of the helper code that was written. After that's done, this seems like a logical split for the PRs on the agent side:

  1. Add the agent-wide cluster, but don't hook it into anything.
  2. Introduce the metrics-next feature flag and basic code for the subsystem.
  3. Add the API for the metrics subsystem.
  4. Add a k3d/docker-compose example for the new metrics subsystem.


// ApplyConfig will update the managed set of scrapers. ApplyConfig should not
// be called until the storageSet is reconfigured first.
func (sm *scraperManager) ApplyConfig(cfg *Config) error {
@rfratto (Member Author):

This should detect if a scrape job goes away, since the Discoverer won't send an empty target set for scrape jobs it doesn't know about.
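A sketch of the missing cleanup being described: compare the job names from the previous ApplyConfig call against the new config and tear down anything that disappeared. The struct and fields below are a hypothetical reduction, not the PR's actual scraperManager.

package example

// scraperManager here is a hypothetical reduction of the one in this diff,
// keeping just enough state to show the diffing step.
type scraperManager struct {
	activeJobs map[string]func() // scrape job name -> stop function
}

// removeStaleJobs stops and forgets scrape jobs that no longer appear in the
// new config, since the Discoverer won't send an empty target set for jobs
// it no longer knows about.
func (sm *scraperManager) removeStaleJobs(newJobs map[string]struct{}) {
	for job, stop := range sm.activeJobs {
		if _, ok := newJobs[job]; !ok {
			stop()
			delete(sm.activeJobs, job)
		}
	}
}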

@rfratto (Member Author) commented on Jan 29, 2022

I'm going to close this and split it up across a couple of PRs.

@rfratto closed this on Jan 29, 2022
@github-actions bot added the frozen-due-to-age label on Apr 4, 2024
@github-actions bot locked this as resolved and limited conversation to collaborators on Apr 4, 2024