
feat: add storage utilization metrics #7868

Merged · 16 commits into confluentinc:master · Aug 25, 2021

Conversation

@lct45 (Contributor) commented Jul 23, 2021

Description

As part of the observability metrics we're adding in Q3, we'd like to see storage metrics. This adds storage metrics by node, task, and query.

Testing done

Tested locally that the metrics are picked up, added unit tests, and did benchmarking.

Reviewer checklist

  • Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@lct45 requested a review from a team as a code owner on July 23, 2021 16:27
@lct45 requested review from cadonna and rodesai on July 23, 2021 16:27
final long totalSpace = f.getTotalSpace();
final double percFree = percentage(freeSpace, (double) totalSpace);
dataPoints.add(new MetricsReporter.DataPoint(sampleTime,"storage-usage", freeSpace));
dataPoints.add(new MetricsReporter.DataPoint(sampleTime,"storage-usage-perc", percFree));
Contributor Author (@lct45):

I found that looking at the raw free space number didn't give me a sense of how full my disk was, but maybe other people automatically know how much space there is in total? For the task level we don't have total file size, so maybe we don't want it here either, for consistency.

final long totalSpace = f.getTotalSpace();
final long usedSpace = totalSpace - f.getFreeSpace();
final double percFree = percentage((double) usedSpace, (double) totalSpace);
dataPoints.add(new MetricsReporter.DataPoint(sampleTime,"storage-usage", (double) usedSpace));
Contributor:

nit: space after comma.

Contributor:

Also: it seems we only keep three data points here, since the next file's usage would simply overwrite the previous ones. Is that intentional?

Contributor Author (@lct45):

Hmm, yeah, good point @guozhangwang. Looking back at this, I'm really reporting storage usage by query, which I do later. I switched it to aggregate all of the files and return the final result as the node-level reporting. I saw your comment back on the one-pager about whether getting the query storage usage here with f.getFreeSpace would be more accurate than aggregating sst-file-size; I'll leave log lines in and see if I can get a read on that during benchmarks.

Contributor:

Makes sense, thanks.

@guozhangwang (Contributor) left a comment:

Made a second pass, but it seems this PR is not complete and ready for review yet? @lct45 please feel free to ping me when it's done.

);
metrics.put(queryId, new ArrayList<>());
}
// We've seen metric for this query before
Contributor:

nit: for this query's task

private final Collection<TaskStorageMetric> registeredMetrics;

public UtilizationMetricsListener(final KsqlConfig config, final Metrics metricRegistry) {
this.queryStorageMetrics = new ConcurrentHashMap<>();
Contributor:

Could you elaborate a bit more on the differences between these three collections: the metrics map, the queryStorageMetrics map, and the registeredMetrics set? From what I can see it seems:

  • queryStorageMetrics is used for the per-query metrics, but it is not populated or updated anywhere.
  • registeredMetrics is used for per-query-task metrics.
  • metrics is also used for storing per-query metrics?

The relationships between these and how they are leveraged / updated are not very clear to me.

Contributor:

I think we can simplify things here by doing a few things:

First, repurpose TaskStorageMetric as a general-purpose gauge that computes its value by adding up other metrics. It already is this; you just need to drop the notion of a task ID from it. You can rename it to AggregatedMetric (or maybe even refactor RocksDBMetricCollector to pull out that inner class and use it).

Then, maintain 2 maps:
  • Map<String, AggregatedMetric> queryStorageUsage - this maps from query ID to an aggregated metric that measures usage for the query. Its aggregated metric will contain all the state store metrics for that query.
  • Map<TaskID, AggregatedMetric> taskStorageUsage - this maps from task ID to an aggregated metric that measures usage for the task. Its aggregated metric will contain all the state store metrics for that task. You'll probably have to add a TaskID type here that holds both the task ID and the query ID.

Then, whenever you get an event for a new metric, first compute the query ID and task ID. If the task is new, add a new task-level metric to the metric registry (metricRegistry) and a new AggregatedMetric to taskStorageUsage. If the query is new, add a new query-level metric to the metric registry and a new AggregatedMetric to queryStorageUsage. Then add the metric you're being notified of to the AggregatedMetric for the task and for the query.

Similarly, on a notification that a state store metric is removed, remove the state store metric from the corresponding AggregatedMetric instances. If an AggregatedMetric instance is now empty, unregister it from the metrics registry.
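
For reference, a rough sketch of the shape being described above (class, field, and method names are illustrative, not the PR's final code):

import java.math.BigInteger;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.KafkaMetric;

// Illustrative only: a gauge whose value is the sum of whatever state-store
// metrics are currently registered with it.
final class AggregatedMetric {
  private final Map<MetricName, KafkaMetric> metrics = new ConcurrentHashMap<>();

  void add(final KafkaMetric metric) {
    metrics.put(metric.metricName(), metric);
  }

  void remove(final KafkaMetric metric) {
    metrics.remove(metric.metricName());
  }

  boolean isEmpty() {
    return metrics.isEmpty();
  }

  BigInteger getValue() {
    BigInteger total = BigInteger.ZERO;
    for (final KafkaMetric metric : metrics.values()) {
      // total-sst-files-size is reported as a numeric gauge per state store
      total = total.add(BigInteger.valueOf(((Number) metric.metricValue()).longValue()));
    }
    return total;
  }
}

// The listener would then keep two lookup maps (hypothetical field names):
//   Map<String, AggregatedMetric> queryStorageUsage  - keyed by query ID
//   Map<TaskId, AggregatedMetric> taskStorageUsage   - keyed by a TaskId value type
//     that holds both the query ID and the task ID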

if (metrics.get(queryId).contains(taskId)) {
// we have this task metric already, just need to update it
resetMetric(metric);
for (TaskStorageMetric storageMetric : registeredMetrics) {
Contributor:

Why not put this for loop into resetMetric as well?

}
final String taskId = metric.metricName().tags().getOrDefault("task-id", "");

final String queryId = "";
Contributor:

we should be able to get this from the task id or the thread id, but it depends on whether we're using the new consolidated runtime stuff or not.

currently the thread-id tag will have the form _confluent-ksql-<server id>query_<query ID>-<uuid>-StreamThread-0 for persistent queries, and the form _confluent-ksql-<server id>transient_<query ID>-<uuid>-StreamThread-0 for transient queries.

once the new runtime changes go in, you would need to look in the task ID. @ableegoldman can you advise on how @lct45 would get the query ID from the task ID for the new runtime?
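
For reference, a minimal sketch of pulling the query ID out of that thread-id tag for the current (pre-consolidated-runtime) naming scheme; the regex mirrors the one this PR ends up using further down, and the helper class name is made up:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QueryIdParser {
  // Captures <query ID> from thread names of the form
  //   _confluent-ksql-<server id>query_<query ID>-<uuid>-StreamThread-0
  //   _confluent-ksql-<server id>transient_<query ID>-<uuid>-StreamThread-0
  private static final Pattern QUERY_ID_PATTERN =
      Pattern.compile("(?<=query_|transient_)(.*?)(?=-)");

  static String queryIdFromThreadId(final String threadId) {
    final Matcher matcher = QUERY_ID_PATTERN.matcher(threadId);
    return matcher.find() ? matcher.group(1) : "";
  }
}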

);
metricsSeen.put(queryId, new ArrayList<>());
}
final TaskStorageMetric newMetric = new TaskStorageMetric(
Contributor:

I think you can simplify this bit with a few changes:

  • make metricsSeen a map from query id to a map from task id to TaskStorageMetric (e.g. Map<String, Map<String, TaskStorageMetric>>)
  • there's no need for resetAndUpdateMetric - you just need to call add on TaskStorageMetric and the value will get replaced in the map. Also technically that case should never actually happen - consider adding a warning log

Then this bit can just be:

final TaskStorageMetric taskMetric;
if (!metricsSeen.get(queryId).containsKey(taskId)) {
    taskMetric = new TaskStorageMetric(...);
    metricsSeen.get(queryId).put(taskId, taskMetric);
    metricRegistry.addMetric(...);
} else {
    taskMetric = metricsSeen.get(queryId).get(taskId);
}
taskMetric.add(metric);

This should simplify removal too since you don't have to iterate the whole list.

Contributor Author (@lct45):

What case should never happen, calling add on an existing TaskStorageMetric? The rest of this makes sense and makes it cleaner.

Matcher matcher = pattern.matcher(queryIdTag);
final String queryId = matcher.find() ? matcher.group(1) : "";
// if we haven't seen a task for this query yet
if (!metricsSeen.containsKey(queryId)) {
Contributor:

Everything from this line down to the end of this method needs to be in a synchronized block. I wouldn't synchronize the whole method, though, because most of the time it's called it doesn't need to do anything (since most metrics will fail the initial check).
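
In other words, something shaped roughly like this minimal sketch (field and helper names are placeholders; only the cheap name check stays outside the lock):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.common.metrics.KafkaMetric;

final class MetricChangeSketch {
  private final Map<String, List<String>> metricsSeen = new ConcurrentHashMap<>();

  public void metricChange(final KafkaMetric metric) {
    // Most metrics fail this check, so we return without ever taking the lock.
    if (!metric.metricName().name().equals("total-sst-files-size")) {
      return;
    }
    final String queryId = getQueryId(metric);
    final String taskId = metric.metricName().tags().getOrDefault("task-id", "");
    synchronized (metricsSeen) {
      // Everything that reads or mutates the shared bookkeeping (and registers
      // new gauges with the metric registry) happens while holding the lock.
      metricsSeen.computeIfAbsent(queryId, q -> new ArrayList<>()).add(taskId);
    }
  }

  private static String getQueryId(final KafkaMetric metric) {
    // Placeholder: the real listener parses the query ID out of the thread-id tag.
    return metric.metricName().tags().getOrDefault("thread-id", "");
  }
}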

Contributor Author (@lct45):

Synchronized on metricsSeen, not metricsRegistry, right?

this.metricName = metricName;
}

private void add(final KafkaMetric metric) {
Contributor:

add, remove, and getValue need to be synchronized - streams threads may be creating/removing state stores while telemetry is reading the value

Contributor:

never mind - noticed this is a ConcurrentHashMap

}
}

final BigInteger computeQueryMetric(final String queryId) {
Contributor:

this needs to be synchronized with metric addition and removal.

Contributor:

Also, to be extra careful I would call metricValue outside of the lock. metricValue is going to call into Streams, which could in theory (though very unlikely) try to take a lock that another stream thread holds while waiting to get our lock (because that thread is trying to register a metric).

Contributor Author (@lct45):

Just to clarify, is this still true since the metric map is a concurrent hash map?

final MetricName nodeTotal =
metricRegistry.metricName("node-storage-total", METRIC_GROUP);
final MetricName nodeUsed =
metricRegistry.metricName("node-storage-used", METRIC_GROUP);
Contributor Author (@lct45):

@rodesai I can't remember what we landed on for node metrics - IIRC calculated storage used by doing (f.getTotalSpace - f.getFreeSpace) isn't actually accurate for what we want. Given that, do we just want to report f.getFreeSpace? Or f.getFreeSpace and f.getTotalSpace?

Contributor Author (@lct45):

Additionally for naming these + the other storage metrics, do we want to add a unit suffix? eg _bytes

Contributor:

> IIRC calculated storage used by doing (f.getTotalSpace - f.getFreeSpace) isn't actually accurate for what we want. Given that, do we just want to report f.getFreeSpace? Or f.getFreeSpace and f.getTotalSpace?

I thought we determined it was probably accurate, and the disparity we saw was very small (18MB) and probably explained by the workload continuing to run between the time you printed the metrics and the time you looked at df. So we decided to get a PR out and we can test on a cloud instance rather than on our macbooks.

And yeah a _bytes suffix makes sense - good call

final String queryId = getQueryId(metric);

// if we haven't seen a task for this query yet
synchronized (metricsSeen) {
Contributor:

can you move this into another method like private synchronized void handleNewSstFilesSizeMetric(final KafkaMetric metric, final String taskId, final String queryId) {...?

ditto for the other side of this in metricRemoval
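
Sketched out, that refactor might look like the following (handler names are taken from the comments here; the bodies are placeholders). Note that a synchronized method locks on this rather than on metricsSeen, so the add and removal handlers need to live on the same object to serialize against each other:

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.metrics.KafkaMetric;

final class HandlerRefactorSketch {
  private final Map<String, Map<String, KafkaMetric>> metricsSeen = new HashMap<>();

  public void metricChange(final KafkaMetric metric) {
    if (!metric.metricName().name().equals("total-sst-files-size")) {
      return;
    }
    final String queryId = metric.metricName().tags().getOrDefault("thread-id", ""); // placeholder lookup
    final String taskId = metric.metricName().tags().getOrDefault("task-id", "");
    handleNewSstFilesSizeMetric(metric, taskId, queryId);
  }

  private synchronized void handleNewSstFilesSizeMetric(
      final KafkaMetric metric,
      final String taskId,
      final String queryId) {
    // Bookkeeping for a newly added state-store metric, serialized with removal below.
    metricsSeen.computeIfAbsent(queryId, q -> new HashMap<>()).put(taskId, metric);
  }

  private synchronized void handleRemovedSstFileSizeMetric(
      final KafkaMetric metric,
      final String taskId,
      final String queryId) {
    final Map<String, KafkaMetric> tasksForQuery = metricsSeen.get(queryId);
    if (tasksForQuery != null) {
      tasksForQuery.remove(taskId);
    }
  }
}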

Contributor Author (@lct45):

And we have to do this because the hashmap is a concurrent hashmap, right? So when we access it we need to do so in a synchronized fashion? Is there something different about using a synchronized method rather than a synchronized block, or is the synchronized method just easier to use?


final String queryId = getQueryId(metric);
final String taskId = metric.metricName().tags().getOrDefault("task-id", "");
final TaskStorageMetric taskMetric = metricsSeen.get(queryId).get(taskId);
Contributor:

move everything below this into another method like:

private synchronized void handleRemovedSstFileSizeMetric(...


private BigInteger computeQueryMetric(final String queryId) {
BigInteger queryMetricSum = BigInteger.ZERO;
for (Map.Entry<String, TaskStorageMetric> entry : metricsSeen.get(queryId).entrySet()) {
Contributor:

We need to synchronize the bit here that iterates over the map. However, we don't want to synchronize the part that gets the metric (to eliminate the risk of a deadlock). So you can write this like:

private BigInteger computeQueryMetric(final String queryId) {
    BigInteger queryMetricSum = BigInteger.ZERO;
    for (final Supplier<BigInteger> gauge : getGaugesForQuery(queryId)) {
        queryMetricSum = queryMetricSum.add(gauge.get());
    }
    return queryMetricSum;
}

private synchronized Collection<Supplier<BigInteger>> getGaugesForQuery(final String queryId) {
    return metricsSeen.get(queryId).values().stream()
       .map(v -> (Supplier<BigInteger>) v::getValue)
       .collect(Collectors.toList());
}

import org.apache.kafka.common.metrics.MetricsReporter;
import org.apache.kafka.streams.StreamsConfig;

public class StorageUtilizationMetrics implements MetricsReporter {
Contributor:

nit: StorageUtilizationMetricsReporter.

Contributor:

Also, I'm wondering when we would implement org.apache.kafka.common.metrics.MetricsReporter and when we would implement io.confluent.ksql.internal.MetricsReporter; they seem to serve the same purpose.

Contributor Author (@lct45):

@rodesai may have something more specific here, but IIUC io.confluent.ksql.internal.MetricsReporter is used here to report a ksql metric object (DataPoint), and org.apache.kafka.common.metrics.MetricsReporter is used for KafkaMetrics.

final String queryIdTag = metric.metricName().tags().getOrDefault("thread-id", "");
final Pattern pattern = Pattern.compile("(?<=query_|transient_)(.*?)(?=-)");
final Matcher matcher = pattern.matcher(queryIdTag);
return matcher.find() ? matcher.group(1) : "";
Contributor:

Would we ever not find a match? When that happens we would not get the queryId, and we would silently put "" into metricsSeen. Should we treat it as a bug instead?

Contributor Author (@lct45):

We should always find a match for queries, although my regex skills aren't incredible, so it depends on whether anyone else sees any issues on L 233 😉. But I think you're right, we should treat it as a bug. I'm hesitant to crash the app if the queryId isn't showing up for a metric; on the other hand, if we only log an error we likely won't know that we're missing metrics... We also don't want to continue reporting any of these metrics if we don't have a queryId, IMO, so maybe throwing is the right way to go. LMK what you think.


private static class TaskStorageMetric {
final MetricName metricName;
private final Map<MetricName, KafkaMetric> metrics = new ConcurrentHashMap<>();
Contributor:

Hmm, are we ever going to have more than one metric here? Since we are filtering on metric.metricName().name().equals("total-sst-files-size") in the first place, we would end up with only that metric name, right?

Contributor Author (@lct45):

From what I understand, total-sst-files-size is broken down by store, so we may have multiple stores per task. The store name is included in the tag map, so the metric names would be different if you had multiple stores under one task.

Contributor:

Ah yes, you're right :)

final String taskId
) {
// remove storage metric for this task
taskMetric.remove(metric);
Contributor:

See my other comment: if we only ever have one metric under that task, then we can simplify this a bit.

@lct45 force-pushed the disk_util branch 2 times, most recently from c1e929a to b231ae0 on August 13, 2021 15:57
nodeTotal,
(Gauge<Long>) (config, now) -> baseDir.getTotalSpace()
);
metricRegistry.addMetric(
Contributor:

Can you add a storage_utilization metric that returns ((baseDir.getTotalSpace - baseDir.getFreeSpace) / baseDir.getTotalSpace)?
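
A sketch of how that gauge might be registered, mirroring the node-storage gauges just above (the metric name and the Gauge<Double> lambda are assumptions; baseDir, metricRegistry, and METRIC_GROUP are the fields from the surrounding code):

final MetricName nodeStorageUtilization =
    metricRegistry.metricName("storage-utilization", METRIC_GROUP);
metricRegistry.addMetric(
    nodeStorageUtilization,
    (Gauge<Double>) (config, now) ->
        // fraction of the state directory's filesystem currently in use
        (double) (baseDir.getTotalSpace() - baseDir.getFreeSpace())
            / baseDir.getTotalSpace()
);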

Contributor Author (@lct45):

There should be no risk of division by 0 here, right? Since our baseDir should always exist

Contributor Author (@lct45):

And do you want this all the way up to the metrics api + exposed to users?

Contributor:

> There should be no risk of division by 0 here, right? Since our baseDir should always exist

yeah that should never happen

> And do you want this all the way up to the metrics api + exposed to users?

yeah, the max value of this metric (over all the nodes) is what we want to show users in cloud

@rodesai (Contributor) left a comment:

LGTM

@lct45 force-pushed the disk_util branch 2 times, most recently from a72c538 to a6c9d24 on August 17, 2021 20:07
@guozhangwang (Contributor):

LGTM.

@lct45 force-pushed the disk_util branch 9 times, most recently from c95154a to ed287ab on August 24, 2021 17:25
@rodesai (Contributor) left a comment:

LGTM

@lct45 merged commit 22a8741 into confluentinc:master on Aug 25, 2021
import org.apache.kafka.common.metrics.MetricsReporter;
import org.apache.kafka.streams.StreamsConfig;

public class StorageUtilizationMetricsReporter implements MetricsReporter {


I don't understand why you are implementing a new MetricsReporter here.

A MetricsReporter is for reporting metrics via different mechanisms (JMX, etc.). For defining new metrics, you typically just create a new KafkaMetric and call Metrics.addMetric().
