feat(metrics): Support dedicated topics per metrics usecase, drop metrics from unknown usecases [INGEST-1309] #1285

untitaker · 2022-06-02T14:49:50Z

Implement routing of metrics to separate topics based on their usecase. We
currently have exactly two usecases (session and transaction metrics),
so those are hardcoded as the known usecases. The config format is
backwards-compatible, and by default both usecases still go to the
ingest-metrics topic.

This change is breaking in that metrics from unknown usecases
(outside of release health and metrics-enhanced performance) are dropped
in processing relays.

External and PoP relays continue to forward arbitrary metrics, as
long as they have a usecase (any string).

As part of that, refactor MRI validation into actual MRI parsing,
which now also means that all metrics have to have a usecase.
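
A minimal sketch of what parsing an MRI (rather than only validating its characters) could look like, assuming the `<ty>:<namespace>/<name>@<unit>` layout implied by the example `d:transactions/duration@ms` quoted in this PR. The names `Mri`, `Namespace`, and `parse_mri` are hypothetical and are not Relay's actual types; defaulting a missing unit to `none` is also an assumption.

```rust
/// Hypothetical namespace enum: the two known usecases plus a catch-all.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Namespace {
    Sessions,
    Transactions,
    Unsupported,
}

/// Hypothetical parsed MRI that borrows its parts from the input string.
#[derive(Debug, PartialEq, Eq)]
pub struct Mri<'a> {
    pub ty: char,
    pub namespace: Namespace,
    pub name: &'a str,
    pub unit: &'a str,
}

/// Parses `<ty>:<namespace>/<name>@<unit>`; returns `None` if the type or
/// namespace part is missing, so every accepted metric has a usecase.
pub fn parse_mri(raw: &str) -> Option<Mri<'_>> {
    let (ty, rest) = raw.split_once(':')?;

    // The type must be exactly one character, such as `c` or `d`.
    let mut ty_chars = ty.chars();
    let ty = match (ty_chars.next(), ty_chars.next()) {
        (Some(c), None) => c,
        _ => return None,
    };

    let (namespace, rest) = rest.split_once('/')?;
    // Assumption: a missing unit is treated as "none".
    let (name, unit) = rest.split_once('@').unwrap_or((rest, "none"));
    if namespace.is_empty() || name.is_empty() {
        return None;
    }

    let namespace = match namespace {
        "sessions" => Namespace::Sessions,
        "transactions" => Namespace::Transactions,
        _ => Namespace::Unsupported,
    };

    Some(Mri { ty, namespace, name, unit })
}
```

In this sketch a name without a namespace fails to parse, while a well-formed name with an unrecognized namespace parses to `Namespace::Unsupported`, which leaves the decision to drop or forward it to the caller.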

jan-auer · 2022-06-03T11:14:52Z

relay-metrics/src/protocol.rs

-    }
+/// A metric name parsed as MRI, a naming scheme which includes most of the metric's bucket key
+/// (excl. timestamp and tags).
+pub struct MetricMri {


MRI stands for "MetricResourceIdentifier", so the "MetricMri" is redundant. The acronym is unique enough to just call it Mri, or fully type it out.

If the name is refactored, I'd rather use MetricResourceIdentifier than Mri.

jan-auer · 2022-06-03T11:24:36Z

relay-config/src/config.rs

+    /// Topic name for metrics extracted from sessions. Defaults to the assignment of `metrics`.
+    pub metrics_sessions: Option<TopicAssignment>,
+    /// Topic name for metrics extracted from transactions. Defaults to the assignment of `metrics`.
+    pub metrics_transactions: Option<TopicAssignment>,


nit: If we introduce a MetricNamespace enum, this could become a mapping of MetricNamespace => TopicAssignment, which would allow for more generic code in all other places. Provided that all unknown namespace strings are mapped to MetricNamespace::Unknown, this would even allow creating the config before updating Relay.

I have not fully thought through all implications here, however.
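
A rough sketch of the mapping idea above, assuming a plain topic-name string in place of `TopicAssignment`. `MetricTopics`, `from_config_key`, and the topic name `ingest-performance-metrics` are hypothetical; only `ingest-metrics` as the default topic comes from the PR description.

```rust
use std::collections::HashMap;

/// Hypothetical namespace enum with a catch-all for strings this build does not know.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MetricNamespace {
    Sessions,
    Transactions,
    Unknown,
}

impl MetricNamespace {
    /// Maps a config key to a namespace; unknown keys do not fail.
    pub fn from_config_key(key: &str) -> Self {
        match key {
            "sessions" => MetricNamespace::Sessions,
            "transactions" => MetricNamespace::Transactions,
            _ => MetricNamespace::Unknown,
        }
    }
}

/// Hypothetical topic table: per-namespace overrides on top of a default topic.
pub struct MetricTopics {
    default_topic: String,
    overrides: HashMap<MetricNamespace, String>,
}

impl MetricTopics {
    pub fn new(default_topic: &str) -> Self {
        MetricTopics {
            default_topic: default_topic.to_owned(),
            overrides: HashMap::new(),
        }
    }

    /// Registers an override. Keys for not-yet-supported namespaces are
    /// accepted, so config can be rolled out before the binary is updated.
    pub fn assign(&mut self, namespace_key: &str, topic: &str) {
        let namespace = MetricNamespace::from_config_key(namespace_key);
        self.overrides.insert(namespace, topic.to_owned());
    }

    /// Returns the override for the namespace, or the default topic.
    pub fn topic_for(&self, namespace: MetricNamespace) -> &str {
        self.overrides
            .get(&namespace)
            .map(String::as_str)
            .unwrap_or(self.default_topic.as_str())
    }
}
```

One limitation of this sketch: all unknown keys collapse into the single `Unknown` entry, so two unknown overrides would overwrite each other.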

iker-barriocanal · 2022-06-03T12:25:22Z

relay-metrics/src/protocol.rs

-    }
+/// A metric name parsed as MRI, a naming scheme which includes most of the metric's bucket key
+/// (excl. timestamp and tags).
+pub struct MetricMri {


If the name is refactored, I'd rather use MetricResourceIdentifier than Mri.

iker-barriocanal · 2022-06-08T08:06:42Z

relay-server/src/actors/store.rs

+            KafkaTopic::MetricsSessions => Some(&self.metrics_sessions),
+            KafkaTopic::MetricsTransactions => Some(&self.metrics_transactions),


Question: can these ever be None?

untitaker · 2022-06-08T15:31:32Z

After extensive conversation with @jan-auer, we decided to stick with the current approach, drop all unknown namespaces, and refactor the Metric struct. Please review from scratch.

jjbayer · 2022-06-08T15:39:55Z

relay-metrics/src/aggregation.rs

-    ///
-    /// See [`Metric::unit`].
-    #[serde(default, skip_serializing_if = "MetricUnit::is_none")]
-    pub unit: MetricUnit,


This struct is serialized and sent over HTTP. Do we expect any compatibility issues?

new relays receiving unit will ignore it (?)

old relays that expect a unit will now log the metric with unit None. But as long as we deploy processing relays first, this should not be a problem.

We checked this through code review yesterday, though it would be good to add these points to the PR description @untitaker.

new relays receiving unit will ignore it

That is correct.

old relays that expect a unit will now log the metric with unit None. But as long as we deploy processing relays first, this should not be a problem.

Since we introduced MRIs initially, there has been a unit in all metrics names we are ingesting so far.

jjbayer · 2022-06-08T15:41:16Z

relay-metrics/src/aggregation.rs

@@ -1410,7 +1422,6 @@ impl Aggregator {
            Entry::Occupied(mut entry) => {
                relay_statsd::metric!(
                    counter(MetricCounters::MergeHit) += 1,
-                    metric_type = entry.key().metric_type.as_str(),


It would be nice if we could still tag this by type.

jjbayer · 2022-06-08T15:42:46Z

relay-metrics/src/aggregation.rs

@@ -780,8 +777,6 @@ struct BucketKey {
    project_key: ProjectKey,
    timestamp: UnixTimestamp,
    metric_name: String,


Should this be of type MetricResourceIdentifier now? Or did we deliberately decide against that to keep the bucket key small(er)?

Having this as a plain String here allows us to be graceful in parsing batched messages that contain invalid MRIs and then drop them in an explicit validation step. Most of the aggregator and pipeline do not need to be concerned with the internal structure of an MRI. This is also one of its important design principles.
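
To illustrate the design described here, a minimal sketch: the bucket keeps its name as a plain `String`, and a separate validation step drops entries whose names are not well-formed. `Bucket`, `is_valid_mri`, and `validate_buckets` are hypothetical stand-ins, and the validity check is deliberately much simpler than real MRI parsing.

```rust
/// Hypothetical, simplified bucket: the metric name stays a plain `String`,
/// so a batch containing an invalid name can still be represented.
#[derive(Debug, Clone, PartialEq)]
pub struct Bucket {
    pub name: String,
    pub value: f64,
}

/// Stand-in for full MRI parsing: requires a one-character type followed by
/// `:` and a non-empty `<namespace>/<name>` part.
fn is_valid_mri(name: &str) -> bool {
    match name.split_once(':') {
        Some((ty, rest)) => {
            ty.len() == 1
                && matches!(
                    rest.split_once('/'),
                    Some((namespace, name)) if !namespace.is_empty() && !name.is_empty()
                )
        }
        None => false,
    }
}

/// Explicit validation step: keeps valid buckets and reports how many were dropped.
pub fn validate_buckets(buckets: Vec<Bucket>) -> (Vec<Bucket>, usize) {
    let total = buckets.len();
    let valid: Vec<Bucket> = buckets
        .into_iter()
        .filter(|bucket| is_valid_mri(&bucket.name))
        .collect();
    let dropped = total - valid.len();
    (valid, dropped)
}
```

Code outside `validate_buckets` only moves `Bucket` values around and never needs to look inside the name.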

jjbayer · 2022-06-09T09:20:39Z

relay-metrics/src/aggregation.rs

+
+                let mut metric_name = mri.to_string();
+                // do this so cost tracking still works accurately.
+                metric_name.shrink_to_fit();


Cost tracking uses String::capacity(), so it should still be accurate, but shrinking it sounds like a good idea anyway.

I would not bother with such details here and rather ensure cost tracking is accurate enough (which it is). Otherwise we have to worry about that everywhere in the code base, which is not scalable.

well then I have to change the tests to assert for different values. I'll do that then

shrink_to_fit changes capacity so I think it's appropriate. or we change it to len

I dabbled around with it just now and I think it's more complicated to write a predictable testcase/assertion if shrink_to_fit is removed. Unless there are strong concerns about the perf overhead of doing this i would prefer to keep it.
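
For context on the capacity-versus-length point in this thread, a small sketch using only the standard library. `name_cost` and `build_name` are hypothetical helpers; the only claim taken from the thread is that cost tracking reads `String::capacity()`.

```rust
/// Hypothetical cost function: charges for allocated bytes (`capacity`),
/// not for the bytes actually in use (`len`).
pub fn name_cost(name: &String) -> usize {
    name.capacity()
}

/// Builds a normalized name and releases spare capacity, so the tracked cost
/// depends less on how the string happened to be built.
pub fn build_name(ty: char, namespace: &str, name: &str, unit: &str) -> String {
    let mut out = format!("{}:{}/{}@{}", ty, namespace, name, unit);
    // The capacity left behind by formatting is an implementation detail;
    // shrinking brings it as close to `len()` as the allocator allows.
    out.shrink_to_fit();
    out
}
```

The standard library guarantees only that capacity stays at least as large as the length after `shrink_to_fit`, so an assertion written against `len()` describes a lower bound on cost rather than an exact value.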

jjbayer · 2022-06-09T09:26:13Z

relay-metrics/src/aggregation.rs

@@ -1527,6 +1536,7 @@ impl Aggregator {
            let bucket_interval = self.config.bucket_interval;
            let cost_tracker = &mut self.cost_tracker;
            self.buckets.retain(|key, entry| {
+                dbg!(&key);


Suggested change (remove the leftover debug statement)

-                dbg!(&key);

jjbayer · 2022-06-09T09:31:51Z

relay-metrics/src/protocol.rs

+    ///
+    /// Note that the format used in the statsd protocol is different: Metric names are not prefixed
+    /// with `<ty>:` as the type is somewhere else in the protocol.
+    pub name: String,


nit: Same as for bucket: Should we make this a MetricResourceIdentifier? With appropriate serializing logic so it still serializes as a string.

Thanks. Closing the conversation here in favor of the one on bucket.

I think we should capture why we use String over MetricResourceIdentifier in the docstrings, based on this convo.

jan-auer

Structure and approach are great. Thorough review attached.

jan-auer · 2022-06-09T09:51:48Z

relay-server/src/actors/store.rs

+                namespace: MetricNamespace::Sessions,
+                ..
+            }) => KafkaTopic::MetricsSessions,
+            _ => {


Please match explicitly on MetricNamespace::Unsupported here.

fair. there's the error case as well to take care of, since the name is str. so I added that as well
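
A sketch of the explicit match discussed in this thread, with one arm per namespace and a separate arm for the parse error instead of a `_` wildcard. The enums reuse names from the diff, but their definitions here, `ParseError`, `parse_namespace`, and `topic_for` are hypothetical; returning `None` stands in for "log and drop".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetricNamespace {
    Sessions,
    Transactions,
    Unsupported,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KafkaTopic {
    MetricsSessions,
    MetricsTransactions,
}

/// Hypothetical error for names that are not well-formed MRIs.
#[derive(Debug, PartialEq, Eq)]
pub struct ParseError;

/// Extracts only the namespace from `<ty>:<namespace>/<name>...`.
pub fn parse_namespace(name: &str) -> Result<MetricNamespace, ParseError> {
    let (_ty, rest) = name.split_once(':').ok_or(ParseError)?;
    let (namespace, _name) = rest.split_once('/').ok_or(ParseError)?;
    Ok(match namespace {
        "sessions" => MetricNamespace::Sessions,
        "transactions" => MetricNamespace::Transactions,
        _ => MetricNamespace::Unsupported,
    })
}

/// Picks the topic for a metric name; `None` means the metric is dropped.
pub fn topic_for(name: &str) -> Option<KafkaTopic> {
    match parse_namespace(name) {
        Ok(MetricNamespace::Sessions) => Some(KafkaTopic::MetricsSessions),
        Ok(MetricNamespace::Transactions) => Some(KafkaTopic::MetricsTransactions),
        // Listed explicitly so that adding a namespace variant later becomes
        // a compile error here instead of silently falling into a wildcard.
        Ok(MetricNamespace::Unsupported) => None,
        Err(ParseError) => None,
    }
}
```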

jan-auer · 2022-06-09T09:54:20Z

relay-server/src/actors/store.rs

@@ -374,8 +385,26 @@ impl StoreForwarder {
    }

    fn send_metric_message(&self, message: MetricKafkaMessage) -> Result<(), StoreError> {
+        let topic = match message.name.parse() {


nit: This would be easier to read if you map the namespace here. For instance:

Suggested change

-        let topic = match message.name.parse() {
+        let mri = message.name.parse::<MetricResourceIdentifier>();
+        let topic = match mri.map(|mri| mri.namespace) {

jan-auer · 2022-06-09T09:55:14Z

relay-server/src/actors/store.rs

+                ..
+            }) => KafkaTopic::MetricsSessions,
+            _ => {
+                relay_log::configure_scope(|scope| {


nit: Rather use with_scope to prevent leaking this into other errors produced adjacently.

jan-auer · 2022-06-09T09:57:14Z

relay-metrics/src/protocol.rs

+    /// The namespace/usecase for this metric. For example `sessions` or `transactions`.
+    pub namespace: MetricNamespace,
+    /// The actual name, such as `duration` as part of `d:transactions/duration@ms`
+    pub name: Cow<'a, str>,


From how we use this, can this always be borrowed? It seems we only ever construct this temporarily, so there's always a string to borrow from.

that's a good point, right now we go through FromStr which prevents us from effectively borrowing. I un-implemented FromStr and am using a struct method now, i don't think there's another way to get the borrowing behavior otherwise
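
A sketch of why `FromStr` gets in the way of borrowing: `FromStr::from_str` takes `&str` and returns `Self` with no lifetime linking the two, so the result has to own its data. An inherent method can tie the output lifetime to the input. `OwnedMri`, `BorrowedMri`, and `split` are hypothetical illustrations, not the types in this PR.

```rust
use std::borrow::Cow;
use std::str::FromStr;

/// Splits `<ty>:<namespace>/<name>[@<unit>]` into borrowed namespace and name.
fn split(s: &str) -> Option<(&str, &str)> {
    let (_ty, rest) = s.split_once(':')?;
    let (namespace, rest) = rest.split_once('/')?;
    let name = rest.split_once('@').map_or(rest, |(name, _unit)| name);
    Some((namespace, name))
}

/// Parsed via `FromStr`: the trait signature forces owned fields.
#[derive(Debug, PartialEq)]
pub struct OwnedMri {
    pub namespace: String,
    pub name: String,
}

impl FromStr for OwnedMri {
    type Err = ();

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (namespace, name) = split(s).ok_or(())?;
        // Two allocations per parse, even if the result is discarded right away.
        Ok(OwnedMri {
            namespace: namespace.to_owned(),
            name: name.to_owned(),
        })
    }
}

/// Parsed via an inherent method: the `'a` lifetime lets fields borrow.
#[derive(Debug, PartialEq)]
pub struct BorrowedMri<'a> {
    pub namespace: &'a str,
    pub name: Cow<'a, str>,
}

impl<'a> BorrowedMri<'a> {
    pub fn parse(s: &'a str) -> Option<Self> {
        let (namespace, name) = split(s)?;
        Some(BorrowedMri {
            namespace,
            name: Cow::Borrowed(name),
        })
    }
}
```

Keeping `Cow` for the name leaves room for an owned value when a name is built rather than parsed, while the parse path stays allocation-free.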

jan-auer · 2022-06-09T09:59:05Z

relay-metrics/src/protocol.rs

-            unit,
+            name: MetricResourceIdentifier {
+                ty: value.ty(),
+                name: name.to_string().into(),


nit: Change the interface of new_mri to prevent the intermediate allocation here. impl AsRef<str> should do.
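
A sketch of the interface change suggested here, assuming the goal is to let callers pass `&str`, `String`, or `&String` without first calling `to_string()`. `format_mri` is a hypothetical function and not the `new_mri` from the PR.

```rust
/// Hypothetical constructor-style helper: accepts anything that can be viewed
/// as a `&str`, so callers do not allocate an intermediate `String` just to
/// satisfy the signature.
pub fn format_mri(
    ty: char,
    namespace: impl AsRef<str>,
    name: impl AsRef<str>,
    unit: impl AsRef<str>,
) -> String {
    format!(
        "{}:{}/{}@{}",
        ty,
        namespace.as_ref(),
        name.as_ref(),
        unit.as_ref()
    )
}
```

With this signature the only allocation is the final formatted string; the arguments are read through `as_ref()` and never copied on their own.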

jan-auer · 2022-06-09T10:10:58Z

relay-metrics/src/protocol.rs

+    Transactions,
+    /// Metrics that relay doesn't know the namespace of, and will drop before aggregating.
+    ///
+    /// We could make this variant contain a string such that customer and PoP-relays can forward


This is a nice code comment if you prefer to keep it but should better not go into public docs. We have plenty of enums following this pattern though.

I shortened this comment to not talk too much about impl

jan-auer · 2022-06-09T10:12:29Z

relay-metrics/src/aggregation.rs

@@ -758,6 +752,9 @@ enum AggregateMetricsErrorKind {
    /// A metric bucket had invalid characters in the metric name.
    #[fail(display = "found invalid characters")]
    InvalidCharacters,
+    /// A metric bucket had an unknown namespace in the metric name.
+    #[fail(display = "found invalid namespace")]


nit: The namespace variant is called Unsupported rather than "invalid", we could align that term. I do not have a preference.

jan-auer · 2022-06-09T10:13:33Z

relay-metrics/src/aggregation.rs

-            });
-            return Err(AggregateMetricsErrorKind::InvalidCharacters.into());
-        }
+        key.metric_name = match key.metric_name.parse::<MetricResourceIdentifier>() {


nit: Since this is quite a block of code that normalizes the MRI, let's introduce a function for that.

jan-auer · 2022-06-09T10:15:19Z

relay-metrics/src/aggregation.rs

+                        key.project_key.as_str().to_owned().into(),
+                    );
+                    scope.set_extra("bucket.metric_name", key.metric_name.into());
+                });


nit: Could we deduplicate scope configuration with the block above, since both only get called in the error case?

i sort of assumed that's too hard wrt borrowing but seems like my brain is still stuck in times before stacked borrows. fixed

jan-auer · 2022-06-09T10:16:42Z

relay-metrics/src/aggregation.rs

+
+                let mut metric_name = mri.to_string();
+                // do this so cost tracking still works accurately.
+                metric_name.shrink_to_fit();


I would not bother with such details here and rather ensure cost tracking is accurate enough (which it is). Otherwise we have to worry about that everywhere in the code base, which is not scalable.

iker-barriocanal · 2022-06-13T13:37:12Z

relay-metrics/src/protocol.rs

+    /// The namespace/usecase for this metric. For example `sessions` or `transactions`.
+    pub namespace: MetricNamespace,


Since namespaces are optional, I assume we track them as empty strings. If this is the case, I think we should add this to the docstrings.

iker-barriocanal · 2022-06-13T13:59:07Z

relay-metrics/src/protocol.rs

+    ///
+    /// Note that the format used in the statsd protocol is different: Metric names are not prefixed
+    /// with `<ty>:` as the type is somewhere else in the protocol.
+    pub name: String,


I think we should capture why we use String over MetricResourceIdentifier in the docstrings, based on this convo.

iker-barriocanal · 2022-06-13T15:29:13Z

relay-metrics/src/aggregation.rs

@@ -690,15 +690,10 @@ pub struct Bucket {
    pub timestamp: UnixTimestamp,
    /// The length of the time window in seconds.
    pub width: u64,
-    /// The name of the metric without its unit.
+    /// The MRI (metric resource identifier).
    ///
    /// See [`Metric::name`].
    pub name: String,


Not part of the PR, but if the identifier of a bucket is going to be the MRI, I think we should rename this property.

Unfortunately this is public-facing API so we'd have to add #[serde(rename = "name")] then

iker-barriocanal · 2022-06-14T08:07:40Z

relay-metrics/src/protocol.rs

-    /// The unit can be omitted and defaults to [`MetricUnit::None`].
-    #[serde(default, skip_serializing_if = "MetricUnit::is_none")]
-    pub unit: MetricUnit,
+    /// * **Type:** counter (`c`), set (`s`), distribution (`d`), gauge (`g`), and evaluated (`e`) for derived numeric metrics. See [`MetricType`].


We don't ingest these:

Suggested change

-    /// * **Type:** counter (`c`), set (`s`), distribution (`d`), gauge (`g`), and evaluated (`e`) for derived numeric metrics. See [`MetricType`].
+    /// * **Type:** counter (`c`), set (`s`), distribution (`d`), and gauge (`g`). See [`MetricType`].

i updated the docstring to clarify that e is not observed in ingestion, but I kept it here
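
A sketch of the type codes listed in this docstring, assuming ingestion accepts only the four codes from the suggested wording. The `MetricType` definition and its methods here are hypothetical. Leaving out `e` reflects the statement in this thread that it is not observed in ingestion, not a claim about how Relay models it.

```rust
/// Hypothetical metric type enum covering the four ingested types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetricType {
    Counter,
    Distribution,
    Set,
    Gauge,
}

impl MetricType {
    /// Parses the single-letter code used as the `<ty>` prefix of an MRI.
    pub fn from_code(code: &str) -> Option<Self> {
        match code {
            "c" => Some(MetricType::Counter),
            "d" => Some(MetricType::Distribution),
            "s" => Some(MetricType::Set),
            "g" => Some(MetricType::Gauge),
            // Assumption: `e` (evaluated) and anything else is rejected on ingestion.
            _ => None,
        }
    }

    /// Returns the single-letter code for this type.
    pub fn as_code(self) -> &'static str {
        match self {
            MetricType::Counter => "c",
            MetricType::Distribution => "d",
            MetricType::Set => "s",
            MetricType::Gauge => "g",
        }
    }
}
```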

jan is OOO, i think we can merge this

untitaker requested a review from a team June 2, 2022 14:49

add changelog

c27c4df

untitaker changed the title ~~feat(metrics): Support dedicated topics per metrics usecase [INGEST-1309]~~ feat(metrics): Support dedicated topics per metrics usecase, drop metrics from unknown usecases [INGEST-1309] Jun 2, 2022

jjbayer reviewed Jun 2, 2022

jjbayer approved these changes Jun 2, 2022

jan-auer self-requested a review June 2, 2022 16:00

jan-auer reviewed Jun 3, 2022

iker-barriocanal reviewed Jun 3, 2022

untitaker added 2 commits June 7, 2022 15:23

default to custom namespace, revert tests, rename mri struct

44f0baf

fix tests

ec63f36

iker-barriocanal reviewed Jun 8, 2022

untitaker added 6 commits June 8, 2022 12:49

Revert "fix tests"

8153c66

This reverts commit ec63f36.

Revert "default to custom namespace, revert tests, rename mri struct"

fced438

This reverts commit 44f0baf.

wip

937f307

fix tests

abeb4b8

update docs

0f045a2

rename mri

6698a4d

untitaker requested review from jjbayer, iker-barriocanal and jan-auer June 8, 2022 15:31

untitaker added 4 commits June 8, 2022 17:33

Merge remote-tracking branch 'origin/master' into feat/split-metrics-…

3024f80

…topics

remove unused import

85f1cbd

fix tests

2484593

remove unsafe code

24e4462

jjbayer reviewed Jun 9, 2022

jan-auer previously requested changes Jun 9, 2022

untitaker added 2 commits June 9, 2022 17:34

apply review feedback

1376eab

succumb to clippy dogma

f5bffc3

iker-barriocanal approved these changes Jun 13, 2022

iker-barriocanal reviewed Jun 14, 2022

untitaker requested a review from jjbayer June 14, 2022 11:07

untitaker and others added 2 commits June 14, 2022 13:08

Update relay-metrics/src/protocol.rs

bc24df3

Co-authored-by: Iker Barriocanal <[email protected]>

apply review feedback

0f81651

jjbayer reviewed Jun 14, 2022

jjbayer approved these changes Jun 14, 2022

flub approved these changes Jun 14, 2022

Merge branch 'master' into feat/split-metrics-topics

2335b96

untitaker merged commit e593819 into master Jun 14, 2022

untitaker deleted the feat/split-metrics-topics branch June 14, 2022 18:21
