-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(metrics): Support dedicated topics per metrics usecase, drop metrics from unknown usecases [INGEST-1309] #1285
Conversation
] Implement routing metrics to separate topics based on their usecase. We currently have exactly two usecases (Session- and transaction metrics) so those are just hardcoded as known usecases. As part of that, refactor MRI validation into actual MRI _parsing_, which now also means that all metrics have to have a usecase.
relay-metrics/src/protocol.rs
Outdated
} | ||
/// A metric name parsed as MRI, a naming scheme which includes most of the metric's bucket key | ||
/// (excl. timestamp and tags). | ||
pub struct MetricMri { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MRI stands for "MetricResourceIdentifier", so the "MetricMri" is redundant. The acronym is unique enough to just call it Mri
, or fully type it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the name is refactored, I rather MetricResourceIdentifier
over Mri
.
/// Topic name for metrics extracted from sessions. Defaults to the assignment of `metrics`. | ||
pub metrics_sessions: Option<TopicAssignment>, | ||
/// Topic name for metrics extracted from transactions. Defaults to the assignment of `metrics`. | ||
pub metrics_transactions: Option<TopicAssignment>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: If we introduce a MetricNamespace enum, this could become a mapping of MetricNamespace
=> TopicAssignment
, which would allow for more generic code in all other places. Provided that all unknown namespace strings are mapped to MetricNamespace::Unknown
, this would even allow to create config before updating Relay.
I have not fully thought through all implications here, however.
relay-metrics/src/protocol.rs
Outdated
} | ||
/// A metric name parsed as MRI, a naming scheme which includes most of the metric's bucket key | ||
/// (excl. timestamp and tags). | ||
pub struct MetricMri { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the name is refactored, I rather MetricResourceIdentifier
over Mri
.
KafkaTopic::MetricsSessions => Some(&self.metrics_sessions), | ||
KafkaTopic::MetricsTransactions => Some(&self.metrics_transactions), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: can these ever be None
?
After extensive conversation with @jan-auer, we decided to stick with the current approach, drop all unknown namespaces, and refactor the Metric struct. Please review from scratch. |
/// | ||
/// See [`Metric::unit`]. | ||
#[serde(default, skip_serializing_if = "MetricUnit::is_none")] | ||
pub unit: MetricUnit, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This struct is serialized and sent over HTTP. Do we expect any compatibility issues?
- new relays receiving
unit
will ignore it (?) - old relays that expect a unit will now log the metric with unit
None
. But as long as we deploy processing relays first, this should not be a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We checked this through code review yesterday, though it would be good to add these points to the PR description @untitaker.
- new relays receiving
unit
will ignore it
That is correct.
- old relays that expect a unit will now log the metric with unit
None
. But as long as we deploy processing relays first, this should not be a problem.
Since we introduced MRIs initially, there has been a unit in all metrics names we are ingesting so far.
@@ -1410,7 +1422,6 @@ impl Aggregator { | |||
Entry::Occupied(mut entry) => { | |||
relay_statsd::metric!( | |||
counter(MetricCounters::MergeHit) += 1, | |||
metric_type = entry.key().metric_type.as_str(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if we could still tag this by type.
@@ -780,8 +777,6 @@ struct BucketKey { | |||
project_key: ProjectKey, | |||
timestamp: UnixTimestamp, | |||
metric_name: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be of type MetricResourceIdentifier
now? Or did we deliberately decide against that to keep the bucket key small(er)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having this a plain String here allows us to be graceful in parsing batched messages that contain invalid MRIs and then drop them in an explicit validation step. Most of the aggregator and pipeline do not need to be concerned with the internal structure of an MRI. This is also one of its important design principles.
|
||
let mut metric_name = mri.to_string(); | ||
// do this so cost tracking still works accurately. | ||
metric_name.shrink_to_fit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cost tracking uses String::capacity()
, so it should still be accurate, but shrinking it sounds like a good idea anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not bother with such details here and rather ensure cost tracking is accurate enough (which it is). Otherwise we have to worry about that everywhere in the code base, which is not scalable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well then I have to change the tests to assert for different values. I'll do that then
shrink_to_fit changes capacity so I think it's appropriate. or we change it to len
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dabbled around with it just now and I think it's more complicated to write a predictable testcase/assertion if shrink_to_fit is removed. Unless there are strong concerns about the perf overhead of doing this i would prefer to keep it.
relay-metrics/src/aggregation.rs
Outdated
@@ -1527,6 +1536,7 @@ impl Aggregator { | |||
let bucket_interval = self.config.bucket_interval; | |||
let cost_tracker = &mut self.cost_tracker; | |||
self.buckets.retain(|key, entry| { | |||
dbg!(&key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dbg!(&key); |
/// | ||
/// Note that the format used in the statsd protocol is different: Metric names are not prefixed | ||
/// with `<ty>:` as the type is somewhere else in the protocol. | ||
pub name: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Same as for bucket: Should we make this a MetricsResourceIdentifier
? With appropriate serializing logic so it still serializes as a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Closing the conversation here in favor of the one on bucket
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should capture why we use String
over MetricsResourceIdentifier
in the docstrings, based on this convo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Structure and approach are great. Thorough review attached.
relay-server/src/actors/store.rs
Outdated
namespace: MetricNamespace::Sessions, | ||
.. | ||
}) => KafkaTopic::MetricsSessions, | ||
_ => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please match explicitly on MetricNamespace::Unsupported
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair. there's the error case as well to take care of, since the name is str. so I added that as well
relay-server/src/actors/store.rs
Outdated
@@ -374,8 +385,26 @@ impl StoreForwarder { | |||
} | |||
|
|||
fn send_metric_message(&self, message: MetricKafkaMessage) -> Result<(), StoreError> { | |||
let topic = match message.name.parse() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This would be easier to read if you map the namespace
here. For instance:
let topic = match message.name.parse() { | |
let mri: MetricResourceIdentifier = message.name.parse(); | |
let topic = match mri.map(|mri| mri.namespace) { |
relay-server/src/actors/store.rs
Outdated
.. | ||
}) => KafkaTopic::MetricsSessions, | ||
_ => { | ||
relay_log::configure_scope(|scope| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Rather use with_scope
to prevent leaking this into other errors produced adjacently.
relay-metrics/src/protocol.rs
Outdated
/// The namespace/usecase for this metric. For example `sessions` or `transactions`. | ||
pub namespace: MetricNamespace, | ||
/// The actual name, such as `duration` as part of `d:transactions/duration@ms` | ||
pub name: Cow<'a, str>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From how we use this, can this always be borrowed? It seems we always ever temporarily construct this, so there's always a string to borrow from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a good point, right now we go through FromStr which prevents us from effectively borrowing. I un-implemented FromStr and am using a struct method now, i don't think there's another way to get the borrowing behavior otherwise
relay-metrics/src/protocol.rs
Outdated
unit, | ||
name: MetricResourceIdentifier { | ||
ty: value.ty(), | ||
name: name.to_string().into(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Change the interface of new_mri
to prevent the intermediate allocation here. impl AsRef<str>
should do.
relay-metrics/src/protocol.rs
Outdated
Transactions, | ||
/// Metrics that relay doesn't know the namespace of, and will drop before aggregating. | ||
/// | ||
/// We could make this variant contain a string such that customer and PoP-relays can forward |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice code comment if you prefer to keep it but should better not go into public docs. We have plenty of enums following this pattern though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I shortened this comment to not talk too much about impl
relay-metrics/src/aggregation.rs
Outdated
@@ -758,6 +752,9 @@ enum AggregateMetricsErrorKind { | |||
/// A metric bucket had invalid characters in the metric name. | |||
#[fail(display = "found invalid characters")] | |||
InvalidCharacters, | |||
/// A metric bucket had an unknown namespace in the metric name. | |||
#[fail(display = "found invalid namespace")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: The namespace variant is called Unsupported
rather than "invalid", we could align that term. I do not have a preference.
relay-metrics/src/aggregation.rs
Outdated
}); | ||
return Err(AggregateMetricsErrorKind::InvalidCharacters.into()); | ||
} | ||
key.metric_name = match key.metric_name.parse::<MetricResourceIdentifier>() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Since this is quite a block of code that normalizes the MRI, let's introduce a function for that.
relay-metrics/src/aggregation.rs
Outdated
key.project_key.as_str().to_owned().into(), | ||
); | ||
scope.set_extra("bucket.metric_name", key.metric_name.into()); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could we deduplicate scope configuration with the block above, since both only get called in the error case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i sort of assumed that's too hard wrt borrowing but seems like by brain is still stuck in times before stacked borrows. fixed
|
||
let mut metric_name = mri.to_string(); | ||
// do this so cost tracking still works accurately. | ||
metric_name.shrink_to_fit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not bother with such details here and rather ensure cost tracking is accurate enough (which it is). Otherwise we have to worry about that everywhere in the code base, which is not scalable.
relay-metrics/src/protocol.rs
Outdated
/// The namespace/usecase for this metric. For example `sessions` or `transactions`. | ||
pub namespace: MetricNamespace, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since namespaces are optional, I assume we track them as empty strings. If this is the case, I think we should add this to the docstrings.
/// | ||
/// Note that the format used in the statsd protocol is different: Metric names are not prefixed | ||
/// with `<ty>:` as the type is somewhere else in the protocol. | ||
pub name: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should capture why we use String
over MetricsResourceIdentifier
in the docstrings, based on this convo.
@@ -690,15 +690,10 @@ pub struct Bucket { | |||
pub timestamp: UnixTimestamp, | |||
/// The length of the time window in seconds. | |||
pub width: u64, | |||
/// The name of the metric without its unit. | |||
/// The MRI (metric resource identifier). | |||
/// | |||
/// See [`Metric::name`]. | |||
pub name: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not part of the PR, but if the identifier of a bucket is going to be the MRI, I think we should rename this property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately this is public-facing API so we'd have to add #[serde(rename = "name")]
then
relay-metrics/src/protocol.rs
Outdated
/// The unit can be omitted and defaults to [`MetricUnit::None`]. | ||
#[serde(default, skip_serializing_if = "MetricUnit::is_none")] | ||
pub unit: MetricUnit, | ||
/// * **Type:** counter (`c`), set (`s`), distribution (`d`), gauge (`g`), and evaluated (`e`) for derived numeric metrics. See [`MetricType`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't ingest these:
/// * **Type:** counter (`c`), set (`s`), distribution (`d`), gauge (`g`), and evaluated (`e`) for derived numeric metrics. See [`MetricType`]. | |
/// * **Type:** counter (`c`), set (`s`), distribution (`d`), and gauge (`g`). See [`MetricType`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i updated the docstring to clarify that e
is not observed in ingestion, but I kept it here
Co-authored-by: Iker Barriocanal <[email protected]>
jan is OOO, i think we can merge this
Implement routing metrics to separate topics based on their usecase. We
currently have exactly two usecases (Session- and transaction metrics)
so those are just hardcoded as known usecases. The config format is
backwards-compatible, and by default both usecases still go to the
ingest-metrics topic.
This change is breaking in that metrics from unknown usecases
(outside of release health and metrics-enhanced performance) are dropped
in processing relays.
External and PoP-relays continue to forward arbitrary metrics... as
long as they have a usecase (any string).
As part of that, refactor MRI validation into actual MRI parsing,
which now also means that all metrics have to have a usecase.