Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(metrics): Support dedicated topics per metrics usecase, drop metrics from unknown usecases [INGEST-1309] #1285

Merged
merged 19 commits into from
Jun 14, 2022

Conversation

untitaker
Copy link
Member

@untitaker untitaker commented Jun 2, 2022

Implement routing metrics to separate topics based on their usecase. We
currently have exactly two usecases (Session- and transaction metrics)
so those are just hardcoded as known usecases. The config format is
backwards-compatible, and by default both usecases still go to the
ingest-metrics topic.

This change is breaking in that metrics from unknown usecases
(outside of release health and metrics-enhanced performance) are dropped
in processing relays.

External and PoP-relays continue to forward arbitrary metrics... as
long as they have a usecase (any string).

As part of that, refactor MRI validation into actual MRI parsing,
which now also means that all metrics have to have a usecase.

]

Implement routing metrics to separate topics based on their usecase. We
currently have exactly two usecases (Session- and transaction metrics)
so those are just hardcoded as known usecases.

As part of that, refactor MRI validation into actual MRI _parsing_,
which now also means that all metrics have to have a usecase.
@untitaker untitaker requested a review from a team June 2, 2022 14:49
@untitaker untitaker changed the title feat(metrics): Support dedicated topics per metrics usecase [INGEST-1309] feat(metrics): Support dedicated topics per metrics usecase, drop metrics from unknown usecases [INGEST-1309] Jun 2, 2022
@jan-auer jan-auer self-requested a review June 2, 2022 16:00
}
/// A metric name parsed as MRI, a naming scheme which includes most of the metric's bucket key
/// (excl. timestamp and tags).
pub struct MetricMri {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MRI stands for "MetricResourceIdentifier", so the "MetricMri" is redundant. The acronym is unique enough to just call it Mri, or fully type it out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the name is refactored, I rather MetricResourceIdentifier over Mri.

/// Topic name for metrics extracted from sessions. Defaults to the assignment of `metrics`.
pub metrics_sessions: Option<TopicAssignment>,
/// Topic name for metrics extracted from transactions. Defaults to the assignment of `metrics`.
pub metrics_transactions: Option<TopicAssignment>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If we introduce a MetricNamespace enum, this could become a mapping of MetricNamespace => TopicAssignment, which would allow for more generic code in all other places. Provided that all unknown namespace strings are mapped to MetricNamespace::Unknown, this would even allow to create config before updating Relay.

I have not fully thought through all implications here, however.

}
/// A metric name parsed as MRI, a naming scheme which includes most of the metric's bucket key
/// (excl. timestamp and tags).
pub struct MetricMri {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the name is refactored, I rather MetricResourceIdentifier over Mri.

Comment on lines +80 to +81
KafkaTopic::MetricsSessions => Some(&self.metrics_sessions),
KafkaTopic::MetricsTransactions => Some(&self.metrics_transactions),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: can these ever be None?

@untitaker
Copy link
Member Author

After extensive conversation with @jan-auer, we decided to stick with the current approach, drop all unknown namespaces, and refactor the Metric struct. Please review from scratch.

///
/// See [`Metric::unit`].
#[serde(default, skip_serializing_if = "MetricUnit::is_none")]
pub unit: MetricUnit,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct is serialized and sent over HTTP. Do we expect any compatibility issues?

  • new relays receiving unit will ignore it (?)
  • old relays that expect a unit will now log the metric with unit None. But as long as we deploy processing relays first, this should not be a problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We checked this through code review yesterday, though it would be good to add these points to the PR description @untitaker.

  • new relays receiving unit will ignore it

That is correct.

  • old relays that expect a unit will now log the metric with unit None. But as long as we deploy processing relays first, this should not be a problem.

Since we introduced MRIs initially, there has been a unit in all metrics names we are ingesting so far.

@@ -1410,7 +1422,6 @@ impl Aggregator {
Entry::Occupied(mut entry) => {
relay_statsd::metric!(
counter(MetricCounters::MergeHit) += 1,
metric_type = entry.key().metric_type.as_str(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could still tag this by type.

@@ -780,8 +777,6 @@ struct BucketKey {
project_key: ProjectKey,
timestamp: UnixTimestamp,
metric_name: String,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be of type MetricResourceIdentifier now? Or did we deliberately decide against that to keep the bucket key small(er)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this a plain String here allows us to be graceful in parsing batched messages that contain invalid MRIs and then drop them in an explicit validation step. Most of the aggregator and pipeline do not need to be concerned with the internal structure of an MRI. This is also one of its important design principles.


let mut metric_name = mri.to_string();
// do this so cost tracking still works accurately.
metric_name.shrink_to_fit();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cost tracking uses String::capacity(), so it should still be accurate, but shrinking it sounds like a good idea anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not bother with such details here and rather ensure cost tracking is accurate enough (which it is). Otherwise we have to worry about that everywhere in the code base, which is not scalable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well then I have to change the tests to assert for different values. I'll do that then

shrink_to_fit changes capacity so I think it's appropriate. or we change it to len

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dabbled around with it just now and I think it's more complicated to write a predictable testcase/assertion if shrink_to_fit is removed. Unless there are strong concerns about the perf overhead of doing this i would prefer to keep it.

@@ -1527,6 +1536,7 @@ impl Aggregator {
let bucket_interval = self.config.bucket_interval;
let cost_tracker = &mut self.cost_tracker;
self.buckets.retain(|key, entry| {
dbg!(&key);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dbg!(&key);

///
/// Note that the format used in the statsd protocol is different: Metric names are not prefixed
/// with `<ty>:` as the type is somewhere else in the protocol.
pub name: String,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Same as for bucket: Should we make this a MetricsResourceIdentifier? With appropriate serializing logic so it still serializes as a string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Closing the conversation here in favor of the one on bucket.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should capture why we use String over MetricsResourceIdentifier in the docstrings, based on this convo.

Copy link
Member

@jan-auer jan-auer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structure and approach are great. Thorough review attached.

namespace: MetricNamespace::Sessions,
..
}) => KafkaTopic::MetricsSessions,
_ => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please match explicitly on MetricNamespace::Unsupported here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair. there's the error case as well to take care of, since the name is str. so I added that as well

@@ -374,8 +385,26 @@ impl StoreForwarder {
}

fn send_metric_message(&self, message: MetricKafkaMessage) -> Result<(), StoreError> {
let topic = match message.name.parse() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This would be easier to read if you map the namespace here. For instance:

Suggested change
let topic = match message.name.parse() {
let mri: MetricResourceIdentifier = message.name.parse();
let topic = match mri.map(|mri| mri.namespace) {

..
}) => KafkaTopic::MetricsSessions,
_ => {
relay_log::configure_scope(|scope| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Rather use with_scope to prevent leaking this into other errors produced adjacently.

/// The namespace/usecase for this metric. For example `sessions` or `transactions`.
pub namespace: MetricNamespace,
/// The actual name, such as `duration` as part of `d:transactions/duration@ms`
pub name: Cow<'a, str>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From how we use this, can this always be borrowed? It seems we always ever temporarily construct this, so there's always a string to borrow from.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good point, right now we go through FromStr which prevents us from effectively borrowing. I un-implemented FromStr and am using a struct method now, i don't think there's another way to get the borrowing behavior otherwise

unit,
name: MetricResourceIdentifier {
ty: value.ty(),
name: name.to_string().into(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Change the interface of new_mri to prevent the intermediate allocation here. impl AsRef<str> should do.

Transactions,
/// Metrics that relay doesn't know the namespace of, and will drop before aggregating.
///
/// We could make this variant contain a string such that customer and PoP-relays can forward
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice code comment if you prefer to keep it but should better not go into public docs. We have plenty of enums following this pattern though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I shortened this comment to not talk too much about impl

@@ -758,6 +752,9 @@ enum AggregateMetricsErrorKind {
/// A metric bucket had invalid characters in the metric name.
#[fail(display = "found invalid characters")]
InvalidCharacters,
/// A metric bucket had an unknown namespace in the metric name.
#[fail(display = "found invalid namespace")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The namespace variant is called Unsupported rather than "invalid", we could align that term. I do not have a preference.

});
return Err(AggregateMetricsErrorKind::InvalidCharacters.into());
}
key.metric_name = match key.metric_name.parse::<MetricResourceIdentifier>() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Since this is quite a block of code that normalizes the MRI, let's introduce a function for that.

key.project_key.as_str().to_owned().into(),
);
scope.set_extra("bucket.metric_name", key.metric_name.into());
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could we deduplicate scope configuration with the block above, since both only get called in the error case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i sort of assumed that's too hard wrt borrowing but seems like by brain is still stuck in times before stacked borrows. fixed


let mut metric_name = mri.to_string();
// do this so cost tracking still works accurately.
metric_name.shrink_to_fit();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not bother with such details here and rather ensure cost tracking is accurate enough (which it is). Otherwise we have to worry about that everywhere in the code base, which is not scalable.

Comment on lines 224 to 225
/// The namespace/usecase for this metric. For example `sessions` or `transactions`.
pub namespace: MetricNamespace,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since namespaces are optional, I assume we track them as empty strings. If this is the case, I think we should add this to the docstrings.

///
/// Note that the format used in the statsd protocol is different: Metric names are not prefixed
/// with `<ty>:` as the type is somewhere else in the protocol.
pub name: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should capture why we use String over MetricsResourceIdentifier in the docstrings, based on this convo.

@@ -690,15 +690,10 @@ pub struct Bucket {
pub timestamp: UnixTimestamp,
/// The length of the time window in seconds.
pub width: u64,
/// The name of the metric without its unit.
/// The MRI (metric resource identifier).
///
/// See [`Metric::name`].
pub name: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of the PR, but if the identifier of a bucket is going to be the MRI, I think we should rename this property.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this is public-facing API so we'd have to add #[serde(rename = "name")] then

/// The unit can be omitted and defaults to [`MetricUnit::None`].
#[serde(default, skip_serializing_if = "MetricUnit::is_none")]
pub unit: MetricUnit,
/// * **Type:** counter (`c`), set (`s`), distribution (`d`), gauge (`g`), and evaluated (`e`) for derived numeric metrics. See [`MetricType`].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't ingest these:

Suggested change
/// * **Type:** counter (`c`), set (`s`), distribution (`d`), gauge (`g`), and evaluated (`e`) for derived numeric metrics. See [`MetricType`].
/// * **Type:** counter (`c`), set (`s`), distribution (`d`), and gauge (`g`). See [`MetricType`].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i updated the docstring to clarify that e is not observed in ingestion, but I kept it here

@untitaker untitaker requested a review from jjbayer June 14, 2022 11:07
@untitaker untitaker dismissed jan-auer’s stale review June 14, 2022 18:20

jan is OOO, i think we can merge this

@untitaker untitaker merged commit e593819 into master Jun 14, 2022
@untitaker untitaker deleted the feat/split-metrics-topics branch June 14, 2022 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants