
Having duplicate analysis entries will corrupt /metrics end-point #2374

Open

MarkSRobinson opened this issue Oct 26, 2022 · 5 comments
Labels
bug Something isn't working

Comments

MarkSRobinson (Contributor) commented Oct 26, 2022

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of Argo Rollouts.

Describe the bug

If a user accidentally creates an AnalysisTemplate whose metrics list contains two entries with the same name, it corrupts the controller's metrics and puts Argo Rollouts into a broken state. The /metrics endpoint returns an error, which causes Kubernetes to mark the controller pod as unready.

To Reproduce

Create an AnalysisTemplate with duplicated metric entries (https://gist.github.com/MarkSRobinson/7203d76d265a8217919662c1b57eb9cb). This causes the metrics endpoint to break and stay broken.

Just kubectl apply -f the supplied file and Argo Rollouts will have a broken metrics endpoint.

Restarting Argo Rollouts does not fix it.

Deleting the AnalysisTemplate does fix it.

Expected behavior

Posting an invalid analysis template should not break the /metrics endpoint.

Screenshots

Output from /metrics

An error has occurred while serving metrics:

[from Gatherer #1] collected metric "analysis_template_metric_info" { label:<name:"metric" value:"configpush_ticks" > label:<name:"name" value:"configpush-analysis" > label:<name:"namespace" value:"preview-configpush-57719" > label:<name:"type" value:"Prometheus" > gauge:<value:1 > } was collected before with the same name and label values
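
For reference, this is the Prometheus client registry's standard duplicate-detection error: any custom collector that sends the same metric name and label values twice during one scrape will trip it, and the /metrics handler then serves the "An error has occurred while serving metrics" message shown above instead of the metric families. A minimal standalone Go sketch of that failure mode (illustrative only, not the actual controller collector; the label values are copied from the output above):

// Standalone sketch: a collector that emits the same (name, labels) pair twice,
// as happens when an AnalysisTemplate contains two metrics with the same name.
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

type dupCollector struct{ desc *prometheus.Desc }

func (c dupCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c dupCollector) Collect(ch chan<- prometheus.Metric) {
	// Send the identical gauge twice in one scrape.
	for i := 0; i < 2; i++ {
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 1,
			"configpush_ticks", "configpush-analysis", "preview-configpush-57719", "Prometheus")
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(dupCollector{desc: prometheus.NewDesc(
		"analysis_template_metric_info", "illustrative help text",
		[]string{"metric", "name", "namespace", "type"}, nil,
	)})
	_, err := reg.Gather()
	fmt.Println(err) // "... was collected before with the same name and label values"
}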

Version

1.3.1

Logs

time="2022-10-26T18:12:18Z" level=info msg="Started syncing rollout" generation=1 namespace=preview-configpush-57719 resourceVersion=243750927 rollout=configpush
time="2022-10-26T18:12:18Z" level=error msg="The Rollout \"configpush\" is invalid: spec.strategy.canary.analysis.templates: Invalid value: \"templateNames: [configpush-analysis]\": two metrics have the same name 'configpush_ticks'" namespace=preview-configpush-57719 rollout=configpush
time="2022-10-26T18:12:18Z" level=info msg="Reconciliation completed" generation=1 namespace=preview-configpush-57719 resourceVersion=243750927 rollout=configpush time_ms=1.94425

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

MarkSRobinson added the bug label Oct 26, 2022
zachaller added this to the v1.4 milestone Oct 27, 2022
zachaller removed this from the v1.4 milestone Dec 21, 2022
github-actions (Contributor) commented:

This issue is stale because it has been open 60 days with no activity.

pims commented Oct 31, 2023

We're running into this issue as well and are currently evaluating options for how best to solve it.

pims commented Oct 31, 2023

A quick untested fix for this very specific issue could be:

diff --git a/controller/metrics/analysis.go b/controller/metrics/analysis.go
index 12480b76..f4bcc67d 100644
--- a/controller/metrics/analysis.go
+++ b/controller/metrics/analysis.go
@@ -107,8 +107,17 @@ func collectAnalysisTemplate(ch chan<- prometheus.Metric, namespace, name string
 	}
 	addGauge(MetricAnalysisTemplateInfo, 1)
 
-	for _, metric := range at.Metrics {
+	seen := make(map[string]struct{}, len(at.Metrics))
+	for i, metric := range at.Metrics {
 		metricType := metricproviders.Type(metric)
-		addGauge(MetricAnalysisTemplateMetricInfo, 1, metricType, metric.Name)
+		metricName := metric.Name
+		_, found := seen[metricName]
+		if found {
+			metricName = fmt.Sprintf("%s-%d", metricName, i)
+		} else {
+			seen[metricName] = struct{}{}
+		}
+
+		addGauge(MetricAnalysisTemplateMetricInfo, 1, metricType, metricName)
 	}
 }
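
To make the intended behaviour of that change concrete, here is the same de-duplication pattern in isolation (a standalone sketch with names of my own choosing, not the controller code):

package main

import "fmt"

// dedupMetricNames keeps the first occurrence of each name as-is and gives any
// later duplicate an index suffix, so the emitted gauges no longer collide in
// the Prometheus registry. (Hypothetical helper that mirrors the diff above.)
func dedupMetricNames(names []string) []string {
	seen := make(map[string]struct{}, len(names))
	out := make([]string, 0, len(names))
	for i, name := range names {
		if _, found := seen[name]; found {
			name = fmt.Sprintf("%s-%d", name, i)
		} else {
			seen[name] = struct{}{}
		}
		out = append(out, name)
	}
	return out
}

func main() {
	// With the template from this report, the duplicate second entry is renamed.
	fmt.Println(dedupMetricNames([]string{"configpush_ticks", "configpush_ticks"}))
	// Output: [configpush_ticks configpush_ticks-1]
}

The renamed label value no longer matches the metric name in the template, so this only keeps /metrics serving; it doesn't make the template itself valid.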

but looking at the bigger picture, #1946 is probably the best long term fix.
@zachaller is this a feature you'd be willing to consider contributions for?

zachaller (Collaborator) commented:

@pims I would be interested in either a fix for this specific issue or the bigger fix of moving the validation logic to an admission webhook. Are you interested in contributing?

pims commented Nov 1, 2023

Let's start with the fix for this issue and discuss a plan for moving the validation logic to an admission webhook in #1946.
