-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return an http error during scraping if metrics collide when escaped to underscores #1641
base: main
Are you sure you want to change the base?
Conversation
So the edge case here is:
This should collide, but because the first metric's hash was Compat mode, it doesn't match the hash of the second metric, and we don't detect the collision. I think this is another case where we can say "don't do that", but I'm not sure how best to define what "that" is. In general, I think users should not define metrics in init() functions. OR, we could say, don't define UTF-8 metrics in init functions. |
Another option is to store hashes for both UTF-8 and compat. This adds 64bits per metric and another collectorsByID map (since we'll need to store both the compat and utf8 collector ids depending on the mode). don't love that either |
I am leaning towards having a second map in the registerer that only checks for compat fqname+labelname collisions. It's a memory increase but it gets around all these issues. |
prometheus/desc.go
Outdated
@@ -140,15 +142,45 @@ func (v2) NewDesc(fqName, help string, variableLabels ConstrainableLabels, const | |||
d.err = fmt.Errorf("duplicate label names in constant and variable labels for metric %q", fqName) | |||
return d | |||
} | |||
|
|||
d.id, d.dimHash = makeHashes(labelNames, labelValues, help, true) | |||
d.compatID, d.compatDimHash = makeHashes(labelNames, labelValues, help, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this mutates labelNames so it's not pretty
prometheus/registry.go
Outdated
uncheckedCollectors []Collector | ||
pedanticChecksEnabled bool | ||
utf8Collision bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we're adding a uint64, empty struct, and Collector interface obj for every metric registered. This could be a lot of memory with many millions of metrics. Perhaps we could only store the compat IDs if they are different from the default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I have now implemented this)
prometheus/registry.go
Outdated
@@ -391,6 +481,10 @@ func (r *Registry) Unregister(c Collector) bool { | |||
defer r.mtx.Unlock() | |||
|
|||
delete(r.collectorsByID, collectorID) | |||
if _, exists := r.collectorsByCompatID[compatID]; !exists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is other cleanup needed
I think this new approach is a decent solution. We can flip UTF-8 mode on and off at will and we don't pay a huge memory or performance penalty (that I can see) |
prometheus/desc.go
Outdated
@@ -57,6 +57,8 @@ type Desc struct { | |||
// must be unique among all registered descriptors and can therefore be | |||
// used as an identifier of the descriptor. | |||
id uint64 | |||
// compatID is similar to id, but is a hash of all the relevant names escaped with underscores. | |||
compatID uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I first saw this variable I thought it was a typo for compactID, but I see you're are using compact is other places as well.
What does compat means? 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And still related to this variable, have we thought about using id
to store the hash of the already normalized descriptor(instead of creating a new hash)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"compatible." as in, "compatible with legacy systems." I need to pick better names for sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have we thought about using id to store the hash of the already normalized descriptor?
do you mean only storing the normalized descriptor? We need both in case the user switches the mode to UTF-8 comparison with no legacy compatibility. Or do you mean something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I meant just storing the normalized descriptor. Good point that's it's possible to change the validation in-flight!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why someone would like to do that though 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the record (as discussed in slack), it's because metrics are often created in init()
functions before the user has a chance to actually configure the correct escaping scheme. So we have to be ready for both scenarios.
b08de89
to
8e9ef8c
Compare
97cad93
to
118a3c1
Compare
prometheus/promhttp/http.go
Outdated
if hasEscapedCollisions { | ||
switch contentType.ToEscapingScheme() { | ||
case model.UnderscoreEscaping, model.DotsEscaping: | ||
opts.ErrorLog.Println("error: one or more metrics collide when escaped") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a better error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe something like
"One or more metrics collide when UTF-8 characters in name/label are translated to underscores"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
Since we always allow registration regardless of collision, I would just move the test closer to the HTTP package. We can keep the same test cases you've added, but the test would look similar to this:
func TestEscapedCollisions(t *testing.T) {
reg := prometheus.NewRegistry()
reg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
Name: "test_metric",
Help: "A test metric with underscores",
}))
reg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
Name: "test.metric",
Help: "A test metric with dots",
}))
handler := HandlerFor(reg, HandlerOpts{})
writer := httptest.NewRecorder()
request, _ := http.NewRequest("GET", "/metrics", nil)
request.Header.Add(acceptEncodingHeader, "the header that requires escaping")
handler.ServeHTTP(writer, request)
require.Equal(t, "some http error", writer.Code)
request, _ := http.NewRequest("GET", "/metrics", nil)
request.Header.Add(acceptEncodingHeader, "the header that allows utf8")
handler.ServeHTTP(writer, request)
require.Equal(t, "no erro error", writer.Code)
}
I'd like to keep the current unit test since the implementation is a little weird, but I will add this too. |
I'll squash this down one we're done with notes |
prometheus/promhttp/http_test.go
Outdated
Help: "A test metric with dots", | ||
})) | ||
|
||
handler := HandlerFor(reg, HandlerOpts{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the sucky part, it's necessary to specify the registry here or else we don't detect the collision case (because Gatherer does not know about collisions)
aad8c01
to
7f3763c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Just one comment about adding another dependency, sorry about that 😬
prometheus/registry_test.go
Outdated
require.NoError(t, err) | ||
err = reg.Register(tc.counterB()) | ||
if !tc.expectErr { | ||
require.NoError(t, err) | ||
} else { | ||
require.Error(t, err) | ||
} | ||
require.Equal(t, tc.expectLegacyCollision, reg.HasEscapedCollision()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for creating the confusion here, that's my bad! Bartek made a good point for not using require here, would you mind reverting the change? 🙈
We always have problems when we introduce new dependencies, let's avoid it if we can
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted
…to underscores Signed-off-by: Owen Williams <[email protected]>
7f3763c
to
42825b6
Compare
Signed-off-by: Owen Williams <[email protected]>
@@ -134,6 +136,7 @@ func HandlerForTransactional(reg prometheus.TransactionalGatherer, opts HandlerO | |||
} | |||
} | |||
} | |||
hasEscapedCollisions = reg.HasEscapedCollision() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm... should I use the singular or the plural? :)
Help: "A test metric with dots", | ||
})) | ||
|
||
handler := HandlerFor(reg, HandlerOpts{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tada
@@ -194,6 +205,10 @@ func (gf GathererFunc) Gather() ([]*dto.MetricFamily, error) { | |||
return gf() | |||
} | |||
|
|||
func (gf GathererFunc) HasEscapedCollision() bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the most risky change. both mimir and avalanche use gathererfunc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nevermind, avalanche doesn't use it
https://github.com/grafana/mimir/blob/main/pkg/api/handlers.go#L279
@@ -57,6 +57,9 @@ type Desc struct { | |||
// must be unique among all registered descriptors and can therefore be | |||
// used as an identifier of the descriptor. | |||
id uint64 | |||
// escapedID is similar to id, but with the metric and label names escaped | |||
// with underscores. | |||
escapedID uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use id
for escaped ID?
} | ||
d.id = xxh.Sum64() | ||
d.escapedID = escapedXXH.Sum64() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, why not use id as escaped ID always?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this!
I think I would love to understand this change more before committing here. Asked some likely stupid questions on the issue #1638 (comment)
We never did those kind of checks on HTTP handler only, only on registration layer, I wonder if it's worth to keep that. Having your app working fine and suddenly /metrics are erroring only because 2 series collide is bit unexpected (especially by default) and wasteful, scrape could just ignore one problematic metric.
If anything we could start with a special Gatherer that checks that, without changes to descriptor ever, and see who would actually use it (who would like to opt-in for extra safety here). I would also be happy to consider adding this logic on-registry check/error when escaped metric name is colliding OR only check escaped ones (is there a case when escaped don't collide, but non-escaped collide?)
Is it worth to double check those first @ArthurSens ?
We are going to put this on hold and have a meeting to discuss the various approaches and their pros and cons |
fixes #1638
This change prevents a possible issue when UTF-8 metrics are scraped by a system that does not support UTF-8. The metric names will be escaped, and if two metrics escape to the same legacy name, they will appear in the exposition "twice", causing collision problems. Therefore, during scraping we return an http error if two metrics escape to the same legacy string.
This change creates a new uint64 per metric and two new maps in the Registry. The new maps are only populated if the hash of the original metric data contains UTF-8 data. This means memory usage will be lower in the near term and will go up in the long term or for OTLP-heavy users. Eventually (i.e. Prom 4.0 or later) we could deprecate the compatibility modes with legacy systems to save this memory, but it's also not clear to me that the memory usage will be that high.
To avoid performance issues checking for collisions on every scrape, the actual collision detection is done at registration time but only reported at scrape time.