Document possible race conditions during federation #514

Open
juliusv opened this issue Aug 11, 2016 · 15 comments

Comments

@juliusv
Member

juliusv commented Aug 11, 2016

See prometheus/prometheus#1887 (comment)

@brian-brazil
Contributor

This isn't specific to federation; it's the case with any query.

@juliusv
Member Author

juliusv commented Aug 11, 2016

It's much more relevant for federation, because there you can end up in a situation where some metrics of a target are a scrape interval ahead of other metrics of the same target for a long while, and that skew gets persisted.

@brian-brazil
Contributor

brian-brazil commented Aug 11, 2016

The same can happen if a recording rule is running during a scrape, which is a more supported use case.

@juliusv
Member Author

juliusv commented Aug 11, 2016

Good point. Maybe this should be documented for both cases then...

@brian-brazil
Contributor

It's a more general execution-model issue: nothing is atomic.

@juliusv
Member Author

juliusv commented Aug 11, 2016

Yeah, but federation and rules are the only places I can think of where there can be such a long delay that gets persisted, no? In normal graphs, you could only have that kind of glitch at the far right ("now") end, where data is still coming in. But if you use that far-right end to persist new data, then you have this problem.

@brian-brazil
Contributor

You have exactly the same issue with recording rules.

@beorn7
Member

beorn7 commented Aug 15, 2016

Idea to solve the issue for both cases:

  • Maintain an ingestion watermark timestamp, i.e. the latest timestamp that has been ingested to completion.
  • Only accept samples for rule evaluation and federation that are at or before that timestamp.
  • The "fast path" of returning the latest sample can still be used, but if the timestamp is after the watermark (which should be rare), we dive into the chunk and get an earlier sample.

If you think this is feasible, we can open an issue in prometheus/prometheus.
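
For illustration, a minimal sketch of what such a watermark gate could look like on the read side. This is not how Prometheus is implemented; all names here (`ingestionWatermark`, `latestCompleteSample`, `advanceWatermark`) are made up for the sake of the example.

```go
// Minimal sketch of the watermark idea above, not actual Prometheus code.
package main

import (
	"fmt"
	"sync/atomic"
)

// ingestionWatermark is the latest timestamp (ms) that has been ingested
// to completion; the scrape/rule machinery advances it after a full append.
var ingestionWatermark atomic.Int64

type sample struct {
	ts int64 // milliseconds since epoch
	v  float64
}

// latestCompleteSample returns the newest sample at or before the watermark.
// The fast path (newest sample already covered by the watermark) needs no
// further lookups; otherwise we fall back to older samples in the chunk.
func latestCompleteSample(newest sample, older []sample) (sample, bool) {
	wm := ingestionWatermark.Load()
	if newest.ts <= wm {
		return newest, true // fast path
	}
	for i := len(older) - 1; i >= 0; i-- { // rare slow path
		if older[i].ts <= wm {
			return older[i], true
		}
	}
	return sample{}, false
}

// advanceWatermark is called once a whole scrape has been appended.
func advanceWatermark(ts int64) { ingestionWatermark.Store(ts) }

func main() {
	advanceWatermark(1000)
	s, ok := latestCompleteSample(sample{ts: 2000, v: 1}, []sample{{ts: 500, v: 0.5}})
	fmt.Println(s, ok) // falls back to the ts=500 sample, which is fully ingested
}
```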

@fabxc
Contributor

fabxc commented Aug 15, 2016

In the end this goes back to SampleAppender accepting a slice of samples (https://github.com/prometheus/prometheus/blob/master/storage/storage.go#L29), which is made visible for reads atomically. We had this discussion in the past and I'm strongly for it.

There's really no point in an additional "fast path" exposing partially scraped data.
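
Very roughly, the interface change being discussed could look like the sketch below. This is illustrative only, not the actual SampleAppender, and the atomic visibility would still have to be provided by the storage layer behind it.

```go
// Illustrative sketch only: a batch appender that makes a whole scrape's
// samples visible to readers in one step. Names are made up.
package main

import (
	"fmt"
	"sync"
)

type sample struct {
	metric map[string]string
	ts     int64
	v      float64
}

// batchAppender accepts all samples of one scrape at once, so a storage
// implementation can publish them (and advance a watermark) as a unit.
type batchAppender interface {
	AppendBatch(samples []sample) error
}

type memStore struct {
	mu      sync.RWMutex
	visible []sample // what queries, federation, and rules may read
}

// AppendBatch publishes the whole scrape in one critical section, so a
// concurrent reader never sees half of a scrape.
func (s *memStore) AppendBatch(samples []sample) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.visible = append(s.visible, samples...)
	return nil
}

func main() {
	var a batchAppender = &memStore{}
	_ = a.AppendBatch([]sample{
		{metric: map[string]string{"__name__": "up"}, ts: 1000, v: 1},
		{metric: map[string]string{"__name__": "http_requests_total"}, ts: 1000, v: 42},
	})
	fmt.Println("batch appended")
}
```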


@beorn7
Member

beorn7 commented Aug 15, 2016

> In the end this goes back to SampleAppender accepting a slice of samples (https://github.com/prometheus/prometheus/blob/master/storage/storage.go#L29), which is made visible for reads atomically. We had this discussion in the past and I'm strongly for it.

Sadly, changing the interface doesn't magically make ingestion atomic.

My point in that discussion was that such an interface would suggest atomicity, which we cannot provide at the moment. Should we be able to do so, I'm all for such an interface.

> There's really no point in an additional "fast path" exposing partially scraped data.

The fast path is already there.

@fabxc
Contributor

fabxc commented Aug 15, 2016

Hence the second part of the sentence. Of course it needs support from the storage behind it. But isn't the watermark suggestion you made exactly that? And it, in turn, would be weird to implement without being able to ingest whole sets of scraped samples.

Yes, but if we added support for watermarks/atomic scrapes, I doubt we'd want to keep the current behavior around. It would probably add implementation overhead, and Prometheus doesn't operate at a level where waiting a second or two for all samples to be stored would be a problem.


@brian-brazil
Contributor

This may fall out of prometheus/prometheus#398, as the data required is the same.

@beorn7
Member

beorn7 commented Aug 15, 2016

Atomicity is on the one hand a much stronger requirement than what I proposed here (e.g. in terms of handling an error halfway through the slice of samples). On the other hand, it wouldn't solve the problem at hand. The watermark would need to be wired to the setting of a timestamp when a scrape is initiated, long before samples from that scrape hit the storage layer.

In any case, the idea doesn't seem to be completely insane, so I'll file an issue in prometheus/prometheus about it. We can have further discussions there. Here it's pretty much off-topic, and the discussion of an interface change as a possible result of a possible solution is yet another level away, or two…
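
To make the "wired to the setting of a timestamp when a scrape is initiated" part concrete, here is a purely hypothetical scrape-loop sketch (none of these functions exist in Prometheus). It ignores coordination across multiple concurrent targets, which a real watermark would have to handle so that it never advances past the start of any in-flight scrape.

```go
// Hypothetical sketch: the scrape timestamp is fixed when the scrape starts,
// but the watermark only advances once every sample from that scrape has
// been appended. Names (scrapeOnce, appendAll, advanceWatermark) are made up.
package main

import "time"

type sample struct {
	ts int64
	v  float64
}

func scrapeLoop(
	interval time.Duration,
	scrapeOnce func(ts int64) []sample, // fetch and parse the target
	appendAll func([]sample) error,     // hand the whole scrape to storage
	advanceWatermark func(ts int64),    // make ts visible to federation/rules
	stop <-chan struct{},
) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case t := <-ticker.C:
			ts := t.UnixMilli() // timestamp assigned at scrape start
			samples := scrapeOnce(ts)
			if err := appendAll(samples); err != nil {
				continue // don't advance the watermark past a failed scrape
			}
			advanceWatermark(ts) // only now may reads at or before ts include it
		}
	}
}

func main() {
	stop := make(chan struct{})
	go scrapeLoop(
		10*time.Millisecond,
		func(ts int64) []sample { return []sample{{ts: ts, v: 1}} },
		func([]sample) error { return nil },
		func(int64) {},
		stop,
	)
	time.Sleep(50 * time.Millisecond)
	close(stop)
}
```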

@ncabatoffims

> The same can happen if a recording rule is running during a scrape, which is a more supported use case.

Hang on, weren't upstream recording rules the proposed workaround to avoid the race in federation vs. sample ingestion? But you seem to be saying that they have the exact same problem.

@brian-brazil
Contributor

I said it was less likely to be a problem, not that it'd solve it.
