receive: Replication #1270
Conversation
As replication wasn't specified in the original design, could you move the paragraph from open questions to the "Rollout/scaling/failure of receiver nodes" section and add appropriate detail?
The general strategy looks sound. Just a few nits to keep this clean.
pkg/receive/handler.go
Outdated
		errs = err
		continue
	}
	errs = errors.Wrap(errs, err.Error())
MultiError makes more sense here, no? In fact, `parallelizeRequests` could immediately return a MultiError.
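For illustration, a minimal sketch of what returning a combined error could look like, assuming the `MultiError` helper from the Prometheus TSDB errors package that Thanos vendors (the import path, helper name, and channel shape here are assumptions, not the actual handler code):

```go
package receive

import (
	terrors "github.com/prometheus/tsdb/errors"
)

// collectErrors is a hypothetical helper standing in for the error handling
// inside parallelizeRequests: it drains a channel of per-request errors and
// returns them as one combined error, or nil if every request succeeded.
func collectErrors(ec <-chan error) error {
	var errs terrors.MultiError
	for err := range ec {
		// Add ignores nil errors, so only real failures accumulate.
		errs.Add(err)
	}
	// Err returns nil when the list is empty, otherwise the MultiError itself.
	return errs.Err()
}
```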
Sounds good! I didn’t realize we had a util for that :)
pkg/receive/handler.go
Outdated
@@ -49,16 +50,42 @@ var (
		},
		[]string{"handler"},
	)
	replicationRequestsTotal = prometheus.NewCounter(
Let's move not only the registration, but also the initialization of these to the NewHandler function.
Then we should also pass the `prometheus.Registry` into NewHandler?!
Precisely.
The registry is already passed into NewHandler via the options struct 👍
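For readers following along, a rough sketch of the shape being agreed on here: metrics created and registered inside NewHandler, with the registry coming in through the options struct. The Options layout, field names, and metric name below are illustrative assumptions, not the exact Thanos code:

```go
package receive

import "github.com/prometheus/client_golang/prometheus"

// Options carries handler configuration; only the fields relevant to this
// thread are shown, and their names are assumptions.
type Options struct {
	Registry          *prometheus.Registry
	ReplicationFactor uint64
}

type Handler struct {
	options              *Options
	forwardRequestsTotal *prometheus.CounterVec
}

// NewHandler both creates and registers the handler's metrics, so nothing is
// initialized at package level and the caller controls the registry.
func NewHandler(o *Options) *Handler {
	h := &Handler{
		options: o,
		forwardRequestsTotal: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "thanos_receive_forward_requests_total",
				Help: "The number of forward requests, by result.",
			},
			[]string{"result"},
		),
	}
	if o.Registry != nil {
		o.Registry.MustRegister(h.forwardRequestsTotal)
	}
	return h
}
```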
pkg/receive/handler.go
Outdated
// Increment the counters as necessary now that | ||
// the requests will go out. | ||
defer func() { | ||
requestCounter.Inc() |
This will also be increased, even when there is an error. Is that intended?
Yes, this is tracking the total number of requests, failed or successful. Using the error counter we can know the percentage of errors.
It's probably worth unifying these into one metric with different labels for success/failure.
Yep, I'd prefer having a unified counter. 👍
So don't track the total; instead use one counter vec with a result label for success or failure. When we want to know the percentage of failed requests we do errors/(errors+successes)?
correct
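A minimal sketch of the pattern agreed here: a single counter vec with a "result" label, incremented exactly once per request in a defer. The type and field names are hypothetical, and per the earlier thread the vec would be created and registered in NewHandler rather than at package level:

```go
package receive

import "github.com/prometheus/client_golang/prometheus"

// replicator is a hypothetical stand-in for the handler; requests is the
// unified counter vec, labelled by "result" (success or error).
type replicator struct {
	requests *prometheus.CounterVec
}

// replicate runs a single replication request and records its outcome once,
// after it has completed, regardless of success or failure.
func (r *replicator) replicate(do func() error) (err error) {
	defer func() {
		result := "success"
		if err != nil {
			result = "error"
		}
		r.requests.WithLabelValues(result).Inc()
	}()
	return do()
}
```

With this shape, the failure percentage from the question above falls out of a query over the one metric, e.g. `rate(c{result="error"}[5m]) / rate(c[5m])` for whatever name the counter is registered under, which is exactly the errors/(errors+successes) calculation.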
This commit adds a new replication feature to the Thanos receiver. By default, replication is turned off; however, when replication is enabled (replication factor >= 2), the target node for a time series replicates it to the other nodes concurrently and synchronously. If the replication to >= (rf+1)/2 nodes fails, the original write request is failed.
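As a quick illustration of the threshold arithmetic in that description (the function name is hypothetical, not the handler's actual code):

```go
package receive

// failsQuorum reports whether too many replications failed: per the commit
// message, the write request fails once at least (rf+1)/2 replications fail.
// With integer division, rf=3 tolerates one failed replica, while rf=2 and
// rf=1 tolerate none.
func failsQuorum(failures, replicationFactor int) bool {
	return failures >= (replicationFactor+1)/2
}
```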
Very slick 👍 And the tests are so simple, really love it!
### Replication

The Thanos receiver supports replication of received time-series to other receivers in the same hashring. The replication factor is controlled by setting a flag on the receivers and indicates the maximum number of copies of any time-series that should be stored in the hashring. If any time-series in a write request received by a Thanos receiver is not successfully written to at least `(REPLICATION_FATOR + 1)/2` nodes, the receiver responds with an error. For example, to attempt to store 3 copies of every time-series and ensure that every time-series is successfully written to at least 2 Thanos receivers in the target hashring, all receivers should be configured with the following flag:
Suggested change:
-The Thanos receiver supports replication of received time-series to other receivers in the same hashring. The replication factor is controlled by setting a flag on the receivers and indicates the maximum number of copies of any time-series that should be stored in the hashring. If any time-series in a write request received by a Thanos receiver is not successfully written to at least `(REPLICATION_FATOR + 1)/2` nodes, the receiver responds with an error. For example, to attempt to store 3 copies of every time-series and ensure that every time-series is successfully written to at least 2 Thanos receivers in the target hashring, all receivers should be configured with the following flag:
+The Thanos receiver supports replication of received time-series to other receivers in the same hashring. The replication factor is controlled by setting a flag on the receivers and indicates the maximum number of copies of any time-series that should be stored in the hashring. If any time-series in a write request received by a Thanos receiver is not successfully written to at least `(REPLICATION_FACTOR + 1)/2` nodes, the receiver responds with an error. For example, to attempt to store 3 copies of every time-series and ensure that every time-series is successfully written to at least 2 Thanos receivers in the target hashring, all receivers should be configured with the following flag:
This commit makes a small fix to the spelling of `REPLICATION_FACTOR` as suggested by bwplotka in #1270 (comment).
This commit adds a new replication feature to the Thanos receiver. By default, replication is turned off; however, when replication is enabled (replication factor >= 2), the target node for a time series replicates it to the other nodes concurrently and synchronously. If the replication to >= (rf+1)/2 nodes fails, the original write request is failed.
Changes
Verification
cc @brancz @bwplotka @metalmatze