receive: Replication #1270
Conversation
As replication wasn't specified in the original design, could you move the paragraph from open questions to the "Rollout/scaling/failure of receiver nodes" section and add appropriate detail?
The general strategy looks sound. Just a few nits to keep this clean.
pkg/receive/handler.go
Outdated
		errs = err
		continue
	}
	errs = errors.Wrap(errs, err.Error())
MultiError makes more sense here, no? In fact, `parallelizeRequests` could immediately return a MultiError.
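For illustration, a minimal sketch of what returning a combined error could look like, assuming the `MultiError` helper from the Prometheus TSDB errors package that Thanos vendors (the import path, helper name, and channel shape here are assumptions, not the actual handler code):

```go
package receive

import (
	terrors "github.com/prometheus/tsdb/errors"
)

// collectErrors is a hypothetical helper standing in for the error handling
// inside parallelizeRequests: it drains a channel of per-request errors and
// returns them as one combined error, or nil if every request succeeded.
func collectErrors(ec <-chan error) error {
	var errs terrors.MultiError
	for err := range ec {
		// Add ignores nil errors, so only real failures accumulate.
		errs.Add(err)
	}
	// Err returns nil when the list is empty, otherwise the MultiError itself.
	return errs.Err()
}
```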
Sounds good! I didn’t realize we had a util for that :)
pkg/receive/handler.go
Outdated
@@ -49,16 +50,42 @@ var (
		},
		[]string{"handler"},
	)
	replicationRequestsTotal = prometheus.NewCounter(
Let's move not only the registration, but also the initialization of these to the NewHandler function.
Then we should also pass the `prometheus.Registry` into NewHandler?!
Precisely.
The registry is already passed into NewHandler via the options struct 👍
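For readers following along, a rough sketch of the shape being agreed on here: metrics created and registered inside NewHandler, with the registry coming in through the options struct. The Options layout, field names, and metric name below are illustrative assumptions, not the exact Thanos code:

```go
package receive

import "github.com/prometheus/client_golang/prometheus"

// Options carries handler configuration; only the fields relevant to this
// thread are shown, and their names are assumptions.
type Options struct {
	Registry          *prometheus.Registry
	ReplicationFactor uint64
}

type Handler struct {
	options              *Options
	forwardRequestsTotal *prometheus.CounterVec
}

// NewHandler both creates and registers the handler's metrics, so nothing is
// initialized at package level and the caller controls the registry.
func NewHandler(o *Options) *Handler {
	h := &Handler{
		options: o,
		forwardRequestsTotal: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "thanos_receive_forward_requests_total",
				Help: "The number of forward requests, by result.",
			},
			[]string{"result"},
		),
	}
	if o.Registry != nil {
		o.Registry.MustRegister(h.forwardRequestsTotal)
	}
	return h
}
```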
pkg/receive/handler.go
Outdated
// Increment the counters as necessary now that | ||
// the requests will go out. | ||
defer func() { | ||
requestCounter.Inc() |
This will also be increased, even when there is an error. Is that intended?
Yes, this is tracking the total number of requests, failed or successful. Using the error counter we can know the percentage of errors.
It's probably worth unifying these into one metric with different labels for success/failure.
Yep, I'd prefer having a unified counter. 👍
So don't track the total; instead use one counter vec with a result label for success or failure. When we want to know the percentage of failed requests we do errors/(errors+successes)?
correct
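A minimal sketch of the pattern agreed here: a single counter vec with a "result" label, incremented exactly once per request in a defer. The type and field names are hypothetical, and per the earlier thread the vec would be created and registered in NewHandler rather than at package level:

```go
package receive

import "github.com/prometheus/client_golang/prometheus"

// replicator is a hypothetical stand-in for the handler; requests is the
// unified counter vec, labelled by "result" (success or error).
type replicator struct {
	requests *prometheus.CounterVec
}

// replicate runs a single replication request and records its outcome once,
// after it has completed, regardless of success or failure.
func (r *replicator) replicate(do func() error) (err error) {
	defer func() {
		result := "success"
		if err != nil {
			result = "error"
		}
		r.requests.WithLabelValues(result).Inc()
	}()
	return do()
}
```

With this shape, the failure percentage from the question above falls out of a query over the one metric, e.g. `rate(c{result="error"}[5m]) / rate(c[5m])` for whatever name the counter is registered under, which is exactly the errors/(errors+successes) calculation.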
This commit adds a new replication feature to the Thanos receiver. By default, replication is turned off; however, when replication is enabled (replication factor >= 2), the target node for a time series replicates it to the other nodes concurrently and synchronously. If the replication to >= (rf+1)/2 nodes fails, the original write request is failed.
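As a quick illustration of the threshold arithmetic in that description (the function name is hypothetical, not the handler's actual code):

```go
package receive

// failsQuorum reports whether too many replications failed: per the commit
// message, the write request fails once at least (rf+1)/2 replications fail.
// With integer division, rf=3 tolerates one failed replica, while rf=2 and
// rf=1 tolerate none.
func failsQuorum(failures, replicationFactor int) bool {
	return failures >= (replicationFactor+1)/2
}
```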
Very slick 👍 And the tests are so simple, really love it!
### Replication

The Thanos receiver supports replication of received time-series to other receivers in the same hashring. The replication factor is controlled by setting a flag on the receivers and indicates the maximum number of copies of any time-series that should be stored in the hashring. If any time-series in a write request received by a Thanos receiver is not successfully written to at least `(REPLICATION_FATOR + 1)/2` nodes, the receiver responds with an error. For example, to attempt to store 3 copies of every time-series and ensure that every time-series is successfully written to at least 2 Thanos receivers in the target hashring, all receivers should be configured with the following flag:
Suggested change:
-The Thanos receiver supports replication of received time-series to other receivers in the same hashring. The replication factor is controlled by setting a flag on the receivers and indicates the maximum number of copies of any time-series that should be stored in the hashring. If any time-series in a write request received by a Thanos receiver is not successfully written to at least `(REPLICATION_FATOR + 1)/2` nodes, the receiver responds with an error. For example, to attempt to store 3 copies of every time-series and ensure that every time-series is successfully written to at least 2 Thanos receivers in the target hashring, all receivers should be configured with the following flag:
+The Thanos receiver supports replication of received time-series to other receivers in the same hashring. The replication factor is controlled by setting a flag on the receivers and indicates the maximum number of copies of any time-series that should be stored in the hashring. If any time-series in a write request received by a Thanos receiver is not successfully written to at least `(REPLICATION_FACTOR + 1)/2` nodes, the receiver responds with an error. For example, to attempt to store 3 copies of every time-series and ensure that every time-series is successfully written to at least 2 Thanos receivers in the target hashring, all receivers should be configured with the following flag:
This commit makes a small fix to the spelling of `REPLICATION_FACTOR` as suggested by bwplotka in #1270 (comment).
This commit adds a new replication feature to the Thanos receiver. By default, replication is turned off; however, when replication is enabled (replication factor >= 2), the target node for a time series replicates it to the other nodes concurrently and synchronously. If the replication to >= (rf+1)/2 nodes fails, the original write request is failed.
Changes
Verification
cc @brancz @bwplotka @metalmatze