[coordinator] Return HTTP 400 if all sent samples encounter too old/new or other bad request error #1692

robskillington · 2019-06-01T09:32:45Z

What this PR does / why we need it:

This should fix issue #1709.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE

robskillington · 2019-06-01T09:34:07Z

Need to add tests still.

arnikola · 2019-06-03T17:11:29Z

src/query/api/v1/handler/prometheus/remote/write.go

+		for _, err := range errs {
+			switch {
+			case client.IsBadRequestError(err):
+				fallthrough


Might be worth having a different output error here if it's bad requests vs invalid params specifically?

We need both to be HTTP 400, since local aggregation can also cause a "too old" error, which is an invalid params error rather than a client bad request error.

I haven't changed the error actually returned to the client; it will always be the "last error".

prateek · 2019-06-06T16:12:47Z

src/cmd/services/m3coordinator/ingest/write.go


 	Storage() storage.Storage
 }

+// BatchError allows for access to individual errors.
+type BatchError interface {
+	error


Considering we’re only returning the last error in this path, can we add a new implementation of multierr which doesn’t accumulate?

So I need to actually look at each error and determine that 100% of them are bad-request-like errors; otherwise, if there was even one real error, the request needs to be retried.

I use Errors() to first establish that all the errors are bad-request-like errors, then I use LastError() just to grab one for logging purposes. I could probably remove LastError() from this interface, to be honest.

prateek

LGTM w/nit

arnikola · 2019-06-06T18:16:55Z

scripts/development/m3_stack/start_m3.sh

@@ -183,7 +186,7 @@ echo "Validating topology"
 echo "Done validating topology"

 echo "Waiting until shards are marked as available"
-ATTEMPTS=10 TIMEOUT=2 retry_with_backoff  \
+ATTEMPTS=100 TIMEOUT=2 retry_with_backoff  \


Yes! Much nicer haha, can we do the same for integration tests?

True yeah, will update the common ones.

codecov · 2019-06-07T19:37:45Z

Codecov Report

Merging #1692 into master will decrease coverage by 14.7%.
The diff coverage is 66.6%.

@@            Coverage Diff            @@
##           master   #1692      +/-   ##
=========================================
- Coverage    71.9%   57.1%   -14.8%     
=========================================
  Files         976     968       -8     
  Lines       81530   81092     -438     
=========================================
- Hits        58693   46378   -12315     
- Misses      18996   31113   +12117     
+ Partials     3841    3601     -240

| Flag | Coverage Δ | |
|---|---|---|
| #aggregator | `70.6% <ø> (-11.9%)` | ⬇️ |
| #cluster | `58.5% <ø> (-27.3%)` | ⬇️ |
| #collector | `63.9% <ø> (ø)` | ⬆️ |
| #dbnode | `72.2% <100%> (-7.9%)` | ⬇️ |
| #m3em | `52.3% <ø> (-21%)` | ⬇️ |
| #m3ninx | `61% <ø> (-13.2%)` | ⬇️ |
| #m3nsch | `78% <ø> (+26.8%)` | ⬆️ |
| #metrics | `17.6% <ø> (ø)` | ⬆️ |
| #msg | `74.7% <ø> (-0.2%)` | ⬇️ |
| #query | `32.2% <65%> (-34.2%)` | ⬇️ |
| #x | `68.3% <71.4%> (-17.2%)` | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 294ce46...2a7382b. Read the comment docs.

robskillington added 2 commits June 1, 2019 11:31

[coordinator] Return 400 when errors are all due to samples too old/new or bad request

e36a103

Reorder import

98c1909

Defer appender

cf94bf8

robskillington force-pushed the r/return-400-on-bad-request-writes branch from 32bf1e0 to cf94bf8 Compare June 1, 2019 15:41

arnikola reviewed Jun 3, 2019

robskillington added 2 commits June 6, 2019 01:41

Add integration test for checking for HTTP status 400 from prom remote write on too far in past

abe6682

Use LastError even for multi appender

f8f76c0

prateek reviewed Jun 6, 2019

prateek approved these changes Jun 6, 2019

Check that the status code is 400 more reliably

43c5a45

arnikola reviewed Jun 6, 2019

robskillington added 2 commits June 6, 2019 14:53

Use bridged network instead of network host

7d8927a

Merge branch 'master' into r/return-400-on-bad-request-writes

8b64f70

robskillington mentioned this pull request Jun 6, 2019

Non-retryable Prometheus remote write errors should return 400 #1709

Closed

robskillington added 8 commits June 6, 2019 15:21

Detect the docker network using network ls

6eb9e76

Fix unit tests

563e9c4

Merge branch 'master' into r/return-400-on-bad-request-writes

d835626

Fix bootstrapped timeout in docker integration tests

ee42b59

Fix remote write unit tests

df14299

Merge branch 'master' into r/return-400-on-bad-request-writes

7e8884c

Add verbose logging and batch errors breakdown

496a9b2

Ensure command success and does not fail test

4291c1e

robskillington changed the title ~~[coordinator] Return 400 if all sent samples too old/new or bad request in some way~~ [coordinator] Return HTTP 400 if all sent samples encounter too old/new or other bad request error Jun 7, 2019

Fix metrics for response status code

2a7382b

robskillington added 4 commits June 7, 2019 16:14

Fix test

a27d20a

Merge branch 'master' into r/return-400-on-bad-request-writes

e3fbb3d

Bump retries for carbon docker integration tests

4b61aac

Merge branch 'r/return-400-on-bad-request-writes' of github.com:m3db/m3 into r/return-400-on-bad-request-writes

4c0dd4a

robskillington merged commit 78d138a into master Jun 7, 2019

robskillington deleted the r/return-400-on-bad-request-writes branch June 7, 2019 20:54
