Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More bucketing of vterrors #3134

Merged
merged 7 commits into from
Sep 7, 2017
Merged

Conversation

bbeaudreault
Copy link
Contributor

@bbeaudreault bbeaudreault commented Sep 5, 2017

@sougou I went on another spree. This time I think I got almost all likely or close-to-likely mysql error codes. I additionally went through query_engine code paths and tried to change everything I could find to an appropriate vterror code. Most ended up being INVALID_ARGUMENT

Copy link
Contributor

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!
Also pinging @alainjobart as second pair of eyes.

@@ -338,7 +337,7 @@ func analyzeInsertNoType(ins *sqlparser.Insert, plan *Plan, table *schema.Table)
for _, col := range ins.Columns {
colIndex := table.FindColumn(col)
if colIndex == -1 {
return nil, fmt.Errorf("column %v not found in table %s", col, table.Name)
return nil, vterrors.Errorf(vtrpcpb.Code_NOT_FOUND, "column %v not found in table %s", col, table.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should still be invalid argument (same comment below). Not Found is more for missing data. I see that it's also misused in other places, which we should eventually fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I had that thought as well when I've encountered elsewhere. I'll do a quick scan for NOT_FOUNDs and fix

@bbeaudreault
Copy link
Contributor Author

bbeaudreault commented Sep 5, 2017

Up to you guys how closely you look at the bucketing I did of the MySQL errno's. It was rather painstaking and I made a judgement call on each, and I think anything is better than UNKNOWN. We can re-bucket if anything becomes controversial.

@alainjobart
Copy link
Contributor

alainjobart commented Sep 5, 2017

LGTM
This is really cool, thank you for taking the time to do this! I agree we can move errors around if for some reason some of them are not right.

Approved with PullApprove

case mysql.EROptionPreventsStatement:
// Special-case this error code. It's probably because
// there was a failover and there are old clients still connected.
if strings.Contains(errstr, "read-only") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which error did we return in the implicit "else" branch before your change?

Do I see it correctly, that you're always returning FAILED_PRECONDITION now if the MySQL error code is EROptionPreventsStatement?

I think this may be a regression.

It looks like 1290 is another MySQL code which was repurposed by the internal Google MySQL patches? When I google for "mysql error 1290", results always talk about "secure-file-priv" and never about running in read-only mode. Pinging @pivanof as well here.

If that's the case, I think we need to keep this if and add an else here like you did for access denied below (another MySQL error repurposed by Google patches).

You may wonder why this is relevant: The vtgate buffer gets triggered based on the Vitess error code (FAILED_PRECONDITION) and the MySQL error number. See: https://github.com/youtube/vitess/blob/master/go/vt/vtgate/buffer/buffer.go#L255 Since we're stripping the MySQL error message to protect PII, we can no longer check for a "read-only" string in the error message in vtgate. As a consequence, we may detect these "secure-file-priv" problems, the original purpose of the MySQL error, as failover and start buffering.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you search for "mysql read_only error", the first link to StackOverflow quotes: "General error: 1290 The MySQL server is running with the --read-only option so it cannot execute this statement". We didn't repurpose 1290 for read_only -- it's the native error code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michael-berlin you can see the original code here: https://github.com/youtube/vitess/blob/master/go/vt/vttablet/tabletserver/tabletserver.go#L1338

The default value for errCode is UNKNOWN. We returned FAILED_PRECONDITION only for the read-only case. My change makes us return FAILED_PRECONDITION in all cases. I think this makes sense.

ERSpecifiedAccessDenied (1227) was repurposed for google, and I kept that logic in tact.

Copy link
Contributor Author

@bbeaudreault bbeaudreault Sep 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, the diff lined things up weirdly, because I also reordered the blocks. See that EROptionPreventsStatement still returns FAILED_PRECONDITION by searching for EROptionPreventsStatement in the diff and see which switch case it matches against.

The diff makes it look like I changed that to RESOURCE_EXHAUSTED because of the line numbers. But it was just moved down a few lines and still returns FAILED_PRECONDITION

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code you linked to in buffer.go also expected FAILED_PRECONDITION

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I wasn't clear enough here.

Please readd the if here i.e. if read-only, then FAILED_PRECONDITION, otherwise any other error code (which one makes sense?).

If we don't do that, then non read-only errors can trigger the vtgate buffer. This is bad because the purpose of the vtgate buffer is to let MASTER requests survive master failovers by detecting them, stalling pending and future requests and retrying them when the failover is done. In case of a non read-only error, we will not observe the failover end and instead buffering will stop automatically after the timeout (20 seconds by default).

When do non read-only 1290 errors occur? Apparently when you send LOAD queries and try to access a file outside the path specified by MySQL's --secure_file_priv. That means the current version would allow people to provoke buffering and stalling requests although no failover is in progress. (Note that buffering is turned off by default and must be explicitly enabled.)

If you're saying, we should fix the buffering trigger instead, I agree with you. The problem is that the MySQL error message itself is stripped at that point i.e. vtgate cannot find out if it's a read-only error or not. Instead, we only see 1290 and FAILED_PRECONDITION. Therefore, we must not see FAILED_PRECONDITION if it wasn't a read-only error.

In general, we're confined by the Vitess error space here. We do this on purpose to be consistent with the Google canonical error codes, i.e. we do not want to add a new error code which says "this error should trigger buffering". As an alternative, I think we can add code in vttablet which would determine an error as a buffer trigger and add a special string in the error message as marker after the MySQL error message was possibly stripped. But that's a different pull request for another day ;-)

Note that even if we had not this weird dependency on the buffering trigger code, I think FAILED_PRECONDITION for all 1290 errors would not be correct. FAILED_PRECONDITION allows to restart things at a higher level e.g. a transaction from the beginning again. We mostly use it to signal vtgate that a REPLICA vttablet shut down and vtgate should try a different REPLICA instead. This is not applicable for the secure_file_priv error: The user tries to access data which it isn't allowed to and this shouldn't change when they use a different task.

Does all this make more sense now?

If so, can you recommend an error code for the non read-only errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense. I'll limit to just the read-only errors, and keep the rest as UNKNOWN for now

@bbeaudreault
Copy link
Contributor Author

I fixed the regression, let me know if there are any other comments.

Also shard 2 keeps failing after 30+ minutes and the errors appear to be mysql server related in vtgate. Any thoughts?

@bbeaudreault
Copy link
Contributor Author

My other unrelated PR is also having the same issue. I'm thinking it's an issue with travis? #3139

@michael-berlin
Copy link
Contributor

I ran mysql_server_test.py locally and it passes.

However, the unit test in go/vt/vtgate/planbuilder/ is failing. Can you please have a look?

$ go test
# Invalid value in IN clause
"select id from user where id in (18446744073709551616, 1)"
"could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"

# Invalid value in IN clause from LHS of join
"select u1.id from user u1 join user u2 where u1.id = 18446744073709551616"
"could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"

# Invalid value in IN clause from RHS of join
"select u1.id from user u1 join user u2 where u2.id = 18446744073709551616"
"could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"

# insert unsharded, invalid value for auto-inc
"insert into unsharded_auto(id, val) values(18446744073709551616, 'aa')"
"could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"

# insert for non-vindex autoinc, invalid value
"insert into user_extra(nonid, extra_id) values (2, 18446744073709551616)"
"could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"

# insert invalid index value
"insert into music_extra(music_id, user_id) values(1, 18446744073709551616)"
"could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"

# replace unsharded, invalid value for auto-inc
"replace into unsharded_auto(id, val) values(18446744073709551616, 'aa')"
"could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"

--- FAIL: TestPlan (0.03s)
	plan_test.go:166: File: wireup_cases.txt, Line:493
		could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
		strconv.ParseUint: parsing "18446744073709551616": value out of range
	plan_test.go:166: File: wireup_cases.txt, Line:497
		could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
		strconv.ParseUint: parsing "18446744073709551616": value out of range
	plan_test.go:166: File: wireup_cases.txt, Line:501
		could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
		strconv.ParseUint: parsing "18446744073709551616": value out of range
	plan_test.go:166: File: dml_cases.txt, Line:332
		could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
		could not compute value for vindex or auto-inc column: strconv.ParseUint: parsing "18446744073709551616": value out of range
	plan_test.go:166: File: dml_cases.txt, Line:757
		could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
		could not compute value for vindex or auto-inc column: strconv.ParseUint: parsing "18446744073709551616": value out of range
	plan_test.go:166: File: dml_cases.txt, Line:761
		could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
		could not compute value for vindex or auto-inc column: strconv.ParseUint: parsing "18446744073709551616": value out of range
	plan_test.go:166: File: dml_cases.txt, Line:842
		could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
		could not compute value for vindex or auto-inc column: strconv.ParseUint: parsing "18446744073709551616": value out of range
FAIL
exit status 1
FAIL	github.com/youtube/vitess/go/vt/vtgate/planbuilder	0.082s

@bbeaudreault
Copy link
Contributor Author

Sorry I missed that in all the spam. It's fixed now. I can't run the full suite of tests locally so I generally rely on travis to run them. Are the constantly failing mysql tests causing other tests to not run? That might be concerning, maybe we should disable them if there's no lead as to the problem.

@michael-berlin
Copy link
Contributor

make unit_test is passing now, thanks. Let me try to import this change internally now. If that succeeds, I'll merge it.

In the Travis log you can see that "unit_test" is reported as timed out by our test.go test runner. After that, something seems to hang in the test runner (it doesn't print the summary) and therefore Travis kills the job eventually.

re: running tests. If you have Docker installed on your system, you can run any test locally via Docker. Example for "unit_test": $ ./test.go unit

re: mysql_server_test.py: If it's flaky, let's deflake it. It's probably just missing reasonable "wait 2 minutes for this binary to be available" checks.

@@ -282,13 +284,15 @@ func newIntegralNumeric(v Value) (result numeric, err error) {
case v.IsSigned():
result.ival, err = strconv.ParseInt(str, 10, 64)
if err != nil {
err = vterrors.Errorf(vtrpcpb.Code_INVALID_ARGUMENT, "%v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internal tools show that this line and line 295 is currently not covered by tests.

Can you please add cases in "TestNewIntegralNumeric" in arithmetic_test.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about that? I ran coverage against this branch and it shows it is covered. Maybe you ran prior to one of my recent commits, which changed the wording of the error? These two lines are covered by the two tests here https://github.com/youtube/vitess/blob/master/go/sqltypes/arithmetic_test.go#L546

The test does not validate the vtrpcpb code, just the error string. If you want me to validate the code I can refactor the test I guess

@michael-berlin
Copy link
Contributor

Merging now because it's already submitted internally.

@bbeaudreault Did you see my comment re: the missing tests? Can you please add them in a separate PR? :)

@michael-berlin michael-berlin merged commit d2f34fc into vitessio:master Sep 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants