-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More bucketing of vterrors #3134
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
Also pinging @alainjobart as second pair of eyes.
@@ -338,7 +337,7 @@ func analyzeInsertNoType(ins *sqlparser.Insert, plan *Plan, table *schema.Table) | |||
for _, col := range ins.Columns { | |||
colIndex := table.FindColumn(col) | |||
if colIndex == -1 { | |||
return nil, fmt.Errorf("column %v not found in table %s", col, table.Name) | |||
return nil, vterrors.Errorf(vtrpcpb.Code_NOT_FOUND, "column %v not found in table %s", col, table.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should still be invalid argument (same comment below). Not Found is more for missing data. I see that it's also misused in other places, which we should eventually fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I had that thought as well when I've encountered elsewhere. I'll do a quick scan for NOT_FOUNDs and fix
1430e43
to
c6483ef
Compare
Up to you guys how closely you look at the bucketing I did of the MySQL errno's. It was rather painstaking and I made a judgement call on each, and I think anything is better than |
case mysql.EROptionPreventsStatement: | ||
// Special-case this error code. It's probably because | ||
// there was a failover and there are old clients still connected. | ||
if strings.Contains(errstr, "read-only") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which error did we return in the implicit "else" branch before your change?
Do I see it correctly, that you're always returning FAILED_PRECONDITION now if the MySQL error code is EROptionPreventsStatement?
I think this may be a regression.
It looks like 1290 is another MySQL code which was repurposed by the internal Google MySQL patches? When I google for "mysql error 1290", results always talk about "secure-file-priv" and never about running in read-only mode. Pinging @pivanof as well here.
If that's the case, I think we need to keep this if and add an else here like you did for access denied below (another MySQL error repurposed by Google patches).
You may wonder why this is relevant: The vtgate buffer gets triggered based on the Vitess error code (FAILED_PRECONDITION) and the MySQL error number. See: https://github.com/youtube/vitess/blob/master/go/vt/vtgate/buffer/buffer.go#L255 Since we're stripping the MySQL error message to protect PII, we can no longer check for a "read-only" string in the error message in vtgate. As a consequence, we may detect these "secure-file-priv" problems, the original purpose of the MySQL error, as failover and start buffering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you search for "mysql read_only error", the first link to StackOverflow quotes: "General error: 1290 The MySQL server is running with the --read-only option so it cannot execute this statement". We didn't repurpose 1290 for read_only -- it's the native error code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michael-berlin you can see the original code here: https://github.com/youtube/vitess/blob/master/go/vt/vttablet/tabletserver/tabletserver.go#L1338
The default value for errCode is UNKNOWN
. We returned FAILED_PRECONDITION
only for the read-only case. My change makes us return FAILED_PRECONDITION
in all cases. I think this makes sense.
ERSpecifiedAccessDenied
(1227) was repurposed for google, and I kept that logic in tact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, the diff lined things up weirdly, because I also reordered the blocks. See that EROptionPreventsStatement
still returns FAILED_PRECONDITION
by searching for EROptionPreventsStatement
in the diff and see which switch case it matches against.
The diff makes it look like I changed that to RESOURCE_EXHAUSTED
because of the line numbers. But it was just moved down a few lines and still returns FAILED_PRECONDITION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code you linked to in buffer.go also expected FAILED_PRECONDITION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I wasn't clear enough here.
Please readd the if
here i.e. if read-only, then FAILED_PRECONDITION
, otherwise any other error code (which one makes sense?).
If we don't do that, then non read-only errors can trigger the vtgate buffer. This is bad because the purpose of the vtgate buffer is to let MASTER requests survive master failovers by detecting them, stalling pending and future requests and retrying them when the failover is done. In case of a non read-only error, we will not observe the failover end and instead buffering will stop automatically after the timeout (20 seconds by default).
When do non read-only 1290 errors occur? Apparently when you send LOAD queries and try to access a file outside the path specified by MySQL's --secure_file_priv
. That means the current version would allow people to provoke buffering and stalling requests although no failover is in progress. (Note that buffering is turned off by default and must be explicitly enabled.)
If you're saying, we should fix the buffering trigger instead, I agree with you. The problem is that the MySQL error message itself is stripped at that point i.e. vtgate cannot find out if it's a read-only error or not. Instead, we only see 1290 and FAILED_PRECONDITION
. Therefore, we must not see FAILED_PRECONDITION
if it wasn't a read-only error.
In general, we're confined by the Vitess error space here. We do this on purpose to be consistent with the Google canonical error codes, i.e. we do not want to add a new error code which says "this error should trigger buffering". As an alternative, I think we can add code in vttablet which would determine an error as a buffer trigger and add a special string in the error message as marker after the MySQL error message was possibly stripped. But that's a different pull request for another day ;-)
Note that even if we had not this weird dependency on the buffering trigger code, I think FAILED_PRECONDITION
for all 1290 errors would not be correct. FAILED_PRECONDITION
allows to restart things at a higher level e.g. a transaction from the beginning again. We mostly use it to signal vtgate that a REPLICA vttablet shut down and vtgate should try a different REPLICA instead. This is not applicable for the secure_file_priv
error: The user tries to access data which it isn't allowed to and this shouldn't change when they use a different task.
Does all this make more sense now?
If so, can you recommend an error code for the non read-only errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that makes sense. I'll limit to just the read-only errors, and keep the rest as UNKNOWN
for now
c6483ef
to
e2e17af
Compare
I fixed the regression, let me know if there are any other comments. Also shard 2 keeps failing after 30+ minutes and the errors appear to be mysql server related in vtgate. Any thoughts? |
My other unrelated PR is also having the same issue. I'm thinking it's an issue with travis? #3139 |
…nto appropriate vtrpc code
982c092
to
2223e6c
Compare
I ran However, the unit test in $ go test
# Invalid value in IN clause
"select id from user where id in (18446744073709551616, 1)"
"could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"
# Invalid value in IN clause from LHS of join
"select u1.id from user u1 join user u2 where u1.id = 18446744073709551616"
"could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"
# Invalid value in IN clause from RHS of join
"select u1.id from user u1 join user u2 where u2.id = 18446744073709551616"
"could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"
# insert unsharded, invalid value for auto-inc
"insert into unsharded_auto(id, val) values(18446744073709551616, 'aa')"
"could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"
# insert for non-vindex autoinc, invalid value
"insert into user_extra(nonid, extra_id) values (2, 18446744073709551616)"
"could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"
# insert invalid index value
"insert into music_extra(music_id, user_id) values(1, 18446744073709551616)"
"could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"
# replace unsharded, invalid value for auto-inc
"replace into unsharded_auto(id, val) values(18446744073709551616, 'aa')"
"could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range"
--- FAIL: TestPlan (0.03s)
plan_test.go:166: File: wireup_cases.txt, Line:493
could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
strconv.ParseUint: parsing "18446744073709551616": value out of range
plan_test.go:166: File: wireup_cases.txt, Line:497
could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
strconv.ParseUint: parsing "18446744073709551616": value out of range
plan_test.go:166: File: wireup_cases.txt, Line:501
could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
strconv.ParseUint: parsing "18446744073709551616": value out of range
plan_test.go:166: File: dml_cases.txt, Line:332
could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
could not compute value for vindex or auto-inc column: strconv.ParseUint: parsing "18446744073709551616": value out of range
plan_test.go:166: File: dml_cases.txt, Line:757
could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
could not compute value for vindex or auto-inc column: strconv.ParseUint: parsing "18446744073709551616": value out of range
plan_test.go:166: File: dml_cases.txt, Line:761
could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
could not compute value for vindex or auto-inc column: strconv.ParseUint: parsing "18446744073709551616": value out of range
plan_test.go:166: File: dml_cases.txt, Line:842
could not compute value for vindex or auto-inc column: could not create integral from val 18446744073709551616: strconv.ParseUint: parsing "18446744073709551616": value out of range, want
could not compute value for vindex or auto-inc column: strconv.ParseUint: parsing "18446744073709551616": value out of range
FAIL
exit status 1
FAIL github.com/youtube/vitess/go/vt/vtgate/planbuilder 0.082s
|
Sorry I missed that in all the spam. It's fixed now. I can't run the full suite of tests locally so I generally rely on travis to run them. Are the constantly failing mysql tests causing other tests to not run? That might be concerning, maybe we should disable them if there's no lead as to the problem. |
eff3b45
to
f12fd15
Compare
In the Travis log you can see that "unit_test" is reported as timed out by our test.go test runner. After that, something seems to hang in the test runner (it doesn't print the summary) and therefore Travis kills the job eventually. re: running tests. If you have Docker installed on your system, you can run any test locally via Docker. Example for "unit_test": re: mysql_server_test.py: If it's flaky, let's deflake it. It's probably just missing reasonable "wait 2 minutes for this binary to be available" checks. |
@@ -282,13 +284,15 @@ func newIntegralNumeric(v Value) (result numeric, err error) { | |||
case v.IsSigned(): | |||
result.ival, err = strconv.ParseInt(str, 10, 64) | |||
if err != nil { | |||
err = vterrors.Errorf(vtrpcpb.Code_INVALID_ARGUMENT, "%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Internal tools show that this line and line 295 is currently not covered by tests.
Can you please add cases in "TestNewIntegralNumeric" in arithmetic_test.go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure about that? I ran coverage against this branch and it shows it is covered. Maybe you ran prior to one of my recent commits, which changed the wording of the error? These two lines are covered by the two tests here https://github.com/youtube/vitess/blob/master/go/sqltypes/arithmetic_test.go#L546
The test does not validate the vtrpcpb code, just the error string. If you want me to validate the code I can refactor the test I guess
Merging now because it's already submitted internally. @bbeaudreault Did you see my comment re: the missing tests? Can you please add them in a separate PR? :) |
@sougou I went on another spree. This time I think I got almost all likely or close-to-likely mysql error codes. I additionally went through query_engine code paths and tried to change everything I could find to an appropriate vterror code. Most ended up being
INVALID_ARGUMENT