v3: normalizer: smarter dedup #3262

Merged (2 commits) on Sep 30, 2017
Conversation

sougou (Contributor) commented on Sep 30, 2017

Based on real-life workloads, we found that it may not be a good
idea to dedup all values. Specifically, values used in DMLs like
INSERT should not be deduped; they end up polluting the plan cache
with all kinds of combinations.

With this change, values are deduped only if they appear within
SELECTs. The deduping still happens for subqueries within DMLs,
while the DML parts themselves are not deduped.

Additionally, it doesn't make sense to spend the effort deduping
values that are too long. So I've added a check: if a value is
longer than 256 bytes, we blindly create a new bind var.
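
Roughly, the new policy can be sketched like this (illustrative only, not
the actual diff in this PR; the inSelect flag and the convertSQLValNoDedup
helper are assumed names):

// Sketch: dedup is attempted only for values inside a SELECT, and only
// when the value is short enough for the lookup to be worth it.
func (nz *normalizer) convertSQLVal(node *SQLVal, inSelect bool) {
	// Values longer than 256 bytes always get a fresh bind var.
	if !inSelect || len(node.Val) > 256 {
		nz.convertSQLValNoDedup(node) // assumed helper: always creates a new bind var
		return
	}
	nz.convertSQLValDedup(node) // reuses an existing bind var when the value repeats
}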

sougou requested a review from demmer on September 30, 2017 05:16
demmer (Member) left a comment:

Overall this looks good -- I have a few comments on style and one general point about propagating errors out of the normalizer that applies both before and after this change.

// Modify RHS to be a list bindvar.
node.Right = ListArg(append([]byte("::"), bvname...))
nz := newNormalizer(stmt, bindVars, prefix)
_ = Walk(nz.WalkStatement, stmt)
demmer (Member):

Unrelated to this particular change, but it seems like we should be propagating errors out of the Normalize traversal instead of silently swallowing them?

For PII reasons we actually do depend on normalize working properly, so it would be better for us to fail the query with an error than let it go through unnormalized.

sougou (Contributor, Author):

I looked at the error conditions. They are all for invalid inputs, which will be caught downstream. So, it's better that the normalizer ignores them and completes everything else. That way, any other PII would still have been converted to bind vars.
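
For reference, a minimal sketch of the propagation being suggested, assuming
Normalize wraps the Walk call shown in the snippet above and keeps its current
parameters (this is the proposed alternative, not what this PR does):

// Sketch: return the traversal error so the caller can fail the query
// instead of letting it go through unnormalized.
func Normalize(stmt Statement, bindVars map[string]*querypb.BindVariable, prefix string) error {
	nz := newNormalizer(stmt, bindVars, prefix)
	return Walk(nz.WalkStatement, stmt)
}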

}

-func sqlToBindvar(node SQLNode) *querypb.BindVariable {
+func (nz *normalizer) sqlToBindvar(node SQLNode) *querypb.BindVariable {
demmer (Member):

Following up on the above comment re error propagation, this should also return an error if the value should have been converted to a bind var but failed for some reason.
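
A hedged sketch of that shape (convertLiteral is a hypothetical stand-in for
whatever conversion step can fail, and the fmt import is omitted):

// Sketch: report a failed conversion instead of silently skipping it.
func (nz *normalizer) sqlToBindvar(node SQLNode) (*querypb.BindVariable, error) {
	val, ok := node.(*SQLVal)
	if !ok {
		// Not a literal: nothing to convert, and not an error.
		return nil, nil
	}
	bv, err := convertLiteral(val) // hypothetical conversion step that may fail
	if err != nil {
		// A value that should have been normalized but wasn't: surface it.
		return nil, fmt.Errorf("normalizer: cannot convert %s: %v", String(val), err)
	}
	return bv, nil
}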

}

func (nz *normalizer) convertSQLValDedup(node *SQLVal) {
	// If value is too long, don't dedup.
demmer (Member):

Is the reason for this that the comparison is likely to be CPU-expensive? If so, we should add an additional comment indicating why.

sougou (Contributor, Author):

Added.

// and iterate on converting each individual value into separate
// bind vars.
func (nz *normalizer) convertComparison(node *ComparisonExpr) {
	switch node.Operator {
demmer (Member):

I'm not sure what the standard Go style is, but I would find this particular bit easier to read as:

if node.Operator != InStr && node.Operator != NotInStr {
	return
}

sougou (Contributor, Author):

Changed. It might have been my bias against double negations.

	default:
		return
	}
	// It's either IN or NOT IN.
demmer (Member):

If you make the above change, then this comment would be less necessary.
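
Roughly how the early-return shape reads after that change (sketch only, with
the body elided and the operator constants named as in the suggestion above):

// Sketch: with the guard clause, the operator is known to be IN or NOT IN
// for the rest of the function, so the trailing comment becomes redundant.
func (nz *normalizer) convertComparison(node *ComparisonExpr) {
	if node.Operator != InStr && node.Operator != NotInStr {
		return
	}
	// ... rest of the conversion (see the diff context above) ...
}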

sougou merged commit 326d4bc into vitessio:master on Sep 30, 2017
sougou deleted the normalizer branch on October 13, 2017 04:07