-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
diff: split bucket by random if the bucket's count is twice more than chunk-size #256
diff: split bucket by random if the bucket's count is twice more than chunk-size #256
Conversation
+------+-------+ | ||
|
||
FIXME: TiDB now don't return rand value when use `ORDER BY RAND()` | ||
mysql> SELECT `id` FROM (SELECT `id`, rand() rand_value FROM `test`.`test` WHERE `id` COLLATE "latin1_bin" > 0 AND `id` COLLATE "latin1_bin" < 100 ORDER BY rand_value LIMIT 5) rand_tmp ORDER BY `id` COLLATE "latin1_bin"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
old sql will not select random value in tidb, update it to suggest sql, pingcap/tidb#9033
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any test for it? this kind of mistake is too bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you means add test in tidb or in diff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would make diff
tools low availability, so how can we quickly find this problem next time? I understand that there is a lack of performance assessment.
pkg/diff/chunk.go
Outdated
/* for example: | ||
there is a bucket in TiDB, and the lowerbound and upperbound are (v1, v3), (v2, v4), and the columns are `a` and `b`, | ||
this bucket's data range is (a > v1 or (a == v1 and b >= v2)) and (a < v3 or (a == v3 and a <= v4)), | ||
not (a >= v1 and a <= v3 and b >= v2 and b <= v4) | ||
this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and a <= v4)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the old comment is wrong, v1 and v2 is column a's range, v3 and v4 is column b's range. And I change the chunk's range's expression, so need change this example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the old implementation wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, just comment is wrong.
pkg/diff/chunk.go
Outdated
/* for example: | ||
there is a bucket in TiDB, and the lowerbound and upperbound are (v1, v3), (v2, v4), and the columns are `a` and `b`, | ||
this bucket's data range is (a > v1 or (a == v1 and b >= v2)) and (a < v3 or (a == v3 and a <= v4)), | ||
not (a >= v1 and a <= v3 and b >= v2 and b <= v4) | ||
this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and a <= v4)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and a <= v4)), | |
this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and b <= v4)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
var chunks []*ChunkRange | ||
|
||
// splitRangeByRandom splits a chunk to multiple chunks by random | ||
func splitRangeByRandom(db *sql.DB, chunk *ChunkRange, count int, schema string, table string, columns []*model.ColumnInfo, limits, collation string) (chunks []*ChunkRange, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope you can test the efficiency of splitting under various cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because I don't know whether it's better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I will do a performance test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do a test, table with one hundred million rows, had 204 buckets in tidb's statistical information.
set the chunk-size to 1000, diff get random value 209 times, and total cost 1m50s to split chunk.
71837cb
to
7ce59e5
Compare
/run-all-tests |
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What problem does this PR solve?
fix issue: https://internal.pingcap.net/jira/browse/TOOL-1372
tidb has a max value for bucket's num, so if one table have too many rows, one bucket will contains many rows, and may cause oom when select data.
What is changed and how it works?
Check List
Tests