diff: split bucket by random if the bucket's count is twice more than chunk-size #256

Merged: 27 commits, Oct 10, 2019

WangXiangUSTC
Contributor

What problem does this PR solve?

Fixes issue: https://internal.pingcap.net/jira/browse/TOOL-1372
TiDB caps the number of buckets in a table's statistics, so when a table has too many rows, a single bucket covers a large number of rows, and selecting that bucket's data may cause an OOM.

What is changed and how it works?

  1. Refine the random splitter so that it uses the same range expression as the bucket splitter.
  2. When a bucket's row count is more than twice the chunk-size, use the random split function to split it.
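
A minimal sketch of the rule in item 2, for illustration only. `chunksForBucket` is a hypothetical helper, not a function from this PR, and the exact threshold comparison (`>=` versus `>`) and the rounding are assumptions:

```go
package main

import "fmt"

// chunksForBucket is a hypothetical helper (not part of this PR). It returns
// how many chunks a bucket holding rowCount rows should be split into, or 1
// when the bucket is small enough to be checked as a single chunk.
// Assumption: a bucket is split once rowCount >= 2*chunkSize.
func chunksForBucket(rowCount, chunkSize int64) int64 {
	if chunkSize <= 0 || rowCount < 2*chunkSize {
		return 1
	}
	// Round up so that the average chunk does not exceed chunkSize rows.
	return (rowCount + chunkSize - 1) / chunkSize
}

func main() {
	fmt.Println(chunksForBucket(1500, 1000))   // below the threshold, stays one chunk
	fmt.Println(chunksForBucket(490196, 1000)) // large bucket, split into many chunks
}
```

In the PR itself the split points come from randomly sampled rows inside the bucket's range, so the resulting chunks are only approximately equal in size.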

Check List

Tests

  • Unit test

+------+-------+

FIXME: TiDB now don't return rand value when use `ORDER BY RAND()`
mysql> SELECT `id` FROM (SELECT `id`, rand() rand_value FROM `test`.`test` WHERE `id` COLLATE "latin1_bin" > 0 AND `id` COLLATE "latin1_bin" < 100 ORDER BY rand_value LIMIT 5) rand_tmp ORDER BY `id` COLLATE "latin1_bin";

Contributor Author

The old SQL does not select random values in TiDB, so I updated it to the suggested SQL; see pingcap/tidb#9033.

Collaborator

@IANTHEREAL Aug 2, 2019

Is there any test for it? This kind of mistake is really bad.

Contributor Author

Do you mean adding a test in TiDB or in diff?

Collaborator

It makes the diff tool much less usable, so how can we catch this kind of problem quickly next time? As I understand it, we are missing a performance assessment.

/* for example:
there is a bucket in TiDB, and the lowerbound and upperbound are (v1, v3), (v2, v4), and the columns are `a` and `b`,
this bucket's data range is (a > v1 or (a == v1 and b >= v2)) and (a < v3 or (a == v3 and a <= v4)),
not (a >= v1 and a <= v3 and b >= v2 and b <= v4)
this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and a <= v4)),

Collaborator

why change it?

Contributor Author

The old comment is wrong: v1 and v2 are column a's range, and v3 and v4 are column b's range. I also changed the chunk's range expression, so this example needs to change as well.

Collaborator

Is the old implementation wrong?

Contributor Author

No, only the comment is wrong.

/* for example:
there is a bucket in TiDB, and the lowerbound and upperbound are (v1, v3), (v2, v4), and the columns are `a` and `b`,
this bucket's data range is (a > v1 or (a == v1 and b >= v2)) and (a < v3 or (a == v3 and a <= v4)),
not (a >= v1 and a <= v3 and b >= v2 and b <= v4)
this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and a <= v4)),

Collaborator

Suggested change
- this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and a <= v4)),
+ this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and b <= v4)),

Contributor Author

fixed
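
For reference, the corrected range in the comment is a lexicographic (row-wise) comparison over the columns. A minimal sketch of how such a condition could be generated, assuming the lower bound is exclusive and the upper bound is inclusive as in the example above. `bound` and `rangeCondition` are hypothetical names, not the functions in pkg/diff/chunk.go, and the values are spliced in as plain strings only to keep the output readable:

```go
package main

import (
	"fmt"
	"strings"
)

// bound builds a lexicographic comparison over cols against vals:
// (c0 op v0) OR (c0 = v0 AND c1 op v1) OR ...
// The last column uses lastOp so the caller can make the bound inclusive.
// Precondition (assumed): len(cols) == len(vals).
func bound(cols, vals []string, op, lastOp string) string {
	var ors []string
	for i := range cols {
		var ands []string
		// All earlier columns must be equal to their bound values.
		for j := 0; j < i; j++ {
			ands = append(ands, fmt.Sprintf("%s = %s", cols[j], vals[j]))
		}
		cmp := op
		if i == len(cols)-1 {
			cmp = lastOp
		}
		ands = append(ands, fmt.Sprintf("%s %s %s", cols[i], cmp, vals[i]))
		ors = append(ors, "("+strings.Join(ands, " AND ")+")")
	}
	return "(" + strings.Join(ors, " OR ") + ")"
}

// rangeCondition combines an exclusive lower bound with an inclusive upper bound.
func rangeCondition(cols, lower, upper []string) string {
	return bound(cols, lower, ">", ">") + " AND " + bound(cols, upper, "<", "<=")
}

func main() {
	cols := []string{"a", "b"}
	// Lower bound (v1, v3), upper bound (v2, v4), as in the comment above.
	fmt.Println(rangeCondition(cols, []string{"v1", "v3"}, []string{"v2", "v4"}))
}
```

Real code should emit `?` placeholders and pass the bound values as query arguments instead of formatting them into the SQL string.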

var chunks []*ChunkRange

// splitRangeByRandom splits a chunk to multiple chunks by random
func splitRangeByRandom(db *sql.DB, chunk *ChunkRange, count int, schema string, table string, columns []*model.ColumnInfo, limits, collation string) (chunks []*ChunkRange, err error) {

Collaborator

I hope you can test the efficiency of splitting under various cases.

Collaborator

because I don't know whether it's better

Contributor Author

ok, I will do a performance test

Contributor Author

I ran a test on a table with one hundred million rows, which had 204 buckets in TiDB's statistics.
With chunk-size set to 1000, diff fetched random values 209 times, and splitting the chunks took 1m50s in total.

@WangXiangUSTC
Contributor Author

/run-all-tests

@IANTHEREAL
Collaborator

LGTM

Member

@csuzhangxc left a comment

LGTM

@WangXiangUSTC WangXiangUSTC merged commit 3b04f08 into pingcap:master Oct 10, 2019
@WangXiangUSTC WangXiangUSTC deleted the xiang/split_buckey branch October 10, 2019 02:22