diff: split bucket by random if the bucket's count is twice more than chunk-size #256

WangXiangUSTC · 2019-07-16T06:17:33Z

What problem does this PR solve?

fix issue: https://internal.pingcap.net/jira/browse/TOOL-1372
TiDB caps the number of buckets, so if a table has too many rows, a single bucket will contain many rows, which may cause OOM when selecting data.

What is changed and how it works?

Refine the random splitter so that it uses the same range expression as the bucket splitter.
When a bucket's count is more than twice the chunk-size, use the random split function to split it.
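
The rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `needRandomSplit` and `randomSplitCount` are hypothetical, and using `count / chunkSize` as the number of pieces is an assumption made for the example.

```go
package main

// needRandomSplit reports whether a bucket holding count rows is large
// enough (more than twice chunkSize) to be split further by random values.
// Hypothetical helper, not taken from the PR.
func needRandomSplit(count, chunkSize int64) bool {
	return chunkSize > 0 && count > 2*chunkSize
}

// randomSplitCount returns how many pieces a bucket would be cut into.
// Assumption: aim for roughly chunkSize rows per piece; buckets that do
// not need splitting stay as a single piece.
func randomSplitCount(count, chunkSize int64) int64 {
	if !needRandomSplit(count, chunkSize) {
		return 1
	}
	return count / chunkSize
}
```

Under these assumptions, a bucket of 490,000 rows with a chunk-size of 1000 would be cut into 490 pieces, while a bucket of 1,500 rows would be left alone.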

Check List

Tests

Unit test

WangXiangUSTC · 2019-07-16T06:22:27Z

pkg/dbutil/common.go

-		+------+-------+
-
-		FIXME: TiDB now don't return rand value when use `ORDER BY RAND()`
+		mysql> SELECT `id` FROM (SELECT `id`, rand() rand_value FROM `test`.`test`  WHERE `id` COLLATE "latin1_bin" > 0 AND `id` COLLATE "latin1_bin" < 100 ORDER BY rand_value LIMIT 5) rand_tmp ORDER BY `id` COLLATE "latin1_bin";


The old SQL does not select random values in TiDB, so I updated it to the SQL suggested in pingcap/tidb#9033.

Is there any test for it? This kind of mistake is too bad.

Do you mean adding a test in TiDB or in diff?

This would make the diff tool much less available, so how can we catch this kind of problem quickly next time? As I understand it, a performance assessment is lacking.
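
For reference, the shape of the new query can be reproduced with a small builder. This is only a sketch: `buildRandomValueQuery` and its parameters are hypothetical names, the `where` argument is assumed to be a pre-built, trusted condition, and identifiers are not escaped beyond wrapping them in backticks.

```go
package main

import "fmt"

// buildRandomValueQuery assembles a query that picks num random values of
// column inside the given WHERE condition and returns them in sorted order.
// The inner SELECT materializes rand() as rand_value so that ordering by it
// yields a random sample; the outer SELECT sorts the sampled values.
// Hypothetical helper, not the function in pkg/dbutil.
func buildRandomValueQuery(schema, table, column, where, collation string, num int) string {
	col := "`" + column + "`"
	orderCol := col
	if collation != "" {
		orderCol = fmt.Sprintf("%s COLLATE \"%s\"", col, collation)
	}
	tbl := fmt.Sprintf("`%s`.`%s`", schema, table)
	return fmt.Sprintf(
		"SELECT %[1]s FROM (SELECT %[1]s, rand() rand_value FROM %[2]s WHERE %[3]s ORDER BY rand_value LIMIT %[4]d) rand_tmp ORDER BY %[5]s",
		col, tbl, where, num, orderCol,
	)
}
```

A unit test in diff could assert on the generated string, but whether the returned values are actually random can only be checked against a running TiDB or MySQL instance.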

IANTHEREAL · 2019-08-15T04:50:50Z

pkg/diff/chunk.go

 	/* for example:
 	there is a bucket in TiDB, and the lowerbound and upperbound are (v1, v3), (v2, v4), and the columns are `a` and `b`,
-	this bucket's data range is (a > v1 or (a == v1 and b >= v2)) and (a < v3 or (a == v3 and a <= v4)),
-	not (a >= v1 and a <= v3 and b >= v2 and b <= v4)
+	this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and a <= v4)),


Why change it?

The old comment is wrong: v1 and v2 are column a's range, and v3 and v4 are column b's range. I also changed the expression for the chunk's range, so this example needs to change as well.

Is the old implementation wrong?

No, only the comment is wrong.

IANTHEREAL · 2019-08-15T06:04:21Z

pkg/diff/chunk.go

 	/* for example:
 	there is a bucket in TiDB, and the lowerbound and upperbound are (v1, v3), (v2, v4), and the columns are `a` and `b`,
-	this bucket's data range is (a > v1 or (a == v1 and b >= v2)) and (a < v3 or (a == v3 and a <= v4)),
-	not (a >= v1 and a <= v3 and b >= v2 and b <= v4)
+	this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and a <= v4)),


Suggested change

-	this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and a <= v4)),
+	this bucket's data range is (a > v1 or (a == v1 and b > v3)) and (a < v2 or (a == v2 and b <= v4)),
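
The corrected range expression generalizes to any number of index columns as a lexicographic comparison. The sketch below builds it as a string; `boundCondition` and `bucketRange` are hypothetical names, values are passed as already-formatted literals, and the lower bound is treated as exclusive and the upper bound as inclusive, matching the example in the comment.

```go
package main

import (
	"fmt"
	"strings"
)

// boundCondition builds a lexicographic bound over cols. For column i it
// requires all earlier columns to equal their bound values and column i to
// satisfy the operator: strict for leading columns, last for the final one.
// Hypothetical helper, not the PR's implementation.
func boundCondition(cols, vals []string, strict, last string) string {
	parts := make([]string, 0, len(cols))
	for i := range cols {
		terms := make([]string, 0, i+1)
		for j := 0; j < i; j++ {
			terms = append(terms, fmt.Sprintf("%s = %s", cols[j], vals[j]))
		}
		op := strict
		if i == len(cols)-1 {
			op = last
		}
		terms = append(terms, fmt.Sprintf("%s %s %s", cols[i], op, vals[i]))
		if len(terms) == 1 {
			parts = append(parts, terms[0])
		} else {
			parts = append(parts, "("+strings.Join(terms, " AND ")+")")
		}
	}
	return "(" + strings.Join(parts, " OR ") + ")"
}

// bucketRange combines an exclusive lower bound and an inclusive upper bound.
func bucketRange(cols, lower, upper []string) string {
	return boundCondition(cols, lower, ">", ">") + " AND " + boundCondition(cols, upper, "<", "<=")
}
```

Calling `bucketRange([]string{"a", "b"}, []string{"v1", "v3"}, []string{"v2", "v4"})` yields `(a > v1 OR (a = v1 AND b > v3)) AND (a < v2 OR (a = v2 AND b <= v4))`.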

IANTHEREAL · 2019-08-15T06:26:19Z

pkg/diff/chunk.go

-	var chunks []*ChunkRange
-
+// splitRangeByRandom splits a chunk to multiple chunks by random
+func splitRangeByRandom(db *sql.DB, chunk *ChunkRange, count int, schema string, table string, columns []*model.ColumnInfo, limits, collation string) (chunks []*ChunkRange, err error) {


I hope you can test the efficiency of splitting under various cases.

Because I don't know whether it's better.

OK, I will do a performance test.

I ran a test: a table with one hundred million rows had 204 buckets in TiDB's statistics.
With the chunk-size set to 1000, diff fetched random values 209 times, and splitting the chunks took 1m50s in total.
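
Once the random values have been fetched, cutting a range at those values is cheap; the cost measured above is dominated by the queries. A simplified sketch of the cutting step, assuming a single integer column and a hypothetical `chunk` type (the PR's `ChunkRange` handles multiple columns and other value types):

```go
package main

import "sort"

// chunk is the interval (Lower, Upper] on one integer column.
// Simplified stand-in for illustration, not the PR's ChunkRange.
type chunk struct{ Lower, Upper int64 }

// splitBySplitPoints cuts c at each split point and returns the resulting
// consecutive chunks. Points are sorted first; duplicates and points that
// fall outside (Lower, Upper) are skipped so no empty chunk is produced.
func splitBySplitPoints(c chunk, points []int64) []chunk {
	sorted := append([]int64(nil), points...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	chunks := make([]chunk, 0, len(sorted)+1)
	lower := c.Lower
	for _, p := range sorted {
		if p <= lower || p >= c.Upper {
			continue
		}
		chunks = append(chunks, chunk{Lower: lower, Upper: p})
		lower = p
	}
	return append(chunks, chunk{Lower: lower, Upper: c.Upper})
}
```

Because every chunk's upper bound is the next chunk's lower bound, the pieces cover the original range without gaps or overlap.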

WangXiangUSTC · 2019-08-30T02:35:19Z

/run-all-tests

IANTHEREAL · 2019-10-09T10:46:34Z

LGTM

csuzhangxc

LGTM

WangXiangUSTC added 9 commits July 1, 2019 16:54

update get random value function

52d1c96

add random split when chunk is big

a4abcfd

minor fix

b094cdd

refine split by random

06f2ab5

refine random split code

84a7a80

update unit test

7b82abe

clean code

055dd17

remove useless code

1c91b7b

minor update

cb4a7a0

WangXiangUSTC added type/enhancement priority/important status/PTAL labels Jul 16, 2019

IANTHEREAL added 2 commits July 29, 2019 15:24

Merge branch 'master' into xiang/split_buckey

49e7a80

Merge branch 'master' into xiang/split_buckey

cd0813d

IANTHEREAL reviewed Aug 15, 2019

WangXiangUSTC added 2 commits August 15, 2019 14:02

update comment

adef12a

Merge branch 'xiang/split_buckey' of https://github.com/WangXiangUSTC…

d9d4d33

…/tidb-tools into xiang/split_buckey

IANTHEREAL reviewed Aug 15, 2019

fix comment

556e2b2

csuzhangxc reviewed Aug 20, 2019

WangXiangUSTC added 2 commits August 21, 2019 13:17

address comment

6a4717d

add function minLenInSlices

5bcf5bc

csuzhangxc reviewed Aug 21, 2019

WangXiangUSTC and others added 5 commits August 21, 2019 14:13

minor fix

5389e91

add columnOffset in ChunkRange

0ad3dcf

minor fix

319aa0a

diff: update ignore-column config && add integration test (pingcap#248)

21edb7a

diff: clean checkpoint info in several times to avoid `transaction is…

46ec25a

… too large` error in TiDB (pingcap#258)

minor fix

7ce59e5

WangXiangUSTC force-pushed the xiang/split_buckey branch from 71837cb to 7ce59e5 Compare August 21, 2019 08:32

WangXiangUSTC and others added 3 commits August 21, 2019 16:39

fix test

4a4d7fa

add haslower and hasupper in bound

040a4a2

Merge branch 'master' into xiang/split_buckey

ba6b516

WangXiangUSTC and others added 2 commits September 5, 2019 11:41

Merge branch 'master' into xiang/split_buckey

9897a07

Merge branch 'master' into xiang/split_buckey

05505bb

IANTHEREAL added status/LGT1 and removed status/PTAL labels Oct 9, 2019

csuzhangxc approved these changes Oct 9, 2019

csuzhangxc added status/LGT2 and removed status/LGT1 labels Oct 9, 2019

WangXiangUSTC merged commit 3b04f08 into pingcap:master Oct 10, 2019

WangXiangUSTC deleted the xiang/split_buckey branch October 10, 2019 02:22
