feat(shell): count_data return estimate count by default #519

foreverneverer · 2020-04-15T11:48:06Z

What problem does this PR solve?

count_data needs to scan all online records to get a precise result, which may affect
cluster availability. This PR therefore sets precise = false by default, so count_data returns an
estimated count immediately.

What is changed and how it works?

precise now defaults to false. To get a precise result, the user must pass [-c|--precise] explicitly.
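The rule can be sketched as below. This is only an illustration, not the code in data_operations.cpp: `count_data_options`, `count_mode` and `resolve_count_mode` are hypothetical names. It also assumes that every flag other than [-c|--precise] is a scan-only option, which matches the `-b` case in the manual test but has not been checked against the full flag list.

```cpp
#include <string>
#include <vector>

// Hypothetical option set; the real shell command parses many more flags.
struct count_data_options
{
    bool precise = false;    // default after this PR: return the estimate
    bool needs_scan = false; // set when a scan-only flag such as -b is seen
};

enum class count_mode
{
    ESTIMATE, // return the per-partition estimate immediately
    PRECISE,  // scan all records online
    INVALID   // scan-only flag given without -c/--precise
};

// Decide which path count_data would take for a given argument list.
count_mode resolve_count_mode(const std::vector<std::string> &args)
{
    count_data_options opts;
    for (const auto &arg : args) {
        if (arg == "-c" || arg == "--precise") {
            opts.precise = true;
        } else if (!arg.empty() && arg[0] == '-') {
            // Assumption: any other flag only makes sense for a full scan.
            // Flag values such as "10" are skipped by this simplified loop.
            opts.needs_scan = true;
        }
    }
    if (opts.needs_scan && !opts.precise) {
        return count_mode::INVALID;
    }
    return opts.precise ? count_mode::PRECISE : count_mode::ESTIMATE;
}
```

With this sketch, an empty argument list resolves to the estimate path, `-c` or `--precise` resolves to the scan path, and `-b 10` alone is rejected, which mirrors the three cases in the manual test.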

Check List

Tests

Manual test (add detailed scripts or steps below)

// start onebox
>>> count_data
[count_data]
pidx       estimate_count  
0          14.00           
1          341.00          
2          381.00          
3          361.00          
4          30.00           
5          203.00          
6          63.00           
7          52.00           
(total:8)  1445.00    

>>> count_data -c
INFO: cluster_name = onebox
INFO: app_name = temp
INFO: partition = all
INFO: max_batch_count = 500
INFO: timeout_ms = 5000
INFO: hash_key_filter_type = no_filter
INFO: sort_key_filter_type = no_filter
INFO: value_filter_type = no_filter
INFO: diff_hash_key = false
INFO: stat_size = false
INFO: top_count = 0
INFO: run_seconds = 0
INFO: open app scanner succeed, partition_count = 8
INFO: prepare scanners succeed, split_count = 8
INFO: processed for 1 seconds, (8/8) splits, total 0 rows, last second 0 rows
INFO: split[0]: 0 rows
INFO: split[1]: 0 rows
INFO: split[2]: 0 rows
INFO: split[3]: 0 rows
INFO: split[4]: 0 rows
INFO: split[5]: 0 rows
INFO: split[6]: 0 rows
INFO: split[7]: 0 rows
Count done, total 0 rows.

>>> count_data --precise
INFO: cluster_name = onebox
INFO: app_name = temp
INFO: partition = all
INFO: max_batch_count = 500
INFO: timeout_ms = 5000
INFO: hash_key_filter_type = no_filter
INFO: sort_key_filter_type = no_filter
INFO: value_filter_type = no_filter
INFO: diff_hash_key = false
INFO: stat_size = false
INFO: top_count = 0
INFO: run_seconds = 0
INFO: open app scanner succeed, partition_count = 8
INFO: prepare scanners succeed, split_count = 8
INFO: processed for 1 seconds, (8/8) splits, total 0 rows, last second 0 rows
INFO: split[0]: 0 rows
INFO: split[1]: 0 rows
INFO: split[2]: 0 rows
INFO: split[3]: 0 rows
INFO: split[4]: 0 rows
INFO: split[5]: 0 rows
INFO: split[6]: 0 rows
INFO: split[7]: 0 rows
Count done, total 0 rows.

>>> count_data -b 10
ERROR: you must input [-c|--precise] flag when you expect to get pricise result by scaning all record online

./run.sh bench --type fillrandom_pegasus # 14:40:46
>>> count_data # 14:42:08
[count_data]
pidx       estimate_count  
0          1.00            
1          1.00            
2          0.00            
3          0.00            
4          0.00            
5          0.00            
6          0.00            
7          0.00            
(total:8)  2.00
>>> count_data # 14:48:51
[count_data]
pidx       estimate_count  
0          1262.00         
1          1307.00         
2          1300.00         
3          1306.00         
4          1322.00         
5          1342.00         
6          1332.00         
7          1307.00         
(total:8)  10478.00

Related changes

Need to cherry-pick to the release branch (Note: this PR depends on feat(collector): add statistics for estimate key number of table #437)
Need to update the documentation
Need to be included in the release note

Note

If a key-value pair is duplicated (usually because it was written multiple times), the key is counted multiple times, so the estimate only becomes relatively accurate after compaction.
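A toy model may make the note easier to follow. It is not how the storage engine actually computes its estimate; `toy_partition` and its methods are hypothetical and only show why rewritten keys inflate the estimate until compaction.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model: every write appends an entry; nothing is merged until compaction.
struct toy_partition
{
    std::vector<std::string> entries;

    void put(const std::string &key) { entries.push_back(key); }

    // The estimate counts stored entries, so a rewritten key is counted again.
    std::size_t estimate_count() const { return entries.size(); }

    // The precise count scans everything and counts each distinct key once.
    std::size_t precise_count() const
    {
        return std::set<std::string>(entries.begin(), entries.end()).size();
    }

    // Compaction keeps a single entry per key.
    void compact()
    {
        std::set<std::string> unique(entries.begin(), entries.end());
        entries.assign(unique.begin(), unique.end());
    }
};
```

Writing "a", "b" and then "a" again gives an estimate of 3 against a precise count of 2 in this model; after `compact()` both report 2.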

src/shell/commands/data_operations.cpp

…us into refactor-count-data-style

src/shell/commands/data_operations.cpp

…t" commit (#519)

foreverneverer added 4 commits April 15, 2020 17:53

add

013435d

add

dd37448

add

b10877e

format

943978e

foreverneverer changed the title ~~Refactor count data style~~ refactor: count_data return estimate count defaultly Apr 15, 2020

foreverneverer added 2 commits April 16, 2020 09:50

format

5b89ef7

Merge branch 'master' into refactor-count-data-style

cf95705

levy5307 reviewed Apr 16, 2020


src/shell/commands/data_operations.cpp

levy5307 reviewed Apr 16, 2020
