feat(shell): count_data return estimate count by default #519

Merged
merged 11 commits into from
Apr 16, 2020

Conversation

foreverneverer
Contributor

@foreverneverer foreverneverer commented Apr 15, 2020

What problem does this PR solve?

count_data must scan all online records to get a precise result, which may affect
cluster availability. This PR therefore sets precise = false by default, so the command
returns an estimated count immediately.

What is changed and how it works?

precise now defaults to false. To get a precise result, the user must pass [-c|--precise] explicitly.
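The flag handling described above can be sketched roughly as follows. This is only an illustrative Python model, not the shell's actual C++ code: the function name `parse_count_data` and the long option `--max_batch_count` are assumptions made for the example, and `-b` stands in for any option that only makes sense for a full scan.

```python
import argparse


def parse_count_data(argv):
    """Hypothetical model of the count_data mode selection (not the real shell code)."""
    parser = argparse.ArgumentParser(prog="count_data", add_help=False)
    # -c/--precise comes from the PR description; it is off by default.
    parser.add_argument("-c", "--precise", action="store_true")
    # -b appears in the manual test; the long name here is an assumption.
    parser.add_argument("-b", "--max_batch_count", type=int, default=None)
    args = parser.parse_args(argv)

    if not args.precise:
        # Scan-only options without -c are rejected instead of silently
        # triggering a full online scan.
        if args.max_batch_count is not None:
            raise ValueError("scan options require the [-c|--precise] flag")
        return "estimate"
    return "precise"
```

With this model, `parse_count_data([])` selects the estimate path, `parse_count_data(["-c"])` selects the precise scan, and `parse_count_data(["-b", "10"])` raises an error, mirroring the manual test below.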

Check List

Tests

  • Manual test (add detailed scripts or steps below)
// start onebox
>>> count_data
[count_data]
pidx       estimate_count  
0          14.00           
1          341.00          
2          381.00          
3          361.00          
4          30.00           
5          203.00          
6          63.00           
7          52.00           
(total:8)  1445.00    

>>> count_data -c
INFO: cluster_name = onebox
INFO: app_name = temp
INFO: partition = all
INFO: max_batch_count = 500
INFO: timeout_ms = 5000
INFO: hash_key_filter_type = no_filter
INFO: sort_key_filter_type = no_filter
INFO: value_filter_type = no_filter
INFO: diff_hash_key = false
INFO: stat_size = false
INFO: top_count = 0
INFO: run_seconds = 0
INFO: open app scanner succeed, partition_count = 8
INFO: prepare scanners succeed, split_count = 8
INFO: processed for 1 seconds, (8/8) splits, total 0 rows, last second 0 rows
INFO: split[0]: 0 rows
INFO: split[1]: 0 rows
INFO: split[2]: 0 rows
INFO: split[3]: 0 rows
INFO: split[4]: 0 rows
INFO: split[5]: 0 rows
INFO: split[6]: 0 rows
INFO: split[7]: 0 rows
Count done, total 0 rows.

>>> count_data --precise
INFO: cluster_name = onebox
INFO: app_name = temp
INFO: partition = all
INFO: max_batch_count = 500
INFO: timeout_ms = 5000
INFO: hash_key_filter_type = no_filter
INFO: sort_key_filter_type = no_filter
INFO: value_filter_type = no_filter
INFO: diff_hash_key = false
INFO: stat_size = false
INFO: top_count = 0
INFO: run_seconds = 0
INFO: open app scanner succeed, partition_count = 8
INFO: prepare scanners succeed, split_count = 8
INFO: processed for 1 seconds, (8/8) splits, total 0 rows, last second 0 rows
INFO: split[0]: 0 rows
INFO: split[1]: 0 rows
INFO: split[2]: 0 rows
INFO: split[3]: 0 rows
INFO: split[4]: 0 rows
INFO: split[5]: 0 rows
INFO: split[6]: 0 rows
INFO: split[7]: 0 rows
Count done, total 0 rows.

>>> count_data -b 10
ERROR: you must input [-c|--precise] flag when you expect to get pricise result by scaning all record online
./run.sh bench --type fillrandom_pegasus # 14:40:46
>>> count_data # 14:42:08
[count_data]
pidx       estimate_count  
0          1.00            
1          1.00            
2          0.00            
3          0.00            
4          0.00            
5          0.00            
6          0.00            
7          0.00            
(total:8)  2.00
>>> count_data # 14:48:51
[count_data]
pidx       estimate_count  
0          1262.00         
1          1307.00         
2          1300.00         
3          1306.00         
4          1322.00         
5          1342.00         
6          1332.00         
7          1307.00         
(total:8)  10478.00    

Related changes

Note

If a key-value pair is duplicated (usually because it was written multiple times), the key is counted multiple times, so the estimate only becomes relatively accurate after compaction.
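A toy model may help show why duplicates inflate the estimate. It assumes, purely for illustration, that the estimate is obtained by summing the number of entries in each on-disk file without deduplicating keys, and that compaction merges the files and keeps only the newest value per key. The names `estimate_count` and `compact` are hypothetical and do not come from this PR.

```python
def estimate_count(files):
    """Sum entries per file without deduplicating keys (assumed estimate model)."""
    return sum(len(entries) for entries in files)


def compact(files):
    """Merge all files into one, keeping only the newest value for each key."""
    merged = {}
    for entries in files:  # files are ordered oldest first
        for key, value in entries:
            merged[key] = value  # a later write overwrites an earlier one
    return [list(merged.items())]


# Key "a" was written twice, so it lives in two files before compaction.
files = [[("a", 1), ("b", 1)], [("a", 2)]]
before = estimate_count(files)          # "a" is counted once per file
after = estimate_count(compact(files))  # only distinct keys remain
```

In this sketch the estimate before compaction counts key "a" twice, while the estimate after compaction matches the number of distinct keys.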

@foreverneverer foreverneverer changed the title Refactor count data style refactor: count_data return estimate count defaultly Apr 15, 2020
@neverchanje neverchanje changed the title refactor: count_data return estimate count defaultly feat(shell): count_data return estimate count by default Apr 16, 2020
neverchanje
neverchanje previously approved these changes Apr 16, 2020
@foreverneverer foreverneverer force-pushed the refactor-count-data-style branch from bed16e2 to 8418fea Compare April 16, 2020 06:28
@acelyc111 acelyc111 merged commit fb8fc12 into apache:master Apr 16, 2020
@neverchanje neverchanje mentioned this pull request May 14, 2020
@neverchanje neverchanje mentioned this pull request Jun 10, 2020
acelyc111 pushed a commit that referenced this pull request Jun 23, 2022