PERF-#5533: Improved sort_values by reducing the number of partitions #6589
…of partitions
1. In groupby_reduce(), num_splits is limited by the number of partitions. We assume here that groupby should not increase the current data size.
2. Added a heuristic to the text file reader for calculating the number of partitions: num_partitions = min((num_rows * num_cols) // 64_000, NPartitions.get()). An approximate number of rows is estimated by reading the first 10 lines.
Signed-off-by: Andrey Pavlenko <[email protected]>
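As a rough illustration of point 1, the cap on num_splits could look like the sketch below. The helper name `cap_num_splits` and its parameters are hypothetical and are not taken from the Modin code base; the only thing borrowed from the commit message is the idea that a groupby result should not be split into more pieces than the frame already has.

```python
def cap_num_splits(requested_splits: int, current_partitions: int) -> int:
    """Return a split count that never exceeds the current partition count.

    Hypothetical helper: it mirrors the assumption from the commit message
    that groupby does not grow the data, so more splits than existing
    partitions would only produce smaller, more numerous blocks.
    """
    # Keep at least one split so the result is always a valid frame layout.
    return max(1, min(requested_splits, current_partitions))
```

With this sketch, a request for 16 splits on a frame that currently has 4 partitions yields 4.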
@AndreyPavlenko could you provide perf results for the reproducer from #5533 (comment)?
This reproducer requires another fix for read_csv() that was dropped in the last commit: AndreyPavlenko@efd91bf#diff-b73ae9581d0213011834cbe1316a85876e77c2bf00b5d93e9e05be078699f04fR1090. It was decided to implement a separate solution for read_csv().
Should we create another issue in this case?
LGTM!
num_partitions = min((num_rows * num_cols) // 64_000, NPartitions.get())
An approximate number of rows is estimated by reading the first 10 lines.
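A minimal sketch of how such a heuristic might work, using only the standard library. Everything except the formula itself is an assumption: the function names, the choice to treat the first line as a header and sample the 10 lines after it, the size-based row estimate, and the `max(1, ...)` floor are illustrative and not the reader code from this PR. `max_partitions` stands in for the value of `NPartitions.get()`.

```python
import os

CELLS_PER_PARTITION = 64_000  # divisor from the formula above
SAMPLE_LINES = 10             # number of lines sampled for the row estimate


def estimate_shape(path, sep=",", sample_lines=SAMPLE_LINES):
    """Estimate (num_rows, num_cols) of a delimited text file.

    Assumes the first line is a header with no quoted separators.
    """
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.readline()
        # readline() returns b"" at EOF, so short files yield fewer samples.
        sample = [line for line in (f.readline() for _ in range(sample_lines)) if line]
    num_cols = header.count(sep.encode()) + 1
    if not sample:
        return 0, num_cols
    # Extrapolate: remaining bytes divided by the average sampled row length.
    avg_row_bytes = sum(len(line) for line in sample) / len(sample)
    num_rows = int((file_size - len(header)) / avg_row_bytes)
    return num_rows, num_cols


def estimate_num_partitions(path, max_partitions, sep=","):
    """Apply min((num_rows * num_cols) // 64_000, max_partitions)."""
    num_rows, num_cols = estimate_shape(path, sep)
    # The floor of 1 is an added assumption so tiny files still get a partition.
    return max(1, min((num_rows * num_cols) // CELLS_PER_PARTITION, max_partitions))
```

Under these assumptions, a four-column file with about 100,000 rows gives 400,000 // 64_000 = 6 partitions, provided `max_partitions` is at least 6.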
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst is up-to-date