PERF-#5533: Improved sort_values by reducing the number of partitions #6589

AndreyPavlenko · 2023-09-19T13:19:26Z

In groupby_reduce(), num_splits is now limited by the current number of partitions. The assumption here is that a groupby should not increase the current data size.
Added a heuristic to the text file reader for calculating the number of partitions:
num_partitions = min((num_rows * num_cols) // 64_000, NPartitions.get())
The approximate number of rows is estimated by reading the first 10 lines.
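
The two changes described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this PR: the function names, the `max_partitions` parameter (standing in for `NPartitions.get()`), the way the row count is extrapolated from the sampled lines (file size divided by the average sampled line length), and the clamp to at least one partition are all hypothetical.

```python
import os


def estimate_num_rows(path, sample_lines=10):
    """Roughly estimate the number of rows in a text file.

    Assumption: the first `sample_lines` lines are representative, so the
    row count is approximated as file_size / average_sampled_line_length.
    """
    file_size = os.path.getsize(path)
    sampled_bytes = 0
    sampled = 0
    with open(path, "rb") as f:
        for line in f:
            sampled_bytes += len(line)
            sampled += 1
            if sampled >= sample_lines:
                break
    if sampled == 0:
        return 0  # empty file
    return int(file_size / (sampled_bytes / sampled))


def heuristic_num_partitions(num_rows, num_cols, max_partitions, cells_per_partition=64_000):
    """Apply min((num_rows * num_cols) // 64_000, max_partitions).

    `max_partitions` stands in for NPartitions.get(). The clamp to at
    least 1 is an assumption added here so that small inputs still get
    one partition; the formula quoted in the description would give 0.
    """
    return max(1, min((num_rows * num_cols) // cells_per_partition, max_partitions))


def limit_num_splits(requested_splits, current_num_partitions):
    """Cap the number of splits by the current number of partitions.

    This mirrors the stated assumption that a groupby should not
    increase the current data size.
    """
    return min(requested_splits, current_num_partitions)
```

With these definitions, `heuristic_num_partitions(1_000_000, 10, 16)` returns 16 (the cap applies), `heuristic_num_partitions(1000, 4, 16)` returns 1, and `limit_num_splits(16, 4)` returns 4.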

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves [PERF] Slow sort_values in value_counts #5533
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

modin/core/io/text/text_file_dispatcher.py

…of partitions

1. In groupby_reduce(), num_splits is limited by the number of partitions. We assume here that groupby should not increase the current data size.
2. Added a heuristic to the text file reader for calculating the number of partitions: num_partitions = min((num_rows * num_cols) // 64_000, NPartitions.get()). An approximate number of rows is estimated by reading the first 10 lines.

Signed-off-by: Andrey Pavlenko <[email protected]>

anmyachev · 2023-09-28T15:41:15Z

@AndreyPavlenko could you provide perf results for reproducer from #5533 (comment)?

AndreyPavlenko · 2023-09-28T16:06:44Z

> @AndreyPavlenko could you provide perf results for reproducer from #5533 (comment)?

This reproducer requires one more fix, for read_csv(), which was dropped in the last commit: AndreyPavlenko@efd91bf#diff-b73ae9581d0213011834cbe1316a85876e77c2bf00b5d93e9e05be078699f04fR1090 . It was decided to implement a separate solution for read_csv().

anmyachev · 2023-09-28T16:17:57Z

> @AndreyPavlenko could you provide perf results for reproducer from #5533 (comment)?
>
> This reproducer requires another one fix for read_csv(), that was dropped in the last commit - AndreyPavlenko@efd91bf#diff-b73ae9581d0213011834cbe1316a85876e77c2bf00b5d93e9e05be078699f04fR1090 . It was decided to implement a separate solution for read_csv().

Should we create another issue in this case?

modin/core/dataframe/pandas/partitioning/partition_manager.py

AndreyPavlenko · 2023-09-29T13:53:17Z

> Should we create another issue in this case?

#6616

anmyachev

LGTM!

AndreyPavlenko force-pushed the issue-5533 branch from 13d3881 to 580d440 · September 21, 2023 11:20

dchigarev reviewed Sep 21, 2023

modin/core/io/text/text_file_dispatcher.py (outdated, resolved)

AndreyPavlenko force-pushed the issue-5533 branch 2 times, most recently from e79e325 to ff03c40 · September 24, 2023 15:08

AndreyPavlenko marked this pull request as ready for review September 24, 2023 16:14

AndreyPavlenko requested a review from a team as a code owner September 24, 2023 16:14

AndreyPavlenko added 4 commits September 25, 2023 21:17

Moved the number of partitions calculation to utils

07637fe

Always read the column names

2ab51d0

Reverted changes in text_file_dispatcher.py

87a81bf

AndreyPavlenko force-pushed the issue-5533 branch from ff03c40 to 87a81bf · September 25, 2023 19:20

dchigarev previously approved these changes Sep 29, 2023

modin/core/dataframe/pandas/partitioning/partition_manager.py (resolved)

Added comment

8842a0c

AndreyPavlenko dismissed dchigarev’s stale review via 8842a0c September 29, 2023 13:51

dchigarev approved these changes Sep 29, 2023

anmyachev approved these changes Sep 29, 2023

anmyachev merged commit 65ad735 into modin-project:master Sep 29, 2023
37 checks passed
