FEAT-#5816: Implement '.split' method for axis partitions #5856

Merged
merged 7 commits into from
Apr 13, 2023


@dchigarev dchigarev commented Mar 24, 2023

What do these changes do?

The PR introduces an implementation of the .split() method for axis partitions, making it possible to avoid materialization of the virtual row partitions during reshuffling.

ASV results:

MODIN_TEST_DATASET_SIZE="Big" asv continuous origin/master HEAD --launch-method=spawn -b TimeSortValues --no-only-changed -a repeat=5

All benchmarks:

       before           after         ratio
     [cd7611cd]       [9316fbcc]
     <master>         <issue_5816>
       949±50ms         910±50ms     0.96  benchmarks.TimeSortValues.time_sort_values([1000000, 10], 10, False)
       947±40ms         898±20ms     0.95  benchmarks.TimeSortValues.time_sort_values([1000000, 10], 1, False)
     1.03±0.04s         909±60ms    ~0.89  benchmarks.TimeSortValues.time_sort_values([1000000, 10], 2, False)
  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Do not materialize row partitions while doing reshuffling #5816
  • tests are passing
  • module layout described at docs/development/architecture.rst is up-to-date

Positional arguments to pass to the `split_func`.
f_kwargs : dict, optional
Keyword arguments to pass to the `split_func`.
extract_metadata : bool, default: False
Collaborator Author

The original .split method already works without extracting the metadata:

outputs = self.execution_wrapper.deploy(
    split_func, [self._data] + list(args), num_returns=num_splits
)
self._is_debug(log) and log.debug(f"EXIT::Partition.split::{self._identity}")
return [self.__constructor__(output) for output in outputs]

So the full-axis implementation just follows the initial approach.

We don't want to extract metadata because:

  1. Partitions generated by this function are temporary. In the reshuffling flow, split_row_partitions are immediately replaced by new_partitions, which hold new metadata, so the metadata of split_row_partitions is never accessed:
    # We need to convert every partition that came from the splits into a full-axis column partition.
    new_partitions = [
        [
            cls._column_partitions_class(row_partition, full_axis=False).apply(
                final_shuffle_func
            )
        ]
        for row_partition in split_row_partitions
    ]
  2. The splitting stage generates many partitions (up to ncores ^ 2). Putting that many futures into storage at once is already hard for Ray, and it gets worse if we also ask it to store the metadata futures (4 * (ncores ^ 2) futures at once). I measured the case from [PERF] Slow sort_values in value_counts #5533 with and without the partitions' metadata and got a stable 9% speed-up (~0.12s) for the case without metadata.
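
To make the scale in point 2 concrete, here is a minimal sketch of the arithmetic. The helper name `futures_after_split` is hypothetical, and the reading that the 4x factor means "one data future plus three metadata futures per partition" is an assumption about the comment, not something stated in it. Note that `ncores ^ 2` above means "squared"; in Python that is written `**`, since `^` is XOR.

```python
def futures_after_split(ncores: int, with_metadata: bool) -> int:
    """Upper bound on futures created by the splitting stage (hypothetical helper)."""
    # Up to ncores ** 2 temporary partitions come out of the split.
    n_partitions = ncores ** 2
    # Assumption: 1 data future per partition, or 4 when metadata is stored too.
    futures_per_partition = 4 if with_metadata else 1
    return n_partitions * futures_per_partition


for ncores in (8, 16, 64):
    without = futures_after_split(ncores, with_metadata=False)
    with_md = futures_after_split(ncores, with_metadata=True)
    print(f"ncores={ncores}: {without} futures without metadata, {with_md} with")
```

The gap grows quadratically with the core count, which is consistent with the motivation for skipping metadata on partitions that are discarded right away.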

@dchigarev dchigarev marked this pull request as ready for review March 24, 2023 22:24
@dchigarev dchigarev requested a review from a team as a code owner March 24, 2023 22:24
@@ -1608,5 +1610,5 @@ def shuffle_partitions(
         else:
             # If there are not pivots we can simply apply the function row-wise
             return np.array(
-                [[row_part.apply(final_shuffle_func)] for row_part in row_partitions]
+                [row_part.apply(final_shuffle_func) for row_part in row_partitions]
Collaborator Author

row_part is now an actual row partition whose apply returns a list, so there is no need to wrap the result in a list anymore
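
The shape difference can be illustrated with a toy sketch. `ToyBlockPartition` and `ToyRowPartition` are hypothetical stand-ins, not Modin classes, and the assumption that the old code operated on objects whose `apply` returned a single partition is inferred from the comment above.

```python
class ToyBlockPartition:
    """Hypothetical stand-in for a single block partition."""

    def __init__(self, data):
        self.data = data

    def apply(self, func):
        # Returns ONE partition, so callers must wrap it to build a 2D grid.
        return ToyBlockPartition(func(self.data))


class ToyRowPartition:
    """Hypothetical stand-in for a virtual row partition built from blocks."""

    def __init__(self, blocks):
        self.blocks = blocks

    def apply(self, func):
        # Joins the blocks, applies func, and returns a LIST of partitions.
        joined = [x for block in self.blocks for x in block.data]
        return [ToyBlockPartition(func(joined))]


# Old style: apply returns a single object, so each result is wrapped in a list.
block_parts = [ToyBlockPartition([3, 1, 2]), ToyBlockPartition([9, 7, 8])]
grid_old = [[bp.apply(sorted)] for bp in block_parts]

# New style: apply already returns a list, so no extra wrapping is needed.
row_parts = [
    ToyRowPartition([ToyBlockPartition([3, 1]), ToyBlockPartition([2])]),
    ToyRowPartition([ToyBlockPartition([9, 7]), ToyBlockPartition([8])]),
]
grid_new = [rp.apply(sorted) for rp in row_parts]
```

Both `grid_old` and `grid_new` end up as a list of one-element rows, which is the 2D layout that `np.array(...)` receives in the diff.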

Collaborator

Makes sense!

@RehanSD RehanSD left a comment

Thank you so much @dchigarev! This is a great catch! I've left a few minor comments, but overall PR looks great to me!

@dchigarev dchigarev requested a review from RehanSD April 7, 2023 12:25
@anmyachev anmyachev left a comment

LGTM!

@anmyachev

@RehanSD are you ok with the current changes?

@RehanSD RehanSD left a comment

LGTM! Thank you for the awesome work @dchigarev!

@RehanSD RehanSD merged commit 15667fa into modin-project:master Apr 13, 2023