FEAT-#5816: Implement '.split' method for axis partitions #5856

dchigarev · 2023-03-24T13:44:33Z

What do these changes do?

The PR introduces an implementation of the .split() method for axis partitions, making it possible to avoid materialization of the virtual row partitions during reshuffling.
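
As a rough illustration of what splitting a row partition by pivots involves, here is a minimal pandas sketch. The helper name `split_row_partition`, its arguments, and the sample data are hypothetical and are not Modin's API. The point is only that the blocks are joined inside the split function itself, rather than in a separate materialization step beforehand.

```python
import numpy as np
import pandas as pd


def split_row_partition(blocks, key, pivots):
    """Split one row partition into len(pivots) + 1 row-wise bins (hypothetical helper)."""
    # The blocks of a row partition hold different columns of the same rows,
    # so they are joined side by side inside the split function itself.
    df = pd.concat(blocks, axis=1)
    # Assign each row to a bin according to where its key falls among the sorted pivots.
    bins = np.searchsorted(pivots, df[key].to_numpy(), side="right")
    return [df[bins == i] for i in range(len(pivots) + 1)]


# Two column blocks of the same four rows (made-up data).
blocks = [pd.DataFrame({"a": [5, 1, 9, 3]}), pd.DataFrame({"b": list("wxyz")})]
parts = split_row_partition(blocks, key="a", pivots=[2, 6])
```

Each element of `parts` holds the rows whose key falls into one pivot interval, with all columns kept together.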

ASV results:

MODIN_TEST_DATASET_SIZE="Big" asv continuous origin/master HEAD --launch-method=spawn -b TimeSortValues --no-only-changed -a repeat=5

All benchmarks:

       before           after         ratio
     [cd7611cd]       [9316fbcc]
     <master>       <issue_5816>
       949±50ms         910±50ms     0.96  benchmarks.TimeSortValues.time_sort_values([1000000, 10], 10, False)
       947±40ms         898±20ms     0.95  benchmarks.TimeSortValues.time_sort_values([1000000, 10], 1, False)
       1.03±0.04s         909±60ms    ~0.89  benchmarks.TimeSortValues.time_sort_values([1000000, 10], 2, False)

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Do not materialize row partitions while doing reshuffling #5816
tests are passing
module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Dmitry Chigarev <[email protected]>

dchigarev · 2023-03-24T15:16:14Z

modin/core/dataframe/pandas/partitioning/axis_partition.py

+            Positional arguments to pass to the `split_func`.
+        f_kwargs : dict, optional
+            Keyword arguments to pass to the `split_func`.
+        extract_metadata : bool, default: False


The original .split method already does not extract the metadata:

modin/modin/core/dataframe/pandas/partitioning/partition.py
Lines 394 to 398 in cd7611c

        outputs = self.execution_wrapper.deploy(
            split_func, [self._data] + list(args), num_returns=num_splits
        )
        self._is_debug(log) and log.debug(f"EXIT::Partition.split::{self._identity}")
        return [self.__constructor__(output) for output in outputs]

So the full-axis implementation just follows the initial approach.

We don't want to extract metadata because:

Partitions generated by this function are temporary: in the reshuffling flow, split_row_partitions are immediately replaced by new_partitions holding new metadata, so the metadata of split_row_partitions is never accessed:

modin/modin/core/dataframe/pandas/partitioning/partition_manager.py
Lines 1600 to 1608 in cd7611c

        # We need to convert every partition that came from the splits into a full-axis column partition.
        new_partitions = [
            [
                cls._column_partitions_class(row_partition, full_axis=False).apply(
                    final_shuffle_func
                )
            ]
            for row_partition in split_row_partitions
        ]

The splitting stage generates a lot of partitions (up to ncores ^ 2). It is already hard for Ray to put that many futures into storage at once, and it gets even worse when we ask it to store the metadata futures as well (4 * (ncores ^ 2) futures at once). I measured the case from "[PERF] Slow sort_values in value_counts #5533" with and without the partitions' metadata and got a stable 9% speed-up (~0.12s) for the case without metadata.
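
To make the estimate above concrete, here is a small arithmetic sketch. The function name is hypothetical, and the factor of 4 is taken directly from the estimate in this comment rather than measured. Note that `^` above means exponentiation, which is `**` in Python.

```python
def futures_at_split_stage(ncores: int, with_metadata: bool) -> int:
    """Upper bound on futures created by the splitting stage (hypothetical helper)."""
    # Up to ncores ** 2 temporary partitions are produced by the split.
    n_partitions = ncores**2
    # Per the estimate above: 4 futures per partition with metadata, 1 without.
    futures_per_partition = 4 if with_metadata else 1
    return n_partitions * futures_per_partition


without_meta = futures_at_split_stage(16, with_metadata=False)  # 256
with_meta = futures_at_split_stage(16, with_metadata=True)  # 1024
```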

dchigarev · 2023-03-30T09:26:57Z

modin/core/dataframe/pandas/partitioning/partition_manager.py

@@ -1608,5 +1610,5 @@ def shuffle_partitions(
        else:
            # If there are not pivots we can simply apply the function row-wise
            return np.array(
-                [[row_part.apply(final_shuffle_func)] for row_part in row_partitions]
+                [row_part.apply(final_shuffle_func) for row_part in row_partitions]


row_part is now actually a row partition that returns a list, so there's no longer any need to wrap the result in a list.
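
The effect of the extra wrapping on the resulting array shape can be seen in a small sketch. `FakeRowPartition` is a hypothetical stand-in that only mimics an `apply` returning a list; it is not a Modin class.

```python
import numpy as np


class FakeRowPartition:
    """Hypothetical stand-in whose apply() returns a list of block labels."""

    def __init__(self, row, n_blocks):
        self.row = row
        self.n_blocks = n_blocks

    def apply(self, func):
        # One result per block, returned as a plain list.
        return [func(f"block-{self.row}-{col}") for col in range(self.n_blocks)]


def identity(block):
    return block


row_partitions = [FakeRowPartition(row, n_blocks=3) for row in range(2)]

# Old form: wrapping an already-list result adds a spurious middle axis.
wrapped = np.array([[part.apply(identity)] for part in row_partitions])
# New form: the list returned by apply() is used as the row directly.
unwrapped = np.array([part.apply(identity) for part in row_partitions])
```

Here `wrapped.shape` is `(2, 1, 3)` while `unwrapped.shape` is `(2, 3)`, which is the 2D layout a partition grid needs.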

RehanSD

Thank you so much @dchigarev! This is a great catch! I've left a few minor comments, but overall PR looks great to me!

RehanSD · 2023-04-06T10:49:50Z

modin/core/dataframe/pandas/partitioning/partition_manager.py

@@ -1608,5 +1610,5 @@ def shuffle_partitions(
        else:
            # If there are not pivots we can simply apply the function row-wise
            return np.array(
-                [[row_part.apply(final_shuffle_func)] for row_part in row_partitions]
+                [row_part.apply(final_shuffle_func) for row_part in row_partitions]


Makes sense!

modin/core/execution/dask/implementations/pandas_on_dask/partitioning/virtual_partition.py

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py

...n/core/execution/unidist/implementations/pandas_on_unidist/partitioning/virtual_partition.py

modin/core/dataframe/base/partitioning/axis_partition.py

Signed-off-by: Dmitry Chigarev <[email protected]>

anmyachev

LGTM!

anmyachev · 2023-04-13T16:47:59Z

@RehanSD are you ok with the current changes?

RehanSD

LGTM! Thank you for the awesome work @dchigarev!

dchigarev added 4 commits March 24, 2023 13:41

FEAT-#5816: Implement '.split' method for axis partitions

c9aa819

Signed-off-by: Dmitry Chigarev <[email protected]>

Merge remote-tracking branch 'origin/master' into issue_5816

7234d4f

fix docstyle

609ad88

Signed-off-by: Dmitry Chigarev <[email protected]>

try fix doc building

9316fbc

Signed-off-by: Dmitry Chigarev <[email protected]>

dchigarev commented Mar 24, 2023

dchigarev marked this pull request as ready for review March 24, 2023 22:24

dchigarev requested a review from a team as a code owner March 24, 2023 22:24

Merge remote-tracking branch 'origin/master' into issue_5816

42b2250

dchigarev commented Mar 30, 2023

RehanSD requested changes Apr 6, 2023

dchigarev added 2 commits April 6, 2023 15:28

Merge remote-tracking branch 'origin/master' into issue_5816

d0c70e2

Apply review suggestions

d23b4f8

Signed-off-by: Dmitry Chigarev <[email protected]>

dchigarev requested a review from RehanSD April 7, 2023 12:25

anmyachev approved these changes Apr 13, 2023

RehanSD approved these changes Apr 13, 2023

RehanSD merged commit 15667fa into modin-project:master Apr 13, 2023
