PERF-#4494: Get partition widths/lengths in parallel instead of serially #4683
base: master
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #4683       +/-   ##
===========================================
- Coverage   85.28%   72.15%   -13.13%
===========================================
  Files         259      259
  Lines       19378    19496     +118
===========================================
- Hits        16527    14068    -2459
- Misses       2851     5428    +2577
@@ -377,6 +377,20 @@ def length(self):
    """
    if self._length_cache is None:
        if self.axis == 0:
            caches = [
This logic is duplicated from the `PartitionManager` classes above, but I'm not sure how to access the correct partition manager from here.

Haven't taken a closer look at the implementation details, but do you have any benchmarks or performance measurements to compare with master?
Sadly no, and I'd appreciate some suggestions on what code to run. Rehan suggested manually invalidating the `._row_lengths_cache` and `.length_cache` fields on a dataframe and its partitions, then ensuring they're recomputed properly. It succeeds for simple examples, but I had trouble producing a Ray timeline, and I'm not sure how else to benchmark it (most API-level dataframe manipulations would probably hit the cached length/width).
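The manual-invalidation check described above can be phrased as a small harness. The sketch below uses a toy partition class with no Modin dependency; `ToyPartition` and its attributes are hypothetical stand-ins for a block partition with a length cache, and the `time.sleep` call merely simulates remote computation cost:

```python
import time

class ToyPartition:
    """Toy stand-in for a block partition with a length cache."""

    def __init__(self, data):
        self._data = data
        self._length_cache = None

    def length(self):
        if self._length_cache is None:
            time.sleep(0.01)  # pretend the size lives on a remote worker
            self._length_cache = len(self._data)
        return self._length_cache

parts = [ToyPartition(list(range(n))) for n in (3, 4, 5)]

first = [p.length() for p in parts]   # cold: "remote" computation happens here

# Manually invalidate the caches, as suggested for the benchmark...
for p in parts:
    p._length_cache = None

second = [p.length() for p in parts]  # ...and check they are recomputed correctly
```

Timing the cold passes before and after a change like this PR's is one way to observe whether invalidated sizes are recomputed serially or in parallel.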
I spent a while today trying to get a script that showcases the performance here without breaking anything in Modin, but I failed. Getting a reproducer is hard for a few reasons. For one thing, this optimization is only useful for unusual cases like in #4493 where the partitions' call queues include costly operations. When there is no call queue, the partitions will execute all dataframe functions eagerly, simultaneously calculating shapes. The call queues are generally meant to carry cheap operations like transpose and reindexing, but the reproducer in that issue has a frame that is so expensive to serialize that even the transpose was expensive.

Looking at all the serial shape computations I listed here, most are in internal length computations. I think it's good practice to get multiple Ray objects in parallel (see also the linked note about a similar improvement).
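The principle behind the change can be illustrated without Ray at all: requesting and awaiting each result one at a time serializes the waits, while submitting everything first and waiting once overlaps them. A minimal sketch with `concurrent.futures` standing in for `ray.get` (all names here are illustrative, not Modin's API):

```python
import concurrent.futures
import time

def compute_length(part_id: int) -> int:
    # Stand-in for a remote task computing one partition's length.
    time.sleep(0.05)
    return 100 + part_id

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    # Serial pattern: each length is requested and awaited one at a time.
    start = time.perf_counter()
    serial = [pool.submit(compute_length, i).result() for i in range(8)]
    serial_time = time.perf_counter() - start

    # Batched pattern: submit all tasks first, then wait once for all of them.
    start = time.perf_counter()
    futures = [pool.submit(compute_length, i) for i in range(8)]
    batched = [f.result() for f in futures]
    batched_time = time.perf_counter() - start

assert serial == batched  # same answers, very different wall time
```

With eight 50 ms tasks, the serial pattern takes roughly eight times as long as the batched one, which is the effect this PR targets for partition shape queries.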
This adds a certain amount of complexity (judging by the number of lines changed; I haven't looked at the diff yet), and I haven't yet seen any performance proof for it. I would like to see some measurements before increasing our (already huge) codebase...
Left some comments, but great work!
@@ -1214,6 +1233,19 @@ def apply_func_to_indices_both_axis(
    if col_widths is None:
        col_widths = [None] * len(col_partitions_list)

    if row_lengths is None and col_widths is None:
Why do we need to compute dimensions here?
The `length` and `width` values of each partition are accessed in the local `compute_part_size`, defined immediately below. The double `for` loop structure where `compute_part_size` is called makes it hard to parallelize the computation of these dimensions, so I thought it would be simplest to precompute the relevant dimensions before the loop.
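The restructuring described here can be sketched abstractly: instead of letting a nested loop trigger one blocking size lookup per cell, resolve every row length and column width up front in one parallel batch. Everything below is a hypothetical stand-in (a thread pool in place of Ray/Dask tasks), not Modin's actual code:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def measure(axis_id: int) -> int:
    time.sleep(0.01)  # stand-in for a remote length/width computation
    return 10 * (axis_id + 1)

row_parts, col_parts = range(3), range(4)

with ThreadPoolExecutor() as pool:
    # Precompute all dimensions in parallel before entering the loop...
    row_lengths = list(pool.map(measure, row_parts))
    col_widths = list(pool.map(measure, col_parts))

# ...so the double loop only reads precomputed values and never blocks.
cells = [
    (row_lengths[i], col_widths[j])
    for i in row_parts
    for j in col_parts
]
```

The nested loop itself is unchanged; only the blocking size lookups are hoisted out of it.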
@@ -273,13 +274,42 @@ def length(self):
    int
        The length of the object.
    """
    self.try_build_length_cache()
    return self._length_cache
We need to unwrap `_length_cache` here, since its type will be `PandasDataframePartition`.
What do you mean by unwrap? Also, as far as I can tell, the logic for this method should be the same as it originally was (the code was just moved into `try_build_length_cache`), so does this mean the original code returned `PandasDataframePartition` as well?
for i, cache in enumerate(caches):
    if isinstance(cache, Future):
        self.list_of_partitions_to_combine[i].try_set_length_cache(
            new_lengths[dask_idx]
Shouldn't this just be `i` as well?
No, since `new_lengths` may have fewer elements than `caches` in the case where some length values were already computed (and are filtered out by the `isinstance(cache, Future)` check). The value computed at `new_lengths[dask_idx]` should correspond to the promise at `caches[i]`.
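The indexing scheme being defended here — the position `i` into the full cache list, plus a separate counter into the shorter list of newly materialized values — can be sketched generically. The `PENDING` sentinel below plays the role of an unmaterialized `Future`; all names are illustrative, not Modin's:

```python
# Caches hold either an already-known int or a PENDING sentinel
# standing in for a not-yet-materialized future.
PENDING = object()
caches = [5, PENDING, 7, PENDING, PENDING]

# Materializing only the pending entries yields a shorter list:
# one result per PENDING slot, in order of appearance.
new_lengths = [11, 13, 17]

lengths = list(caches)
fetched_idx = 0  # plays the role of dask_idx: indexes new_lengths, not caches
for i, cache in enumerate(caches):
    if cache is PENDING:
        # i addresses the partition; fetched_idx addresses the fetched result.
        lengths[i] = new_lengths[fetched_idx]
        fetched_idx += 1
```

Using `i` for both lists would misalign results as soon as any cache was already populated, which is exactly the case the reply describes.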
@vnlitvinov that makes sense, I'll look into coming up with concrete benchmarks.

@pyrito please have a look at https://github.com/vnlitvinov/modin/tree/speedup-masking and #4726, it might be doing somewhat the same in terms of getting the sizes in parallel.

Related discussion on handling metadata (index and columns) in #3673.
Force-pushed from 6a17fc3 to e0bb5fa.
Signed-off-by: Jonathan Shi <[email protected]>
Co-authored-by: Rehan Sohail Durrani <[email protected]>
Force-pushed from e0bb5fa to 490778c.
What do these changes do?

Computes widths and lengths of block partitions in parallel, as batched calls to `ray.get`/`DaskWrapper.materialize`, rather than serially. This adds the `try_build_[length|width]_cache` and `try_set_[length|width]_cache` methods to block partitions; the former returns a promise/future for computing the partition's length, and the latter should be called by the partition manager to inform the block partition of the computation's value. This also adds the `_update_partition_dimension_caches` method to the `PartitionManager` class, which will call the length/width futures returned by its constituent partitions.

- passes `flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py`
- passes `black --check modin/ asv_bench/benchmarks scripts/doc_checker.py`
- signed commits with `git commit -s`
- `docs/development/architecture.rst` is up-to-date
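Based on the description above, the protocol between partitions and the partition manager can be sketched with toy classes. This is a simplification under stated assumptions, not the PR's actual code: `ToyPartition` and `materialize` are hypothetical stand-ins (the latter for `ray.get`/`DaskWrapper.materialize`), and plain lambdas model futures:

```python
class ToyPartition:
    """Block partition holding either a cached length or a deferred task."""

    def __init__(self, data):
        self._data = data
        self._length_cache = None

    def try_build_length_cache(self):
        # Return a zero-argument "future" if the length is unknown,
        # or the cached int if it was already computed.
        if self._length_cache is None:
            return lambda: len(self._data)  # deferred computation
        return self._length_cache

    def try_set_length_cache(self, value):
        # Called by the manager once the batched result is available.
        self._length_cache = value

def materialize(futures):
    # Stand-in for ray.get / DaskWrapper.materialize on a list of futures.
    return [f() for f in futures]

def update_partition_dimension_caches(parts):
    caches = [p.try_build_length_cache() for p in parts]
    pending = [c for c in caches if callable(c)]
    # One batched materialization instead of one blocking call per partition.
    results = iter(materialize(pending))
    for part, cache in zip(parts, caches):
        if callable(cache):
            part.try_set_length_cache(next(results))

parts = [ToyPartition("a" * n) for n in (2, 5, 9)]
update_partition_dimension_caches(parts)
```

The key design point this mirrors is that partitions never block on their own sizes; the manager collects all pending futures and resolves them in a single batch.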