PERF-#6464: Improve reshuffling for multi-column groupby in low-cardinality cases #6533
Conversation
@@ -2432,17 +2435,9 @@ def _apply_func_to_range_partitioning(
# simply combine all partitions and apply the sorting to the whole dataframe
return self.combine_and_apply(func=func)

if is_numeric_dtype(self.dtypes[key_column]):
this logic was moved to the ShuffleSortFunctions.pivot_fn()
- key_column=columns[0], func=sort_function, ascending=ascending, **kwargs
+ key_columns=columns[0], func=sort_function, ascending=ascending, **kwargs
we still pass only one column in the case of .sort_values(), as its performance with the new functionality has not yet been tested
@@ -2432,56 +2435,53 @@ def _apply_func_to_range_partitioning(
# simply combine all partitions and apply the sorting to the whole dataframe
return self.combine_and_apply(func=func)

if is_numeric_dtype(self.dtypes[key_column]):
    method = "linear"
if ideal_num_new_partitions < len(self._partitions):
this logic was moved under the if sampling_probability >= 1 condition so it only takes place when ideal_num_new_partitions was actually modified
range(
    0,
    len(self._partitions),
    round(len(self._partitions) / ideal_num_new_partitions),
why round and not math.ceil?
Well, I didn't write the code, I just moved it a few lines up. My understanding is that we want ideal_num_new_partitions to be at least 1.5 times smaller than the current number of partitions before the partitioning is actually shrunk; as far as I understand, this is done so we wouldn't shrink too much in case len(parts) is just slightly bigger than ideal_num_parts (here arr is assumed to be an array of 18 elements, e.g. np.arange(18)):
>>> import math
>>> import numpy as np
>>> arr = np.arange(18)  # pretend we have 18 partitions
>>> step = math.ceil(18 / 17)
>>> len(np.split(arr, range(step, 18, step)))  # we wanted 17 parts but got 9, too few
9
>>> step = round(18 / 17)
>>> len(np.split(arr, range(step, 18, step)))  # we wanted 17 parts but got 18, not good not bad
18
But just in case, pinging @RehanSD as the author of this code.
oh, and @dchigarev - awesome PR description with very thorough details and measurements, looks super-cool and informative!
@vnlitvinov thanks for your review! All the comments were answered.
LGTM, awesome work @dchigarev!
Let's give it a day or two for other people to take a look, then we can merge if there are no objections.
What's the problem?
The current implementation of reshuffling groupby takes the first column that was specified in the by argument (for example, if groupby(["col1", "col2"]) was called, then "col1" is taken) and builds range partitioning based on that single column. The problem here is that if "col1" has very low cardinality (too few unique values), there will be too few range partitions in the end and thus poor core utilization.
We can't do much about this problem (at least in terms of the range-partitioning implementation) when grouping on a single column. However, when multiple columns are used for grouping, we can try using all of the by columns to build the range partitioning and potentially increase the number of range-bins being generated.
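To illustrate the low-cardinality problem, here is a small standalone sketch (this is not Modin's reshuffling code; the data, column names, and quantile-based pivot picking are only assumptions for the example). No matter how many pivots we request from a column with 3 unique values, we never get more than 3 non-empty range-bins:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "col1": rng.integers(0, 3, 100_000),    # low cardinality: only 3 unique values
    "col2": rng.integers(0, 100, 100_000),  # higher cardinality
})

# ask for ~16 range-bins by taking quantile pivots of "col1" only
pivots = np.quantile(df["col1"], np.linspace(0, 1, 17)[1:-1])
bins = np.searchsorted(pivots, df["col1"].to_numpy())
print(pd.Series(bins).nunique())  # at most 3 non-empty bins, so at most 3 busy cores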
How was it solved?
The reshuffling implementation now samples all of the by columns instead of a single one in order to compute their quantile ranges. It then goes through the columns one by one and computes the boundaries (pivots) for the range-bins; once the required number of bins has been produced, it stops. Then, at the splitting stage, these per-column range-bins are combined, so in the result we get more bins (although some of them will likely be empty).
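Continuing the sketch above (again a hypothetical illustration, not the actual ShuffleSortFunctions.pivot_fn() implementation; combined_bins is a made-up helper), combining per-column bin indices into a single bin id multiplies the number of possible buckets, even though some of them end up empty:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "col1": rng.integers(0, 3, 100_000),
    "col2": rng.integers(0, 100, 100_000),
})

def combined_bins(frame, pivots_per_column):
    """Combine per-column range-bins into a single bin id (illustrative only)."""
    bin_ids = np.zeros(len(frame), dtype=np.int64)
    for col, pivots in pivots_per_column.items():
        col_bins = np.searchsorted(pivots, frame[col].to_numpy())
        bin_ids = bin_ids * (len(pivots) + 1) + col_bins
    return bin_ids

# 3 pivots per column -> up to (3 + 1) * (3 + 1) = 16 combined buckets,
# even though "col1" alone can never fill more than 3 of its 4 bins
pivots_per_column = {
    "col1": np.quantile(df["col1"], [0.25, 0.5, 0.75]),
    "col2": np.quantile(df["col2"], [0.25, 0.5, 0.75]),
}
bins = combined_bins(df, pivots_per_column)
print(pd.Series(bins).nunique())  # noticeably more non-empty buckets than with "col1" alone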
Is performance better now?
I went through some multi-column groupby cases from the benchmarks we usually run on our side (I found such cases in nyc-taxi and h2o, and also included our internal customer's workload for which these changes were initially intended). Surprisingly to me, most of the cases had this low-cardinality problem (although it is mostly visible only with MODIN_CPUS=112).
Under the spoilers you may see the detailed report for each of the workloads (including details regarding the original data sizes and data cardinality). Here's just an aggregated table with the time measurements:
The bar-charts under the spoilers show how the data portions are distributed between the range-bins (buckets).
Scenario when NUM_CPUS=112
NYC-taxi cases
In the first query, the data cardinality of the first by column is already pretty good, so the introduced changes don't make any difference.
In the second query, the first column "passenger_count" has very low cardinality (the values are in the range 0-10), so the data distribution between the buckets (bins) is quite unbalanced. After the changes, 14% of the buckets received more or less equal portions of data (quantile 0.86); although there are still a lot of really small buckets (quantile 0.1), performance has improved more than 2x.

H2O cases
Here each "id*" has values in the range of (0-10), so introducing the changes helps a lot in improving the data balance between the buckets:

Actually, the situation is identical for all of the h2o multi-column queries.



Our internal workload
Here we have "col1" having really low cardinality (range from 0-3), so the changes really helps to balance the buckets:




Scenario when NUM_CPUS=16
NYC-taxi cases
Again, the data cardinality of the first by column is really good, so the changes don't make a difference.
The data cardinality of 'passenger_count' is low (0-10); combined with unlucky sampling, this used to result in only 4 non-empty buckets and poor parallelization. The introduced changes generate many more non-empty buckets, helping to balance the load between the cores.

H2O cases
In comparison with the NUM_CPUS=112 case, the cardinality of the "id*" columns is just about right for the 16-core case, so the changes don't make any difference.
Our internal workload
Here we have "col1" having really low cardinality (range from 0-3), so the changes really helps to balance the buckets even for the 16-cores case:
Check-list
- flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py passes
- black --check modin/ asv_bench/benchmarks scripts/doc_checker.py passes
- commits are signed off (git commit -s)
- tests added and are passing (the existing multi-column groupby tests should cover the new functionality)
- docs/development/architecture.rst is up-to-date