FIX-#4641: Reindex pandas partitions in df.describe()
#4651
Conversation
@@ -1577,7 +1577,7 @@ def describe(self, **kwargs):

        def describe_builder(df, internal_indices=[]):
            """Apply `describe` function to the subset of columns in a single partition."""
-           return df.iloc[:, internal_indices].describe(**kwargs)
+           return df.iloc[:, internal_indices].describe(**kwargs).reindex(new_index)
There may be some performance implications here but I presume that it should be ok given that we're working at partition granularity?
Adding rows with pandas `reindex` to dataframes with many columns can be expensive. On my Mac with 16 GB of memory, here's pandas adding an extra row to a 1 x 2^24 frame that's about 125 MB:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(1,2 ** 24)))
%time df2 = df.reindex([0, 1])
The reindex takes 600 ms.
Still, the extra rows somehow have to go into the partitions that are missing them. I don't see a more efficient way to do that. I get similar performance if I do `df2.loc[-1] = np.NaN` to add the new row instead.
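For reference, the label-assignment alternative mentioned above would look roughly like this on the same hypothetical frame (timings will vary by machine):

df2 = df.copy()
%time df2.loc[-1] = np.NaN  # adds one all-NaN row by label; roughly the same cost as the reindex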
Thanks for running the quick benchmark as a sanity check. Yeah, we need the extra rows either way, and I don't know of an easier way of doing that than with `reindex` either. I think the time of the reindex will also grow pretty quickly with the number of additional rows we add. Hopefully it won't be terribly large in the case of `describe`.
We could potentially mitigate the performance hit by checking whether the index of `df.describe()` equals the desired index and only reindexing if it doesn't?
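A rough sketch of that guard, for illustration only (not what was merged; `kwargs` and `new_index` come from the enclosing `describe` as in the diff above):

def describe_builder(df, internal_indices=[]):
    """Apply `describe` to a column subset, aligning the index only when needed."""
    result = df.iloc[:, internal_indices].describe(**kwargs)
    # Skip the reindex when this partition already produced the full index.
    if result.index.equals(new_index):
        return result
    return result.reindex(new_index)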
@RehanSD is there a performance penalty for calling reindex if the index is already the same?
@RehanSD I thought about that, but in practice it looks like the time to `describe` outweighs the time to `reindex` with an equal index by a factor of over 1000:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(1,2 ** 12)))
%time df2 = df.describe()
%time df2 = df2.reindex(df2.index)
I get 4.56 seconds for the describe and 282 microseconds for the reindex, so I don't think the extra optimization is worth the extra complexity.
Yeah I just tried experimenting a bit with this case and came to the same conclusion.
Codecov Report
@@ Coverage Diff @@
## master #4651 +/- ##
==========================================
+ Coverage 86.47% 89.69% +3.22%
==========================================
Files 230 231 +1
Lines 18458 18734 +276
==========================================
+ Hits 15961 16803 +842
+ Misses 2497 1931 -566
Code itself looks good, but I wonder if there are other cases where some inconsistent indexing is biting us without any notice...
@pyrito Thank you, the fix looks good! I left some style comments.
Signed-off-by: Karthik Velayutham <[email protected]>
LGTM!
# The index of the resulting dataframe is the same amongst all partitions
# when dealing with only numerical data. However, if we work with columns
# that contain strings, we will get extra values in our result index such as
# 'unique', 'top', and 'freq'. Since we call describe() on each partition,
# we can have cases where certain partitions do not contain any of the
# object string data. Thus, we add an extra string column to make sure
# that we are setting the index correctly for all partitions.
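For illustration, a small pandas-only example (hypothetical data) of the index mismatch this comment describes: a purely numeric partition and a string partition produce different describe indexes.

import pandas as pd

numeric_part = pd.DataFrame({"a": [1, 2, 3]})
string_part = pd.DataFrame({"b": ["x", "y", "y"]})

# Numeric-only data yields count/mean/std/min/25%/50%/75%/max
print(numeric_part.describe().index.tolist())
# Object data yields count/unique/top/freq
print(string_part.describe().index.tolist())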
shouldn't this comment be in the implementation (not in a test)?..
Maybe it's best for most of the comment to go in the implementation, and to keep a short note here saying that we're testing a case where different partitions have different `describe` rows.
Yeah makes sense.
LGTM, thanks!
Left a quick question; besides that, looks good to me!
Signed-off-by: Karthik Velayutham <[email protected]>
What do these changes do?
`df.describe(include='all')` can sometimes encounter cases where partitions have different indexes as a result of having different data types (e.g. categorical data returns different summaries as opposed to numerical data). Since we already know what the index of the final DataFrame should look like, we can set the correct index for each partition via `reindex`. Special thanks to @mvashishtha and @RehanSD for helping me understand what was going on in the lower levels of Modin. This was a fun one to track down.
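A hedged reproduction of the scenario this fixes (hypothetical column names; whether the columns land in separate partitions depends on the engine and partitioning configuration):

import modin.pandas as pd

df = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "b"]})
# With include='all', the result index is the union of the numeric and object
# summaries; every column partition must be aligned to that full index.
print(df.describe(include="all"))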
- flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- git commit -s
- docs/development/architecture.rst is up-to-date