FIX-#4641: Reindex pandas partitions in df.describe() #4651

Merged 5 commits on Jul 8, 2022

1 change: 1 addition & 0 deletions docs/release_notes/release_notes-0.16.0.rst
@@ -18,6 +18,7 @@ Key Features and Updates
* FIX-#4593: Ensure Modin warns when setting columns via attributes (#4621)
* FIX-#4584: Enable pdb debug when running cloud tests (#4585)
* FIX-#4564: Workaround import issues in Ray: auto-import pandas on python start if env var is set (#4603)
* FIX-#4641: Reindex pandas partitions in `df.describe()` (#4651)
* Performance enhancements
* PERF-#4182: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
* PERF-#4288: Improve perf of `groupby.mean` for narrow data (#4591)
2 changes: 1 addition & 1 deletion modin/core/storage_formats/pandas/query_compiler.py
@@ -1577,7 +1577,7 @@ def describe(self, **kwargs):

def describe_builder(df, internal_indices=[]):
"""Apply `describe` function to the subset of columns in a single partition."""
return df.iloc[:, internal_indices].describe(**kwargs)
return df.iloc[:, internal_indices].describe(**kwargs).reindex(new_index)
Collaborator Author:
There may be some performance implications here, but I presume it should be OK given that we're working at partition granularity?
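
For illustration only, here is a minimal standalone sketch (not Modin's actual partition machinery) of why per-column-partition describe() results can have mismatched row indices, and how reindexing each result to a shared target index (the role new_index plays in the change above) aligns them. The full_index computed below is an assumption for the sketch; the real new_index comes from the surrounding describe() implementation.

import pandas as pd

# Two column "partitions" of the same frame: one numeric, one object.
numeric_part = pd.DataFrame({"a": [1, 2, 3]})
object_part = pd.DataFrame({"b": ["x", "y", "y"]})

# describe() produces different row indices per partition:
#   numeric -> count, mean, std, min, 25%, 50%, 75%, max
#   object  -> count, unique, top, freq
num_desc = numeric_part.describe(include="all")
obj_desc = object_part.describe(include="all")

# Hypothetical stand-in for new_index: the index describe() would produce
# on the full frame.
full_index = (
    pd.concat([numeric_part, object_part], axis=1)
    .describe(include="all")
    .index
)

# Reindexing each partition's result to the shared index lets the pieces be
# concatenated column-wise without misaligned or missing rows.
aligned = pd.concat(
    [num_desc.reindex(full_index), obj_desc.reindex(full_index)], axis=1
)
print(aligned)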

Collaborator:
Adding rows with pandas reindex to dataframes with many columns can be expensive. On my Mac with 16 GB of memory, here's pandas adding an extra row to a 1 x 2^24 frame that's about 125 MB:

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(1,2 ** 24)))
%time df2 = df.reindex([0, 1])

The reindex takes 600 ms.

Still, the extra rows somehow have to go into the partitions that are missing them. I don't see a more efficient way to do that. I get similar performance if I do df2.loc[-1] = np.NaN to add the new row instead.

Collaborator Author:
Thanks for running the quick benchmark as a sanity check. Yeah, we need the extra rows either way, and I don't know of an easier way of doing that than with reindex either. I think the reindex time will also grow fairly quickly with the number of additional rows we add. Hopefully it won't be terribly large in the case of describe.

Collaborator:
We could potentially mitigate the performance hit by checking whether the index of df.describe equals the desired index and only reindexing if it doesn't?
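
A rough sketch of that idea, purely hypothetical, assuming the kwargs and new_index names from the enclosing describe() implementation shown in the diff above (the follow-up below measures why the extra check wasn't considered worth it):

def describe_builder(df, internal_indices=[]):
    """Apply `describe` to a subset of columns, aligning rows to the shared index."""
    result = df.iloc[:, internal_indices].describe(**kwargs)
    # Only pay for the reindex when this partition's result is missing rows.
    if not result.index.equals(new_index):
        result = result.reindex(new_index)
    return result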

Collaborator Author:
@RehanSD is there a performance penalty for calling reindex if the index is already the same?

Collaborator (@mvashishtha), Jul 8, 2022:
@RehanSD I thought about that, but in practice it looks like the time to describe outweighs the time to reindex with an equal index by a factor of over 1000:

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(1,2 ** 12)))
%time df2 = df.describe()
%time df2 = df2.reindex(df2.index)

I get 4.56 seconds for the describe and 282 microseconds for the reindex, so I don't think the extra optimization is worth the extra complexity.

Collaborator Author:
Yeah I just tried experimenting a bit with this case and came to the same conclusion.


return self.__constructor__(
self._modin_frame.apply_full_axis_select_indices(
15 changes: 15 additions & 0 deletions modin/pandas/test/dataframe/test_reduce.py
@@ -181,6 +181,21 @@ def test_2195(datetime_is_numeric, has_numeric_column):
)


# Issue: https://github.com/modin-project/modin/issues/4641
def test_describe_column_partition_has_different_index():
pandas_df = pandas.DataFrame(test_data["int_data"])
# The index of the resulting dataframe is the same amongst all partitions
# when dealing with only numerical data. However, if we work with columns
# that contain strings, we will get extra values in our result index such as
# 'unique', 'top', and 'freq'. Since we call describe() on each partition,
# we can have cases where certain partitions do not contain any of the
# object string data. Thus, we add an extra string column to make sure
# that we are setting the index correctly for all partitions.
Collaborator:
Shouldn't this comment be in the implementation (not in a test)?

Collaborator:
Maybe it's best for most of the comment to go in the implementation, and to keep a short note here saying that we're testing a case where different partitions have different describe rows.

Collaborator Author:
Yeah makes sense.

pandas_df["string_column"] = "abc"
modin_df = pd.DataFrame(pandas_df)
eval_general(modin_df, pandas_df, lambda df: df.describe(include="all"))


@pytest.mark.parametrize(
"exclude,include",
[