-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#4641: Reindex pandas partitions in df.describe()
#4651
Changes from 4 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -181,6 +181,21 @@ def test_2195(datetime_is_numeric, has_numeric_column): | |
) | ||
|
||
|
||
# Issue: https://github.com/modin-project/modin/issues/4641 | ||
def test_describe_column_partition_has_different_index(): | ||
pandas_df = pandas.DataFrame(test_data["int_data"]) | ||
# The index of the resulting dataframe is the same amongst all partitions | ||
# when dealing with only numerical data. However, if we work with columns | ||
# that contain strings, we will get extra values in our result index such as | ||
# 'unique', 'top', and 'freq'. Since we call describe() on each partition, | ||
# we can have cases where certain partitions do not contain any of the | ||
# object string data. Thus, we add an extra string column to make sure | ||
# that we are setting the index correctly for all partitions. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. shouldn't this comment be in the implementation (not in a test)?.. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe it's best for most of the comment to go in the implementation, and to keep a short note here saying that we're testing a case where different partitions have different There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah makes sense. |
||
pandas_df["string_column"] = "abc" | ||
modin_df = pd.DataFrame(pandas_df) | ||
eval_general(modin_df, pandas_df, lambda df: df.describe(include="all")) | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"exclude,include", | ||
[ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be some performance implications here but I presume that it should be ok given that we're working at partition granularity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding rows with pandas
reindex
to dataframes with many columns can be expensive. On my mac with 16 GB memory, here's pandas adding an extra row to a 1 x 2^24 frame that's about 125 MB:The reindex takes 600 ms.
Still, the extra rows somehow have to go into the partitions that are missing them. I don't see a more efficient way to do that. I get similar performance if I do
df2.loc[-1] = np.NaN
to add the new row instead.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for running the quick benchmark as a sanity. Yeah, we need the extra rows either way, and I don't know of an easier way of doing that then with
reindex
either. I think the time of the reindex will also get larger pretty quickly with the number of additional rows we add. Hopefully it won't be terribly large in the case ofdescribe
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could potentially mitigate the performance hit by checking if the index of df.describe equals the desired index and only reinforcing if it doesn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RehanSD is there a performance penalty for calling reindex if the index is already the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RehanSD I thought about that, but in practice it looks like the time to
describe
outweights the time toreindex
with an equal index by a factor of over 1000:I get 4.56 seconds for the describe and 282 microseconds for the reindex, so I don't think the extra optimization is worth the extra complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I just tried experimenting a bit with this case and came to the same conclusion.