Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#4641: Reindex pandas partitions in df.describe() #4651

Merged
merged 5 commits into from
Jul 8, 2022

Conversation

pyrito
Copy link
Collaborator

@pyrito pyrito commented Jul 7, 2022

What do these changes do?

df.describe(include='all') can sometimes encounter cases where partitions have different indexes as a result of having different data types (e.g. categorial data returns different summaries as opposed to numerical data). Since we already know what the index of the final DataFrame should look like, we can set the correct index for the partition via reindex. Special thanks to @mvashishtha and @RehanSD for helping me understand what was going on in the lower-levels of Modin. This was a fun one to track down.

  • commit message follows format outlined here
  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves df.describe(include='all') generates error #4641
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date
  • added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

@pyrito pyrito requested a review from a team as a code owner July 7, 2022 02:59
@@ -1577,7 +1577,7 @@ def describe(self, **kwargs):

def describe_builder(df, internal_indices=[]):
"""Apply `describe` function to the subset of columns in a single partition."""
return df.iloc[:, internal_indices].describe(**kwargs)
return df.iloc[:, internal_indices].describe(**kwargs).reindex(new_index)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be some performance implications here but I presume that it should be ok given that we're working at partition granularity?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding rows with pandas reindex to dataframes with many columns can be expensive. On my mac with 16 GB memory, here's pandas adding an extra row to a 1 x 2^24 frame that's about 125 MB:

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(1,2 ** 24)))
%time df2 = df.reindex([0, 1])

The reindex takes 600 ms.

Still, the extra rows somehow have to go into the partitions that are missing them. I don't see a more efficient way to do that. I get similar performance if I do df2.loc[-1] = np.NaN to add the new row instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for running the quick benchmark as a sanity. Yeah, we need the extra rows either way, and I don't know of an easier way of doing that then with reindex either. I think the time of the reindex will also get larger pretty quickly with the number of additional rows we add. Hopefully it won't be terribly large in the case of describe.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could potentially mitigate the performance hit by checking if the index of df.describe equals the desired index and only reinforcing if it doesn't?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RehanSD is there a performance penalty for calling reindex if the index is already the same?

Copy link
Collaborator

@mvashishtha mvashishtha Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RehanSD I thought about that, but in practice it looks like the time to describe outweights the time to reindex with an equal index by a factor of over 1000:

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(1,2 ** 12)))
%time df2 = df.describe()
%time df2 = df2.reindex(df2.index)

I get 4.56 seconds for the describe and 282 microseconds for the reindex, so I don't think the extra optimization is worth the extra complexity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I just tried experimenting a bit with this case and came to the same conclusion.

@codecov
Copy link

codecov bot commented Jul 7, 2022

Codecov Report

Merging #4651 (49cfa17) into master (05933a5) will increase coverage by 3.22%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4651      +/-   ##
==========================================
+ Coverage   86.47%   89.69%   +3.22%     
==========================================
  Files         230      231       +1     
  Lines       18458    18734     +276     
==========================================
+ Hits        15961    16803     +842     
+ Misses       2497     1931     -566     
Impacted Files Coverage Δ
...odin/core/storage_formats/pandas/query_compiler.py 96.08% <100.00%> (ø)
modin/experimental/batch/test/test_pipeline.py 100.00% <0.00%> (ø)
modin/pandas/base.py 95.30% <0.00%> (+0.08%) ⬆️
modin/experimental/xgboost/xgboost.py 87.61% <0.00%> (+0.95%) ⬆️
...mentations/pandas_on_ray/partitioning/partition.py 93.57% <0.00%> (+1.83%) ⬆️
...tations/pandas_on_python/partitioning/partition.py 93.75% <0.00%> (+2.08%) ⬆️
modin/config/envvars.py 89.10% <0.00%> (+3.46%) ⬆️
...dataframe/pandas/partitioning/partition_manager.py 90.52% <0.00%> (+3.50%) ⬆️
...entations/pandas_on_dask/partitioning/partition.py 91.46% <0.00%> (+6.09%) ⬆️
modin/error_message.py 89.36% <0.00%> (+6.38%) ⬆️
... and 16 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code itself looks good, but I wonder if there are other cases where some inconsistent indexing is biting us without any notice...

Copy link
Collaborator

@mvashishtha mvashishtha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pyrito Thank you, the fix looks good! I left some style comments.

@@ -1577,7 +1577,7 @@ def describe(self, **kwargs):

def describe_builder(df, internal_indices=[]):
"""Apply `describe` function to the subset of columns in a single partition."""
return df.iloc[:, internal_indices].describe(**kwargs)
return df.iloc[:, internal_indices].describe(**kwargs).reindex(new_index)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding rows with pandas reindex to dataframes with many columns can be expensive. On my mac with 16 GB memory, here's pandas adding an extra row to a 1 x 2^24 frame that's about 125 MB:

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(1,2 ** 24)))
%time df2 = df.reindex([0, 1])

The reindex takes 600 ms.

Still, the extra rows somehow have to go into the partitions that are missing them. I don't see a more efficient way to do that. I get similar performance if I do df2.loc[-1] = np.NaN to add the new row instead.

modin/pandas/test/dataframe/test_reduce.py Outdated Show resolved Hide resolved
modin/pandas/test/dataframe/test_reduce.py Outdated Show resolved Hide resolved
modin/pandas/test/dataframe/test_reduce.py Outdated Show resolved Hide resolved
modin/pandas/test/dataframe/test_reduce.py Outdated Show resolved Hide resolved
modin/pandas/test/dataframe/test_reduce.py Outdated Show resolved Hide resolved
modin/pandas/test/dataframe/test_reduce.py Outdated Show resolved Hide resolved
@pyrito pyrito requested a review from mvashishtha July 7, 2022 23:05
mvashishtha
mvashishtha previously approved these changes Jul 7, 2022
Copy link
Collaborator

@mvashishtha mvashishtha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment on lines 187 to 193
# The index of the resulting dataframe is the same amongst all partitions
# when dealing with only numerical data. However, if we work with columns
# that contain strings, we will get extra values in our result index such as
# 'unique', 'top', and 'freq'. Since we call describe() on each partition,
# we can have cases where certain partitions do not contain any of the
# object string data. Thus, we add an extra string column to make sure
# that we are setting the index correctly for all partitions.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this comment be in the implementation (not in a test)?..

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's best for most of the comment to go in the implementation, and to keep a short note here saying that we're testing a case where different partitions have different describe rows.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah makes sense.

Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Collaborator

@RehanSD RehanSD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a quick question, besides that, looks good to me!

@mvashishtha mvashishtha merged commit 3578288 into modin-project:master Jul 8, 2022
YarShev pushed a commit that referenced this pull request Sep 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

df.describe(include='all') generates error
5 participants