FIX-#4641: Reindex pandas partitions in df.describe()
#4651
Conversation
@@ -1577,7 +1577,7 @@ def describe(self, **kwargs):

        def describe_builder(df, internal_indices=[]):
            """Apply `describe` function to the subset of columns in a single partition."""
-           return df.iloc[:, internal_indices].describe(**kwargs)
+           return df.iloc[:, internal_indices].describe(**kwargs).reindex(new_index)
There may be some performance implications here but I presume that it should be ok given that we're working at partition granularity?
Adding rows with pandas `reindex` to dataframes with many columns can be expensive. On my Mac with 16 GB of memory, here's pandas adding an extra row to a 1 x 2^24 frame that's about 125 MB:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(1,2 ** 24)))
%time df2 = df.reindex([0, 1])
The reindex takes 600 ms.
Still, the extra rows somehow have to go into the partitions that are missing them. I don't see a more efficient way to do that. I get similar performance if I do `df2.loc[-1] = np.NaN` to add the new row instead.
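For reference, the label-assignment alternative mentioned above would look roughly like this on the same hypothetical frame (timings will vary by machine):

df2 = df.copy()
%time df2.loc[-1] = np.NaN  # adds one all-NaN row by label; roughly the same cost as the reindex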
Thanks for running the quick benchmark as a sanity check. Yeah, we need the extra rows either way, and I don't know of an easier way of doing that than with `reindex` either. I think the time of the reindex will also grow pretty quickly with the number of additional rows we add. Hopefully it won't be terribly large in the case of `describe`.
We could potentially mitigate the performance hit by checking whether the index of `df.describe()` equals the desired index and only reindexing if it doesn't?
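A rough sketch of that guard, for illustration only (not what was merged; `kwargs` and `new_index` come from the enclosing `describe` as in the diff above):

def describe_builder(df, internal_indices=[]):
    """Apply `describe` to a column subset, aligning the index only when needed."""
    result = df.iloc[:, internal_indices].describe(**kwargs)
    # Skip the reindex when this partition already produced the full index.
    if result.index.equals(new_index):
        return result
    return result.reindex(new_index)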
@RehanSD is there a performance penalty for calling reindex if the index is already the same?
@RehanSD I thought about that, but in practice it looks like the time to `describe` outweighs the time to `reindex` with an equal index by a factor of over 1000:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(1,2 ** 12)))
%time df2 = df.describe()
%time df2 = df2.reindex(df2.index)
I get 4.56 seconds for the describe and 282 microseconds for the reindex, so I don't think the extra optimization is worth the extra complexity.
Yeah I just tried experimenting a bit with this case and came to the same conclusion.
Codecov Report
@@ Coverage Diff @@
## master #4651 +/- ##
==========================================
+ Coverage 86.47% 89.69% +3.22%
==========================================
Files 230 231 +1
Lines 18458 18734 +276
==========================================
+ Hits 15961 16803 +842
+ Misses 2497 1931 -566
Code itself looks good, but I wonder if there are other cases where some inconsistent indexing is biting us without any notice...
@pyrito Thank you, the fix looks good! I left some style comments.
Signed-off-by: Karthik Velayutham <[email protected]>
LGTM!
# The index of the resulting dataframe is the same amongst all partitions
# when dealing with only numerical data. However, if we work with columns
# that contain strings, we will get extra values in our result index such as
# 'unique', 'top', and 'freq'. Since we call describe() on each partition,
# we can have cases where certain partitions do not contain any of the
# object string data. Thus, we add an extra string column to make sure
# that we are setting the index correctly for all partitions.
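For illustration, a small pandas-only example (hypothetical data) of the index mismatch this comment describes: a purely numeric partition and a string partition produce different describe indexes.

import pandas as pd

numeric_part = pd.DataFrame({"a": [1, 2, 3]})
string_part = pd.DataFrame({"b": ["x", "y", "y"]})

# Numeric-only data yields count/mean/std/min/25%/50%/75%/max
print(numeric_part.describe().index.tolist())
# Object data yields count/unique/top/freq
print(string_part.describe().index.tolist())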
shouldn't this comment be in the implementation (not in a test)?..
Maybe it's best for most of the comment to go in the implementation, and to keep a short note here saying that we're testing a case where different partitions have different `describe` rows.
Yeah makes sense.
LGTM, thanks!
Left a quick question; besides that, looks good to me!
Signed-off-by: Karthik Velayutham <[email protected]>
What do these changes do?
`df.describe(include='all')` can sometimes encounter cases where partitions have different indexes as a result of having different data types (e.g. categorical data returns different summaries as opposed to numerical data). Since we already know what the index of the final DataFrame should look like, we can set the correct index for each partition via `reindex`. Special thanks to @mvashishtha and @RehanSD for helping me understand what was going on in the lower levels of Modin. This was a fun one to track down.
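A hedged reproduction of the scenario this fixes (hypothetical column names; whether the columns land in separate partitions depends on the engine and partitioning configuration):

import modin.pandas as pd

df = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "b"]})
# With include='all', the result index is the union of the numeric and object
# summaries; every column partition must be aligned to that full index.
print(df.describe(include="all"))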
- flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- git commit -s
- docs/development/architecture.rst is up-to-date