PERF: pandas.core.sorting.compress_group_index for already sorted values #53806

Merged (2 commits) on Jun 23, 2023

lukemanley
Member

This improves performance for a number of MultiIndex/multi-column operations (e.g. sorting, groupby, unstack) where the index/column values are already sorted.

import pandas as pd
import numpy as np

mi = pd.MultiIndex.from_product([range(1000), range(1000)], names=["A", "B"])
ser = pd.Series(np.random.randn(len(mi)), index=mi)
df = ser.to_frame("value").reset_index()

%timeit df.sort_values(["A", "B"])
# 274 ms ± 7.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    -> main
# 145 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  -> PR

%timeit ser.sort_index()
# 267 ms ± 33.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    -> main
# 104 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  -> PR

%timeit ser.groupby(["A", "B"]).size()
# 302 ms ± 27.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    -> main
# 154 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  -> PR

%timeit ser.unstack()
# 274 ms ± 6.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    -> main
# 128 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  -> PR
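
The gain above comes from treating already sorted group keys as a special case. As a rough illustration only (this is not the code from the PR, and `compress_ids_sketch` is a hypothetical name), a sorted array of integer group keys can be compressed to dense ids with a single comparison pass and a cumulative sum, with no hashing. The fallback branch uses `np.unique` purely as a stand-in for the general path; it is an assumption of this sketch, not a description of pandas internals.

```python
import numpy as np


def compress_ids_sketch(group_index):
    """Hypothetical helper: map integer group keys to dense ids 0..k-1.

    Returns (comp_ids, obs_group_ids) such that
    obs_group_ids[comp_ids] reproduces group_index.
    """
    group_index = np.asarray(group_index, dtype=np.int64)

    if len(group_index) and np.all(group_index[1:] >= group_index[:-1]):
        # Sorted input: a new group starts wherever a value differs
        # from its predecessor (the first element always starts one).
        is_new = np.concatenate(([True], group_index[1:] != group_index[:-1]))
        comp_ids = np.cumsum(is_new) - 1
        obs_group_ids = group_index[is_new]
    else:
        # Stand-in for the general (unsorted) path; assumption of this sketch.
        obs_group_ids, comp_ids = np.unique(group_index, return_inverse=True)

    return comp_ids, obs_group_ids


sorted_keys = np.array([0, 0, 1, 3, 3, 3])
comp_ids, obs_group_ids = compress_ids_sketch(sorted_keys)
print(comp_ids)       # dense ids for each row
print(obs_group_ids)  # the distinct keys that were observed
```

In both branches the two returned arrays satisfy `obs_group_ids[comp_ids] == group_index`, which is the property a caller would rely on.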

Checked a few existing ASVs:

asv continuous -f 1.1 upstream/main compress-group-index -b ^reshape

       before           after         ratio
     <main>           <compress-group-index>
-        16.1±1ms       12.9±0.5ms     0.80  reshape.ReshapeExtensionDtype.time_stack('Period[s]')
-     3.65±0.08ms       1.96±0.2ms     0.54  reshape.ReshapeExtensionDtype.time_unstack_fast('Period[s]')
-      3.72±0.2ms       1.93±0.1ms     0.52  reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')

asv continuous -f 1.1 upstream/main compress-group-index -b ^multiindex_object

       before           after         ratio
     <main>           <compress-group-index>
-        44.5±4ms         37.1±2ms     0.84  multiindex_object.SetOperations.time_operation('monotonic', 'string', 'symmetric_difference', False)
-      59.8±0.8ms         41.9±3ms     0.70  multiindex_object.SetOperations.time_operation('monotonic', 'int', 'intersection', None)
-        60.7±2ms         41.7±2ms     0.69  multiindex_object.SetOperations.time_operation('monotonic', 'ea_int', 'intersection', None)
-      65.2±0.9ms         43.7±3ms     0.67  multiindex_object.SetOperations.time_operation('monotonic', 'datetime', 'intersection', None)
-      46.9±0.7ms         31.0±3ms     0.66  multiindex_object.SetOperations.time_operation('monotonic', 'int', 'union', None)
-      8.81±0.8ms       5.28±0.3ms     0.60  multiindex_object.Difference.time_difference('datetime')
-        53.8±3ms         32.0±4ms     0.60  multiindex_object.SetOperations.time_operation('monotonic', 'ea_int', 'union', None)
-      7.97±0.4ms       4.74±0.4ms     0.59  multiindex_object.Difference.time_difference('int')
-        52.4±3ms         29.8±2ms     0.57  multiindex_object.SetOperations.time_operation('monotonic', 'datetime', 'union', None)

asv continuous -f 1.1 upstream/main compress-group-index -b ^join_merge.MergeMultiIndex

       before           after         ratio
     <main>           <compress-group-index>
-         183±3ms          159±2ms     0.87  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'outer')
-         177±4ms          147±2ms     0.84  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'outer')
-        206±10ms          167±3ms     0.81  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'outer')

@lukemanley added the Performance (memory or execution speed performance) and MultiIndex labels on Jun 22, 2023
@lukemanley added this to the 2.1 milestone on Jun 22, 2023
@mroeschke merged commit 3a8f354 into pandas-dev:main on Jun 23, 2023
@mroeschke
Member

Nice find, thanks @lukemanley

Is there a further fastpath if group_index is completely unique (like a np.arange array)?
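
For context, one way such a check could look is sketched below. This is a hypothetical illustration, not something proposed or benchmarked in this thread: `unique_fastpath_sketch` is an invented name, and whether the extra branch would pay off in practice is not established here. The idea is that if the keys are strictly increasing, every row is its own group, so the dense ids are just `0..n-1` and the observed keys are the input itself.

```python
import numpy as np


def unique_fastpath_sketch(group_index):
    """Hypothetical check for strictly increasing (hence unique) keys.

    Returns (comp_ids, obs_group_ids) when the shortcut applies,
    otherwise None so a caller can fall back to another path.
    """
    group_index = np.asarray(group_index, dtype=np.int64)

    if len(group_index) and np.all(group_index[1:] > group_index[:-1]):
        # Strictly increasing: each value is a distinct group, in order.
        comp_ids = np.arange(len(group_index), dtype=np.int64)
        return comp_ids, group_index

    return None  # duplicates or unsorted input: shortcut does not apply


print(unique_fastpath_sketch(np.arange(5)))
print(unique_fastpath_sketch(np.array([0, 0, 1])))
```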

@lukemanley deleted the compress-group-index branch on July 1, 2023 00:37
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
…ues (pandas-dev#53806)

* PERF: pandas.core.sorting.compress_group_index for already sorted values

* whatsnew