BUG: Surprisingly large memory usage in groupby (but maybe I have the wrong mental model?) #37139
Here's a slightly more detailed view - taking a view is cheap (circa 0MB, as expected), but using the index from the groupby in a .loc lookup costs a further chunk of RAM.
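As a quick illustration of that distinction (an editor's sketch, not from the thread; sizes are arbitrary, and the view behaviour described applies to the pandas 1.x versions discussed here): a contiguous positional slice can share memory with the parent frame, while gathering rows by an arbitrary index materialises a new block.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1_000_000, 20)))

# A contiguous row slice is typically backed by a view of df's block,
# so it adds essentially no memory.
sliced = df.iloc[:100_000]
print(np.shares_memory(df[0].to_numpy(), sliced[0].to_numpy()))   # True: a view

# Gathering rows by an arbitrary index (what df.loc[gp0_indexes, cols] does
# below) copies the selected rows into a new block.
rows = np.random.randint(0, len(df), 100_000)
gathered = df.iloc[rows]
print(np.shares_memory(df[0].to_numpy(), gathered[0].to_numpy()))  # False: a copy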
I was able to verify these results with Scalene, a Python profiler that simultaneously tracks CPU execution time (split into Python & native), as well as memory consumption and (crucially here) copy volume. The copy volume column makes it clear that lines 30, 34, and 35 are doing a lot of copying, confirming @ianozsvald's hypothesis.

$ scalene dataframe_example2.py
Memory usage: ▅███▆▆▆▆▆▆▆▆▆ (max: 5706.79MB)
dataframe_example.py: % of time = 100.00% out of 9.81s.
 Line │ Time % │ Time % │ Sys │ Mem %  │   Net │ Memory usage  │  Copy  │
      │ Python │ native │  %  │ Python │  (MB) │ over time / % │ (MB/s) │ dataframe_example.py
━━━━━━┿━━━━━━━━┿━━━━━━━━┿━━━━━┿━━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ... │        │        │     │        │       │               │        │
    8 │        │        │     │        │       │               │        │ @profile
    9 │        │        │     │        │       │               │        │ def run():
   10 │        │        │     │        │       │               │        │     # make a big dataframe with an indicator column and lots of random data
   11 │        │        │     │        │       │               │        │     # use 20 columns to make it clear we have "a chunk of data"
   12 │        │        │     │        │       │               │        │     # float64 * 10M is 80M, for 20 rows this is 80M*20 circa 1,600MB
   13 │        │    29% │     │        │  1526 │ ▁             │        │     arr = np.random.random((SIZE, 20))
   14 │        │        │     │        │       │               │        │     print(f"{arr.shape} shape for our array")
   15 │        │        │     │        │       │               │        │     df = pd.DataFrame(arr)
   16 │        │        │     │        │       │               │        │     cols_to_agg = list(df.columns)  # get [0, 1, 2,...]
   17 │        │        │     │        │       │               │        │
   18 │        │        │     │        │       │               │        │     # (0, 10] range for 10 indicator ints for grouping
   19 │        │     3% │  0% │        │   154 │ ▁▁            │      8 │     df['indicator'] = np.random.randint(0, 10, SIZE)
   20 │        │        │     │        │       │               │        │     print("df.head():")
   21 │     1% │        │  0% │    85% │   -73 │ ▁▁▁▁▁▁▁       │        │     print(df.head())
   22 │        │        │     │   100% │     3 │ ▁▁▁           │        │     print("Memory usage:\n", df.memory_usage())
   23 │        │        │     │        │       │               │        │
   24 │        │        │     │        │       │               │        │     # calculate summary statistic across grouped rows by all columns
   25 │        │        │     │        │       │               │        │     gpby = df.groupby('indicator')
   26 │        │    19% │     │        │  1714 │ ▂▂▂▂▂         │        │     means = gpby.mean()
   27 │        │        │     │        │       │               │        │     print(f"Mean shape: {means.shape}")  # (10, 20) for 10 indicators and 20 columns
   28 │        │        │     │        │       │               │        │
   29 │     1% │     3% │     │        │   244 │ ▁▁▁▁▁▁▁▁▁     │        │     gp0_indexes = gpby.groups[0]
   30 │     2% │    20% │     │        │  1670 │ ▂▂▂▂▂▃▃▃▃     │     33 │     manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
   31 │        │        │     │        │       │               │        │     print(manual_lookup_mean.shape)
   32 │        │        │     │   100% │     1 │ ▁             │        │     np.testing.assert_allclose(manual_lookup_mean, means.loc[0])
   33 │        │        │     │        │       │               │        │
   34 │     1% │     8% │     │        │ -1625 │ ▁▁▁▁▁▁▁       │     16 │     gp0 = gpby.get_group(0)
   35 │     1% │     3% │  0% │        │   324 │ ▁▁▁▁▁         │     16 │     manual_lookup_mean2 = gp0[cols_to_agg].mean()
   36 │        │        │     │        │       │               │        │     np.testing.assert_allclose(manual_lookup_mean2, means.loc[0])
   37 │        │        │     │        │       │               │        │     #breakpoint()
   38 │        │        │     │        │       │               │        │     return df, gpby, means
  ... │        │        │     │        │       │               │        │
Thanks for investigating this @ianozsvald and @emeryberger! Contributions to avoid any unnecessary copying here would certainly be welcome.
This seems to be fixed by the new Copy-on-Write optimizations on main. More info on how to enable this can be found here.
Confirming that enabling Copy-on-Write indeed has a substantial impact on reducing memory consumption. Nicely done! If this bug report helped motivate this optimization in any way, please let us know!

Before: without Copy-on-Write (as before), peak memory consumption was 3.773GB, and line 34 consumed 1.671GB.

After: following the recommendations of the above-linked page, I added the following Copy-on-Write directives before running:

pd.set_option("mode.copy_on_write", True)
pd.options.mode.copy_on_write = True

Peak memory consumption drops to 2.63GB, and line 34 now consumes just 186MB (roughly 10% of what it consumed previously). The results of the computation appear to be identical.
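For reference, here is a self-contained sketch of that toggle (an editor's addition; SIZE is shrunk from the report's ~10M rows so it runs quickly, and either spelling of the option above works on pandas versions that support Copy-on-Write):

import numpy as np
import pandas as pd

# Enable Copy-on-Write globally before constructing any DataFrames.
pd.set_option("mode.copy_on_write", True)

SIZE = 1_000_000  # the original report used roughly 10M rows
df = pd.DataFrame(np.random.random((SIZE, 20)))
df["indicator"] = np.random.randint(0, 10, SIZE)
cols_to_agg = list(range(20))

gpby = df.groupby("indicator")
means = gpby.mean()

# The two selections that dominated the profile above (its lines 30 and 34).
manual_lookup_mean = df.loc[gpby.groups[0], cols_to_agg].mean()
gp0 = gpby.get_group(0)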
Whilst teaching "more efficient Pandas" I dug into memory usage in a groupby with memory_profiler and was surprised by the output (below). For a 1.6GB DataFrame of random data with an indicator column, a groupby on the indicator generates a result that doubles the RAM usage. I was surprised that the groupby would take a further 1.6GB during the operation when the result is a tiny DataFrame.

In this case there are 20 columns by 10 million rows of random floats, with a 21st column as an int indicator in the range [0, 9]. A groupby on this creates 10 groups, resulting in a mean groupby result of 10 rows by 20 columns. This works as expected.

The groupby operation, shown further below with memory_profiler, seems to make a copy of each group before performing a mean, so the total groupby costs a further 1.6GB. I'd have expected a light reference to be taken to the underlying data rather than (apparently, but maybe I read this incorrectly?) a copy. I've also taken out a single group in further lines of code and each group costs 1/10th of the RAM (160-200MB), which gives some further evidence that a copy is being taken.

Is my mental model wrong? Is it expected that a copy is taken of each group? Is there a way to run this code with a smaller total RAM envelope?
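One quick way to probe the "copy per group" question directly (an editor's sketch, not part of the original report; a smaller SIZE is used so the check runs fast):

import numpy as np
import pandas as pd

SIZE = 1_000_000
df = pd.DataFrame(np.random.random((SIZE, 20)))
df["indicator"] = np.random.randint(0, 10, SIZE)

gp0 = df.groupby("indicator").get_group(0)

# If get_group() handed back a lightweight view, the group's column 0 would
# share its buffer with the parent frame; if a copy was made (as the
# memory_profiler numbers suggest), this prints False.
print(np.shares_memory(df[0].to_numpy(), gp0[0].to_numpy()))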
Code Sample, a copy-pastable example
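The collapsed code sample is not reproduced in this capture; the sketch below reconstructs it from the Scalene listing earlier in the thread (SIZE is inferred from the "float64 * 10M" comment, and the @profile decorator is assumed to come from memory_profiler):

import numpy as np
import pandas as pd
from memory_profiler import profile

SIZE = 10_000_000  # inferred: "float64 * 10M is 80M" in the listing above


@profile
def run():
    # make a big dataframe with an indicator column and lots of random data;
    # 20 columns of float64 * 10M rows is circa 1,600MB
    arr = np.random.random((SIZE, 20))
    print(f"{arr.shape} shape for our array")
    df = pd.DataFrame(arr)
    cols_to_agg = list(df.columns)  # get [0, 1, 2, ...]

    # 10 indicator ints in [0, 10) for grouping
    df['indicator'] = np.random.randint(0, 10, SIZE)
    print("df.head():")
    print(df.head())
    print("Memory usage:\n", df.memory_usage())

    # calculate summary statistic across grouped rows for all columns
    gpby = df.groupby('indicator')
    means = gpby.mean()
    print(f"Mean shape: {means.shape}")  # (10, 20): 10 indicators, 20 columns

    gp0_indexes = gpby.groups[0]
    manual_lookup_mean = df.loc[gp0_indexes, cols_to_agg].mean()
    print(manual_lookup_mean.shape)
    np.testing.assert_allclose(manual_lookup_mean, means.loc[0])

    gp0 = gpby.get_group(0)
    manual_lookup_mean2 = gp0[cols_to_agg].mean()
    np.testing.assert_allclose(manual_lookup_mean2, means.loc[0])
    return df, gpby, means


if __name__ == "__main__":
    run()

Running it under memory_profiler (python -m memory_profiler dataframe_example.py) or Scalene, as above, should reproduce the per-line numbers quoted in this thread.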
Output of pd.show_versions()
In [3]: pd.show_versions()
INSTALLED VERSIONS
commit : 2a7d332
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.1-050801-generic
Version : #202008111432 SMP Tue Aug 11 14:34:42 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.1.2
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200917
Cython : None
pytest : 6.1.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.2
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2