[FEA] Help with workflow that reaches column size limit #10836
Comments
I believe you are already aware of the issue. As you say, the number of elements in a column is limited to roughly 2 billion (cudf's 32-bit size_type range). The solution is to use dask-cudf, which will partition your input data into smaller pieces, each of which fits within the size limit. dask-cudf handles the orchestration of executing operations across all of your partitions. The benefit of using dask-cudf is that the same code that helps you overcome the 2GB size limit today will enable you to easily scale up to 20TB.
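For reference, a minimal sketch of that approach (the file paths and column names here are hypothetical):

```python
import dask_cudf

# Each partition is an ordinary cudf DataFrame that individually fits
# within the per-column size limit.
transactions = dask_cudf.read_parquet("data/transactions_*.parquet")
customers = dask_cudf.read_parquet("data/customers.parquet")

# The same cudf-style operations are written once and run on every partition.
joined = transactions.merge(customers, on="customer_id", how="left")

# Execution is lazy; writing the result (or calling .compute()) triggers
# the work across all partitions.
joined.to_parquet("output/joined/")
```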
This issue has been labeled
For more information see #3958
I want to use cudf to load multiple large parquets into a python workflow. Once combined, I need to be able to melt, pivot, reshape, and join the combined gdfs. The dataset I am working with has a customer name column (a string field with 10-80 characters per row), and my combined dataset is already 500M rows long and will get longer.
Below is the runtime error I am getting when trying to melt. The gdf I am trying to melt that throws the following error is only 2.2M rows.
""Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1072-aws x86_64v)
(rapids-22.04) ubuntu@ip-10-195-51-131:~/holding$ python3 cudf_test.py
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/column/column.py", line 2401, in concat_columns
col = libcudf.concat.concat_columns(objs)
File "cudf/_lib/concat.pyx", line 37, in cudf._lib.concat.concat_columns
File "cudf/_lib/concat.pyx", line 41, in cudf._lib.concat.concat_columns
RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/copying/concatenate.cu:391: Total number of concatenated chars exceeds size_type range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "cudf_test.py", line 71, in
df_fpl_fetch = cudf_melt(frame=df_fpl_fetch, id_vars=index_col, var_name='sql_field',value_name='amount')
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/reshape.py", line 556, in melt
mdata = {col: _tile(frame[col], K) for col in id_vars}
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/reshape.py", line 556, in
mdata = {col: _tile(frame[col], K) for col in id_vars}
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/reshape.py", line 551, in _tile
return cudf.Series._concat(objs=series_list, index=None)
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/series.py", line 1295, in _conca t
col = concat_columns([o._column for o in objs])
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/column/column.py", line 2404, in concat_columns
raise OverflowError(
OverflowError: total size of output is too large for a cudf column
Describe the solution you'd like
Ideally, strings could be temporarily converted to a unique int64 alias, with the pairing of each int64 alias to its string value maintained in a 2-column frame. When I'm finished manipulating my dataset, I could do a join/row update to replace the int64 alias with the actual string that was originally in my dataset. This could be a last step before saving the transformed gdf to parquet or inserting it into a db table.
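A sketch of that alias-and-join-back pattern with today's API, assuming hypothetical column names (the mapping frame holds each unique string only once, so it stays far below the limit):

```python
import cudf

df = cudf.DataFrame({
    "customer_name": ["Acme Corp", "Globex", "Acme Corp"],
    "amount": [10.0, 20.0, 30.0],
})

# Build the 2-column mapping: one int64 alias per unique string.
mapping = (
    df[["customer_name"]]
    .drop_duplicates()
    .reset_index(drop=True)
    .reset_index()
    .rename(columns={"index": "customer_alias"})
)
mapping["customer_alias"] = mapping["customer_alias"].astype("int64")

# Swap the string column for its alias before the heavy reshaping.
df = df.merge(mapping, on="customer_name", how="left").drop(columns=["customer_name"])

# ... melt / pivot / join on the compact, all-numeric frame ...

# Last step: join the strings back in before writing to parquet or a db table.
df = df.merge(mapping, on="customer_alias", how="left")
```

cudf's categorical dtype (`astype("category")` plus `.cat.codes`) gives a similar integer alias for free, if keeping the mapping in a separate frame isn't required.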
Describe alternatives you've considered
I am struggling to understand the issue; all I know is that cudf has a 2B limit, and that it includes unique string characters. Even if I can find a workaround for this long string field, my dataset will be longer than 2B rows with a full run, so I probably also need to understand whether handling this long string column will be meaningless if I run into a more fundamental gdf limit on more standard int/float columns.
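To make the limit concrete: cudf uses a 32-bit size_type, so a single column caps out at roughly 2.1 billion rows, and a single string column additionally caps out at roughly 2.1 billion total characters (the "concatenated chars exceeds size_type range" error above). A quick check along these lines (column name hypothetical) shows which limit a given frame is approaching:

```python
import cudf

SIZE_TYPE_MAX = 2**31 - 1  # cudf's 32-bit size_type limit (~2.1 billion)

df = cudf.read_parquet("data/part_0.parquet")  # hypothetical input

n_rows = len(df)
# str.len() counts characters; for mostly-ASCII names this approximates the
# byte total that the string column's offsets must be able to index.
total_chars = int(df["customer_name"].str.len().sum())

print(f"rows: {n_rows:,} of {SIZE_TYPE_MAX:,}")
print(f"string chars: {total_chars:,} of {SIZE_TYPE_MAX:,}")
# Either count crossing the size_type limit in a single column (or a single
# concatenation, as in the melt above) raises the OverflowError shown earlier.
```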
Additional context
I need a solution that allows me to use EC2 and python to handle large datasets. cudf has worked wonders in the past, but that was a 4-column time series gdf. If pandas can outperform cudf on a large dataset melt, then the overhead of using the GPU and unique cudf syntax isn't a solution. :(
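One way to keep the melt on the GPU despite the limit, sketched here with hypothetical file and column names (mirroring the call in the traceback), is to run cudf's melt per partition with dask-cudf; melt is row-wise, so each partition can be reshaped independently:

```python
import cudf
import dask_cudf

index_col = ["customer_name", "period"]  # hypothetical id columns

ddf = dask_cudf.read_parquet("data/fpl_fetch_*.parquet")

# Apply the cudf melt to each partition; no single partition ever has to
# concatenate more characters than size_type allows.
melted = ddf.map_partitions(
    cudf.melt,
    id_vars=index_col,
    var_name="sql_field",
    value_name="amount",
)

melted.to_parquet("output/melted/")
```

Depending on the dask version, an explicit `meta=` argument to `map_partitions` may be needed if the output schema cannot be inferred automatically.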