[FEA] Help with workflow that reaches column size limit #10836

Closed
pretzelpy opened this issue May 12, 2022 · 3 comments
Labels
feature request New feature or request

Comments

@pretzelpy

I want to use cudf to load multiple large parquets into a Python workflow. Once combined, I need to be able to melt, pivot, reshape, and join the combined gdfs. The datasets I am working with have a customer name (a string field with 10-80 characters per row), and my combined dataset is already 500M rows long and will get longer.

Below is the runtime error I am getting when trying to melt. The gdf I am trying to melt that throws the following error is only 2.2M rows.

""Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1072-aws x86_64v)
(rapids-22.04) ubuntu@ip-10-195-51-131:~/holding$ python3 cudf_test.py

Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/column/column.py", line 2401, in concat_columns
col = libcudf.concat.concat_columns(objs)
File "cudf/_lib/concat.pyx", line 37, in cudf._lib.concat.concat_columns
File "cudf/_lib/concat.pyx", line 41, in cudf._lib.concat.concat_columns
RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/copying/concatenate.cu:391: Total number of concatenated chars exceeds size_type range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "cudf_test.py", line 71, in
df_fpl_fetch = cudf_melt(frame=df_fpl_fetch, id_vars=index_col, var_name='sql_field',value_name='amount')
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/reshape.py", line 556, in melt
mdata = {col: _tile(frame[col], K) for col in id_vars}
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/reshape.py", line 556, in
mdata = {col: _tile(frame[col], K) for col in id_vars}
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/reshape.py", line 551, in _tile
return cudf.Series._concat(objs=series_list, index=None)
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/series.py", line 1295, in _conca t
col = concat_columns([o._column for o in objs])
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/column/column.py", line 2404, in concat_columns
raise OverflowError(
OverflowError: total size of output is too large for a cudf column

Describe the solution you'd like
Ideally, strings could be temporarily converted to a unique int64, with the pairing of each int64 alias to its string value maintained in a 2-column frame. When I'm finished manipulating my dataset, I could do a join/row update to replace the int64 alias with the actual string that was originally in my dataset. This could be the last step before saving the transformed gdf to parquet or inserting it into a db table.
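
A minimal sketch of that aliasing idea, assuming a hypothetical customer_name column and toy data; the lookup-frame approach below is just one way to express it, not an official cudf recipe:

```python
import cudf

gdf = cudf.DataFrame({
    "customer_name": ["Alice Corp", "Bob LLC", "Alice Corp"],
    "amount": [1.0, 2.0, 3.0],
})

# Two-column lookup frame: one int64 alias per unique string value.
lookup = gdf[["customer_name"]].drop_duplicates().reset_index(drop=True)
lookup["customer_id"] = cudf.Series(range(len(lookup)), dtype="int64")

# Swap the long string column for its compact int64 alias before reshaping.
gdf = gdf.merge(lookup, on="customer_name", how="left").drop(columns=["customer_name"])

# ... melt / pivot / join on customer_id here ...

# Last step: join the original strings back before writing out.
gdf = gdf.merge(lookup, on="customer_id", how="left")
gdf.to_parquet("output.parquet")  # hypothetical output path
```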

Describe alternatives you've considered
I am struggling to understand the issue; all I know is that cudf has a 2B limit, and that includes unique string characters. Even if I can find a workaround for this long string field, my dataset will be longer than 2B rows with a full run, so I probably also need to understand whether handling this long string column will be pointless if I run into a more fundamental gdf limit on the more standard int/float columns.

Additional context
I need a solution that allows me to use EC2 and Python to handle large datasets. cudf has worked wonders in the past, but that was a 4-column time-series gdf... If pandas can outperform it on a large dataset melt, then the overhead of using the GPU and cudf-specific syntax isn't a solution. :(

@pretzelpy pretzelpy added Needs Triage Need team to review and classify feature request New feature or request labels May 12, 2022
@jrhemstad
Contributor

I am struggling to understand the issue; all I know is that cudf has a 2B limit, and that includes unique string characters. Even if I can find a workaround for this long string field, my dataset will be longer than 2B rows with a full run.

I believe you are already aware of the issue. As you say, the number of elements in a column is limited to 2^31 - 1. A column of strings is represented internally with a column of characters. This means a column with 1 string of 2^31 characters would exceed the limit, as would a column of 2^31 strings with 1 character each, and everything in between.
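
As a rough back-of-envelope (the per-row length and column count below are assumptions, not numbers from the issue), melt tiles each id column once per value column, so even 2.2M rows of names can cross that range:

```python
rows = 2_200_000      # rows in the frame being melted
avg_chars = 45        # assumed average customer-name length
value_cols = 25       # assumed number of value columns (K in reshape.py)

total_chars = rows * avg_chars * value_cols   # chars in the tiled id column
limit = 2**31 - 1                             # cudf size_type range

print(total_chars > limit)   # 2_475_000_000 > 2_147_483_647 -> True
```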

The solution is to use dask-cudf, which will partition your input data into smaller pieces, each of which will fit within the size limit. dask-cudf handles the orchestration of executing operations across all of your partitions. The benefit of using dask-cudf is that the same code that helps you overcome the 2GB size limit today will enable you to easily scale up to 20TB.
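
A sketch of what that could look like, assuming dask_cudf's melt follows the dask.dataframe API; the paths and column names are placeholders:

```python
import dask_cudf

# Read the parquet files as many smaller cudf partitions, each under the limit.
ddf = dask_cudf.read_parquet("data/combined/*.parquet")

# melt runs partition-by-partition, so no single column has to hold all of the
# concatenated characters at once.
melted = ddf.melt(id_vars=["customer_name"], var_name="sql_field", value_name="amount")

# Write the result back out as partitioned parquet; this triggers execution.
melted.to_parquet("data/melted/")
```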

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball
Contributor

For more information see #3958

@GregoryKimball GregoryKimball changed the title from [FEA] to [FEA] Help with workflow that reaches column size limit on Apr 18, 2023
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024