[FEA] Help with workflow that reaches column size limit #10836
Comments
I believe you are already aware of the issue. As you say, the number of elements in a column is limited to roughly 2 billion (cudf's 32-bit size_type range). The solution is to use dask-cudf, which will partition your input data into smaller pieces, each of which fits within the size limit. dask-cudf handles the orchestration of executing operations across all of your partitions. The benefit of using dask-cudf is that the same code that helps you overcome the 2GB size limit today will enable you to easily scale up to 20TB.
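For reference, a minimal sketch of that approach (the file paths and column names here are hypothetical):

```python
import dask_cudf

# Each partition is an ordinary cudf DataFrame that individually fits
# within the per-column size limit.
transactions = dask_cudf.read_parquet("data/transactions_*.parquet")
customers = dask_cudf.read_parquet("data/customers.parquet")

# The same cudf-style operations are written once and run on every partition.
joined = transactions.merge(customers, on="customer_id", how="left")

# Execution is lazy; writing the result (or calling .compute()) triggers
# the work across all partitions.
joined.to_parquet("output/joined/")
```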
This issue has been labeled
For more information see #3958
I want to use cudf to load multiple large parquets into a python workflow. Once combined, I need to be able to melt, pivot, reshape, and join the combined gdfs. The dataset I am working with has a customer name column (a string field with 10-80 characters per row), and my combined dataset is already 500M rows long and will get longer.
Below is the runtime error I am getting when trying to melt. The gdf I am trying to melt that throws the following error is only 2.2M rows.
""Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1072-aws x86_64v)
(rapids-22.04) ubuntu@ip-10-195-51-131:~/holding$ python3 cudf_test.py
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/column/column.py", line 2401, in concat_columns
col = libcudf.concat.concat_columns(objs)
File "cudf/_lib/concat.pyx", line 37, in cudf._lib.concat.concat_columns
File "cudf/_lib/concat.pyx", line 41, in cudf._lib.concat.concat_columns
RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/copying/concatenate.cu:391: Total number of concatenated chars exceeds size_type range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "cudf_test.py", line 71, in
df_fpl_fetch = cudf_melt(frame=df_fpl_fetch, id_vars=index_col, var_name='sql_field',value_name='amount')
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/reshape.py", line 556, in melt
mdata = {col: _tile(frame[col], K) for col in id_vars}
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/reshape.py", line 556, in
mdata = {col: _tile(frame[col], K) for col in id_vars}
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/reshape.py", line 551, in _tile
return cudf.Series._concat(objs=series_list, index=None)
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/series.py", line 1295, in _conca t
col = concat_columns([o._column for o in objs])
File "/home/ubuntu/anaconda3/envs/rapids-22.04/lib/python3.8/site-packages/cudf/core/column/column.py", line 2404, in concat_columns
raise OverflowError(
OverflowError: total size of output is too large for a cudf column
Describe the solution you'd like
Ideally, strings could be temporarily converted to a unique int64 alias, with the pairing of each int64 alias to its string value maintained in a 2-column frame. When I'm finished manipulating my dataset, I could do a join/row update to replace the int64 alias with the actual string that was originally in my dataset. This could be a last step before saving the transformed gdf to parquet or inserting it into a db table.
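A sketch of that alias-and-join-back pattern with today's API, assuming hypothetical column names (the mapping frame holds each unique string only once, so it stays far below the limit):

```python
import cudf

df = cudf.DataFrame({
    "customer_name": ["Acme Corp", "Globex", "Acme Corp"],
    "amount": [10.0, 20.0, 30.0],
})

# Build the 2-column mapping: one int64 alias per unique string.
mapping = (
    df[["customer_name"]]
    .drop_duplicates()
    .reset_index(drop=True)
    .reset_index()
    .rename(columns={"index": "customer_alias"})
)
mapping["customer_alias"] = mapping["customer_alias"].astype("int64")

# Swap the string column for its alias before the heavy reshaping.
df = df.merge(mapping, on="customer_name", how="left").drop(columns=["customer_name"])

# ... melt / pivot / join on the compact, all-numeric frame ...

# Last step: join the strings back in before writing to parquet or a db table.
df = df.merge(mapping, on="customer_alias", how="left")
```

cudf's categorical dtype (`astype("category")` plus `.cat.codes`) gives a similar integer alias for free, if keeping the mapping in a separate frame isn't required.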
Describe alternatives you've considered
I am struggling to understand the issue; all I know is that cudf has a 2B limit, and that it includes unique string characters. Even if I can find a workaround for this long string field, my dataset will be longer than 2B rows with a full run, so I probably also need to understand whether handling this long string column will be meaningless if I run into a more fundamental gdf limit on more standard int/float columns.
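To make the limit concrete: cudf uses a 32-bit size_type, so a single column caps out at roughly 2.1 billion rows, and a single string column additionally caps out at roughly 2.1 billion total characters (the "concatenated chars exceeds size_type range" error above). A quick check along these lines (column name hypothetical) shows which limit a given frame is approaching:

```python
import cudf

SIZE_TYPE_MAX = 2**31 - 1  # cudf's 32-bit size_type limit (~2.1 billion)

df = cudf.read_parquet("data/part_0.parquet")  # hypothetical input

n_rows = len(df)
# str.len() counts characters; for mostly-ASCII names this approximates the
# byte total that the string column's offsets must be able to index.
total_chars = int(df["customer_name"].str.len().sum())

print(f"rows: {n_rows:,} of {SIZE_TYPE_MAX:,}")
print(f"string chars: {total_chars:,} of {SIZE_TYPE_MAX:,}")
# Either count crossing the size_type limit in a single column (or a single
# concatenation, as in the melt above) raises the OverflowError shown earlier.
```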
Additional context
I need a solution that allows me to use EC2 and python to handle large datasets. cudf has worked wonders in the past, but that was a 4-column time series gdf. If pandas can outperform cudf on a large dataset melt, then the overhead of using the GPU and unique cudf syntax isn't a solution. :(
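One way to keep the melt on the GPU despite the limit, sketched here with hypothetical file and column names (mirroring the call in the traceback), is to run cudf's melt per partition with dask-cudf; melt is row-wise, so each partition can be reshaped independently:

```python
import cudf
import dask_cudf

index_col = ["customer_name", "period"]  # hypothetical id columns

ddf = dask_cudf.read_parquet("data/fpl_fetch_*.parquet")

# Apply the cudf melt to each partition; no single partition ever has to
# concatenate more characters than size_type allows.
melted = ddf.map_partitions(
    cudf.melt,
    id_vars=index_col,
    var_name="sql_field",
    value_name="amount",
)

melted.to_parquet("output/melted/")
```

Depending on the dask version, an explicit `meta=` argument to `map_partitions` may be needed if the output schema cannot be inferred automatically.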