[FEA] Limitation on amount of Strings that can be loaded into the memory #10976

Closed
mlahir1 opened this issue May 25, 2022 · 2 comments
Labels: feature request, wontfix


mlahir1 commented May 25, 2022

There seems to be a limit of around 2B total chars in a string column, even when GPU memory is not exhausted.

Here is a repro for it.

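import cudf
import cupy as cp

# Generate 500M random integers on the GPU, then move to pandas to build the string column.
df = cudf.DataFrame()
df['a'] = cp.random.randint(0, 999, 500_000_000)
df = df.to_pandas()
df['b'] = 'Hi'
# Strings such as 'Hi42', 3 to 5 characters each.
df['c'] = df.b + df.a.astype('str')
df.to_parquet('test-str.pq')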

import cudf
df = cudf.read_parquet('test-str.pq', columns=['a'])

Reading the int64 column with 0.5B rows works just fine. Memory footprint: 8 bytes x 500,000,000 = 4 GB.

But when I try to read the same number of rows of strings with at most 5 chars each (at most 2.5B chars in total), the memory footprint is only 5 bytes x 500,000,000 = 2.5 GB, which is less than above.

import cudf
df = cudf.read_parquet('test-str.pq', columns=['c'])
df.dtypes, df.head()

cudf throws an exception:
RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/include/cudf/strings/detail/strings_column_factories.cuh:86: total size of strings is too large for cudf column
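
For reference, the size arithmetic behind this error can be estimated without a GPU. The following is a rough sketch, assuming the values are uniformly distributed over 0..998 (the upper bound of `randint` is exclusive) and that each character is a single byte. It compares the expected total character count of column `c` with the largest 32-bit signed integer.

```python
# Back-of-the-envelope estimate for column 'c' in the repro above.
INT_MAX = 2**31 - 1      # 2,147,483,647, the largest 32-bit signed integer
N_ROWS = 500_000_000     # number of rows generated in the repro

# randint(0, 999, ...) draws from 0..998 because the upper bound is exclusive.
values = range(0, 999)

# Each string is 'Hi' (2 chars) followed by the decimal digits of the value.
avg_len = sum(2 + len(str(v)) for v in values) / len(values)

# Expected total under the uniform-distribution assumption.
expected_total_chars = int(avg_len * N_ROWS)

print(f"average string length: {avg_len:.3f}")
print(f"expected total chars:  {expected_total_chars:,}")
print(f"INT_MAX:               {INT_MAX:,}")
print("exceeds INT_MAX:", expected_total_chars > INT_MAX)
```

The estimate comes out at roughly 2.44 billion characters, which is above INT_MAX even though the raw data is smaller than the int64 column.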

mlahir1 added the Needs Triage and feature request labels on May 25, 2022
Contributor

vyasr commented May 25, 2022

This is expected. A strings column cannot contain more than INT_MAX = 2^31 - 1 (about 2.1 billion) characters. See #3958 for a longer discussion. The usual solution for this problem in cudf is to use dask, which will handle chunking that column into multiple pieces for you.
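
To illustrate what chunking means here, below is a minimal pure-Python sketch. It is not how dask or cudf implement partitioning; `plan_chunks` is a hypothetical helper written only for this example. It groups consecutive rows into ranges whose total character count stays at or below a limit, so that each range could be held in its own column.

```python
INT_MAX = 2**31 - 1  # per-column character limit quoted above


def plan_chunks(row_lengths, limit=INT_MAX):
    """Return (start, stop) row ranges whose summed lengths are <= limit.

    Hypothetical helper for illustration; rows are kept in order and
    each row lands in exactly one range.
    """
    chunks = []
    start, running = 0, 0
    for i, n in enumerate(row_lengths):
        if n > limit:
            raise ValueError(f"row {i} alone exceeds the limit")
        if running + n > limit:
            # Close the current range before this row and start a new one.
            chunks.append((start, i))
            start, running = i, 0
        running += n
    if start < len(row_lengths):
        chunks.append((start, len(row_lengths)))
    return chunks


# Tiny demonstration with a limit of 10 characters per chunk.
lengths = [4, 5, 3, 5, 5, 2]
print(plan_chunks(lengths, limit=10))  # [(0, 2), (2, 4), (4, 6)]
```

In practice a partitioned reader does this work for you, typically at the granularity of files or Parquet row groups rather than individual rows.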

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

GregoryKimball added the wontfix label and removed the Needs Triage label on Jun 25, 2022