[BUG] "RuntimeError: Total number of concatenated rows exceeds size_type range" for 2.2 million row series #8748

Closed
marco-ve opened this issue Jul 15, 2021 · 4 comments · Fixed by #8760
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@marco-ve

Describe the bug
A cudf.Series containing lists will throw a RuntimeError if one tries to access a certain number of rows, depending on the length of the contained lists.

Steps/Code to reproduce bug

import cudf
import numpy as np
gsr = cudf.Series([arr for arr in np.random.uniform(low=-0.2, high=0.2, size=(2000000,512))])
gsr

This will display just fine (2 million rows).

gsr = cudf.Series([arr for arr in np.random.uniform(low=-0.2, high=0.2, size=(2200000,512))])
gsr

With 2.2 million rows, this throws "RuntimeError: cuDF failure at: ../src/copying/concatenate.cu:359: Total number of concatenated rows exceeds size_type range". However, accessing a single element directly by index works fine. Slicing elements [0:60] also works, but [0:61] throws the same exception.

gsr = cudf.Series([arr for arr in np.random.uniform(low=-0.2, high=0.2, size=(2200000,256))])
gsr

This will display just fine (2.2 million rows, but shorter lists).

Expected behavior
The series should allow row access without throwing an error, regardless of the dimensions of the contained objects, just as a pandas Series does.

Environment overview (please complete the following information)
Environment is a jupyterlab notebook hosted in the rapidsai docker container, using an RTX3090 with 24GB VRAM. cuml version is 21.06.01+2.g101fc0fda4.

@marco-ve marco-ve added Needs Triage Need team to review and classify bug Something isn't working labels Jul 15, 2021
@shwina shwina removed the Needs Triage Need team to review and classify label Jul 15, 2021
@shwina
Contributor

shwina commented Jul 15, 2021

List columns are limited by number of elements (not rows). The total number of elements in a list column cannot exceed INT32_MAX. We should raise a better error message here.

Internally, a list column holds a column composed of all its elements, i.e., the following list column:

[1, 2, 3]
[4, 5]
[6, None, 8]

internally holds the column [1, 2, 3, 4, 5, 6, None, 8]. The size of that column cannot exceed INT32_MAX.
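The layout described above can be sketched in plain Python. This is only an illustration of the offsets-plus-child idea, not cuDF's actual implementation; `flatten_lists` and `get_row` are hypothetical helper names introduced here.

```python
def flatten_lists(rows):
    """Flatten a list of lists into (offsets, child).

    offsets[i] and offsets[i + 1] delimit row i inside the flat child list.
    Hypothetical helper for illustration only.
    """
    offsets = [0]
    child = []
    for row in rows:
        child.extend(row)
        offsets.append(len(child))
    return offsets, child


def get_row(offsets, child, i):
    """Recover row i from the flattened representation."""
    return child[offsets[i]:offsets[i + 1]]


rows = [[1, 2, 3], [4, 5], [6, None, 8]]
offsets, child = flatten_lists(rows)

print(offsets)                     # [0, 3, 5, 8]
print(child)                       # [1, 2, 3, 4, 5, 6, None, 8]
print(get_row(offsets, child, 1))  # [4, 5]
```

The limit mentioned above applies to `len(child)` (8 here), not to the number of rows (3 here).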

@shwina shwina added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Jul 15, 2021
@marco-ve
Author

marco-ve commented Jul 15, 2021

Thanks for the clarification. But isn't the max value of an int32 around 2 billion, whereas 2,200,000 * 512 is only 1.1 billion?

Also why does accessing rows [0:60] not throw an error, but [0:61] does?
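The arithmetic in this question can be checked directly. The snippet below is a small sanity check using only the shapes quoted in this issue; `total_elements` is a hypothetical helper name.

```python
INT32_MAX = 2**31 - 1  # 2_147_483_647


def total_elements(n_rows, elements_per_row):
    """Total number of child elements in a list column with fixed-length rows."""
    return n_rows * elements_per_row


# (rows, elements per row) for the three series in the report
shapes = [(2_000_000, 512), (2_200_000, 512), (2_200_000, 256)]

for n_rows, per_row in shapes:
    total = total_elements(n_rows, per_row)
    print(f"{n_rows:>9,} x {per_row}: {total:>13,} elements, "
          f"within INT32_MAX: {total <= INT32_MAX}")
```

All three totals (1,024,000,000, 1,126,400,000 and 563,200,000) are below INT32_MAX, so the element limit alone does not account for the error.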

@shwina shwina removed the libcudf Affects libcudf (C++/CUDA) code. label Jul 15, 2021
@shwina
Contributor

shwina commented Jul 15, 2021

Thanks -- you're right, that is a bug. Investigating.

@shwina
Contributor

shwina commented Jul 15, 2021

Slightly simpler, more revealing reproducer (although I still haven't root-caused this):

s = cudf.Series([arr for arr in np.ones((2_200_000, 512))])  # list column: 2_200_000 rows, 512 elements per row
top = s.iloc[:31]  # slice of the first 31 rows (0-30)
bottom = s.iloc[31:62]  # slice of the next 31 rows (31-61)
cudf.concat([top, bottom])  # errors
cudf.concat([top.copy(), bottom.copy()])  # deep copy before concat, no error

cc: @nvdbaranec
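The thread does not state a root cause, so the following is only a hypothesis, sketched as arithmetic: suppose the overflow check counted the full, unsliced child column of each input instead of only the sliced portion. Under that assumption, concatenating two slices of the same series would be charged twice the full element count. The function names below are hypothetical, and nothing here calls cuDF.

```python
INT32_MAX = 2**31 - 1  # 2_147_483_647


def counted_if_unsliced(n_rows, per_row, n_inputs=2):
    """Elements counted if each input is charged its FULL child column.

    ASSUMPTION: this models a hypothetical faulty check, not confirmed
    by the thread.
    """
    return n_inputs * n_rows * per_row


def counted_if_sliced(slice_lengths, per_row):
    """Elements counted if only the sliced rows are charged."""
    return sum(slice_lengths) * per_row


# (rows, elements per row, whether the report says it raised an error)
cases = [
    (2_000_000, 512, False),
    (2_200_000, 512, True),
    (2_200_000, 256, False),
]

for n_rows, per_row, errored in cases:
    counted = counted_if_unsliced(n_rows, per_row)
    print(f"{n_rows:,} x {per_row}: counted {counted:,}, "
          f"predicted error: {counted > INT32_MAX}, reported error: {errored}")

# Two 31-row slices actually hold far fewer elements than the limit.
print(counted_if_sliced([31, 31], 512))  # 31744
```

Under this assumption the prediction matches all three reported cases. It would also be consistent with the deep-copy variant succeeding, if copying a slice materializes only the sliced rows. None of this is confirmed in the thread.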

rapids-bot bot pushed a commit that referenced this issue Jul 20, 2021
…ecking. (#8760)

Fixes #8748

Note: `concatenate_tests.cpp` was renamed to `concatenate_tests.cu` because of the addition of some thrust calls.

Existing overflow tests moved to `OverflowTest` section.  New tests specific to this PR are:

`OverflowTest.Presliced`
`OverflowTest.BigColumnsSmallSlices`

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Michael Wang (https://github.com/isVoid)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #8760