[BUG] CUDA error when casting large column vector from long to string #6598
Comments
I actually get OOM errors instead with smaller column vectors:
My guess is that there are no guards/checks for overflow when creating a strings column that is too large. We should fix this in CUDF, but I think it is still going to end up being an error until we have a way to split up batches from within an expression (#1501). There is no good solution to this except to make your batch size smaller.
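For anyone hitting this, a hedged example of the smaller-batch workaround: `spark.rapids.sql.batchSizeBytes` is the plugin's target-batch-size config, and the value shown is illustrative, not a recommendation.

```java
// Shrink the target GPU batch size so a single batch stays well under
// the 2^31-1 byte limit of a cudf strings column. 256 MiB here is an
// illustrative value only; tune it for your workload.
spark.conf().set("spark.rapids.sql.batchSizeBytes", "268435456");
```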
Thanks @revans2. That makes sense. I filed rapidsai/cudf#11743.
When creating a strings column, the size of each row is computed and then summed up by a prefix scan to find the total size of the output column. If you have too many numbers like the above, the output size will exceed the maximum value of cudf's 32-bit `size_type` and overflow.
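To make the arithmetic concrete, a minimal sketch in plain Java with an illustrative row count (not cudf code):

```java
// A long casts to at most 20 characters: "-9223372036854775808" is
// 19 digits plus a sign. At a few hundred million rows, the summed
// output size no longer fits in a 32-bit size type.
long rows = 300_000_000L;                  // illustrative row count
long totalBytes = rows * 20;               // 6,000,000,000 bytes
System.out.println(totalBytes > Integer.MAX_VALUE); // true -> overflow
```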
FYI, cudf doesn't check for such overflow: https://github.com/rapidsai/cudf/blob/d2414582d853fa32960a59cb052aaaf814b1153e/cpp/src/strings/convert/convert_integers.cu#L381
@revans2 @andygrove Checking overflow for this case in cudf (i.e., addressing this issue) would be difficult, because we just can't check it accurately with the current cudf implementation. The overflow may result in a negative size (which can be detected), but it may not (and then can't be detected). We can't tell whether there was overflow unless we use an extra 64-bit column to compute the output size. IMO, if we want to completely address this overflow check problem, we may need to do extra work in the JNI layer beyond the cudf string APIs.
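A small illustration of that point in plain Java arithmetic: the wrapped 32-bit total can come out positive, so a sign check misses it, while a 64-bit accumulator catches it.

```java
// 6,000,000,000 stored in 32 bits wraps around -- and can land on a
// positive value, so "total < 0" is not a reliable overflow check:
long totalBytes = 6_000_000_000L;
int wrapped = (int) totalBytes;   // 1,705,032,704: positive, looks valid
System.out.println(wrapped);
// Only accumulating in 64 bits makes the overflow visible:
System.out.println(totalBytes > Integer.MAX_VALUE); // true
```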
I agree. We could start with something simple, but we are going to run into this problem all over the place. Every number gets longer when it is cast to a string like this, and arrays/maps/etc. have similar issues. We probably need a follow-on issue to deal with this generally when casting to a string. For now we can do some simple things. The largest number of characters output by a string conversion of a long is 20 (19 digits plus a sign).
So that means we are guaranteed to never overflow if there are 107,374,182 values or fewer in the column (`Integer.MAX_VALUE / 20`). For now we could simply not worry about it when the number of rows is below that magic number, whichever threshold you feel best about. But if there are more, then we might want to write our own operator to see if the result would overflow. Generally that means computing something like log10 of the absolute value, though there are corner cases that get in the way, so it might be faster to just write a kernel to do it ourselves. Possibly even copy the code from CUDF that calculates the output size and update it to use a long, because that is what we need.
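A hedged sketch of that two-tier check: skip the work entirely when the row count is under the magic number, otherwise compute exact lengths with a 64-bit total. All names here are illustrative, and a real implementation would likely be a GPU kernel rather than host-side Java.

```java
// Below this row count, even all-worst-case values (20 characters
// each) cannot overflow a 32-bit total.
static final int SAFE_ROW_COUNT = Integer.MAX_VALUE / 20; // 107,374,182

// Exact output length of a long cast to a decimal string, including
// the '-' sign. Long.MIN_VALUE is the corner case: Math.abs overflows.
static int castStringLen(long v) {
    if (v == Long.MIN_VALUE) return 20; // "-9223372036854775808"
    int len = v < 0 ? 1 : 0;            // count the sign
    long abs = Math.abs(v);
    do { len++; abs /= 10; } while (abs > 0);
    return len;
}

static boolean wouldOverflow(long[] values) {
    if (values.length <= SAFE_ROW_COUNT) return false; // guaranteed safe
    long total = 0;                                    // 64-bit accumulator
    for (long v : values) {
        total += castStringLen(v);
        if (total > Integer.MAX_VALUE) return true;
    }
    return false;
}
```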
Now that rapidsai/cudf#12180 is merged, we can adopt it in the libcudf string conversion APIs so that overflow in strings column creation is detected and the example above can fail clearly and loudly.
It seems that this should already be resolved automatically due to rapidsai/cudf#12180. I'm running a test to check before closing this. |
Yes, the issue is resolved. We get a more meaningful exception:
Closing this as resolved.
Describe the bug
I was working on a repro case for #6431 and ran into a CUDA error when casting longs to strings.
Steps/Code to reproduce bug
Add this test in the `spark-rapids-jni` project.
Expected behavior
Should not fail in this way. I would understand getting an OOM error instead.
Environment details (please complete the following information)
Desktop.
Additional context
None