
[BUG] CUDA error when casting large column vector from long to string #6598

Closed
andygrove opened this issue Sep 22, 2022 · 10 comments
Labels: bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)


Describe the bug
I was working on a repro case for #6431 and ran into a CUDA error when casting longs to strings.

ai.rapids.cudf.CudaFatalException: for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
	at ai.rapids.cudf.ColumnView.castTo(Native Method)
	at ai.rapids.cudf.ColumnView.castTo(ColumnView.java:1876)
	at ai.rapids.cudf.ColumnVector.castTo(ColumnVector.java:790)
	at com.nvidia.spark.rapids.jni.CastStringsTest.castLongToString(CastStringsTest.java:43)

Steps/Code to reproduce bug

Add this test to CastStringsTest in the spark-rapids-jni project:

  @Test
  void castLongToString() {
    int n = 1_000_000_000;
    long[] array = new long[n];
    for (int i = 0; i < n; i++) {
      array[i] = i;
    }
    try (ColumnVector cv = ColumnVector.fromLongs(array);
        ColumnVector cv2 = cv.castTo(DType.STRING)) {
      // success
    }
  }

Expected behavior
The cast should not fail in this way; I would understand getting an OOM error instead.

Environment details (please complete the following information)
Desktop.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:17:00.0 Off |                  Off |
| 44%   67C    P2   110W / 260W |    970MiB / 24220MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Additional context
None

@andygrove added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels Sep 22, 2022
@andygrove

I actually get OOM errors instead with smaller column vectors:

java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: CUDA error at: /home/andygrove/git/spark-rapids-jni/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
        at ai.rapids.cudf.ColumnView.castTo(Native Method)

revans2 commented Sep 22, 2022

My guess is that there are no guards/checks for overflow when creating a string column that is too large. We should fix this in CUDF, but I think it is still going to end up being an error until we have a way to split up batches from within an expression (#1501). There is no good solution to this except to make your batch size smaller.

@andygrove

Thanks @revans2. That makes sense. I filed rapidsai/cudf#11743

ttnghia commented Sep 23, 2022

When creating a strings column, the size of each row is computed and then summed by a prefix scan to find the total size of the output column. If you have too many numbers, as above, the output size will exceed INT_MAX, overflow, and wrap around into a negative number. Such a negative size leads to the error reported above. You get an OOM error only when there is no overflow but the output size is valid and very large.
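
A back-of-the-envelope sketch (plain host-side Java, not libcudf code) of the sizes involved in the one-billion-row example above; the row count comes from the test, everything else is illustrative:

public class CastSizeOverflowSketch {
  public static void main(String[] args) {
    final long n = 1_000_000_000L; // rows holding the values 0 .. n-1
    long totalBytes64 = 0;
    // Values with d digits occupy [10^(d-1), 10^d); count them per decade.
    long lo = 0, hi = 10;
    for (int digits = 1; lo < n; digits++) {
      totalBytes64 += (Math.min(hi, n) - lo) * digits;
      lo = hi;
      hi *= 10;
    }
    // Mimic accumulating into cudf's 32-bit size_type.
    int totalBytes32 = (int) totalBytes64;
    System.out.println("true output size:    " + totalBytes64); // 8888888890 bytes, well past INT_MAX
    System.out.println("32-bit wrapped size: " + totalBytes32); // 298954298, a plausible-looking positive value
  }
}

Note that the final 32-bit total here wraps back around to a positive value, even though the running prefix-scan offsets pass through the negative range on the way; some intermediate offsets are garbage either way, which is consistent with the illegal address error above.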


ttnghia commented Sep 23, 2022

@revans2 @andygrove Checking overflow for this case in cudf (i.e., addressing this issue) would be difficult, because we just can't check it accurately with the current cudf implementation. The overflow may produce a negative size (which can be detected), but it may also wrap back around to a positive value (which can't be detected). We can't reliably check for overflow unless we use an extra 64-bit column to compute the output size.

IMO, if we want to completely address this overflow-check problem, then we may need to do extra work in the JNI layer, beyond the cudf string APIs.

revans2 commented Sep 26, 2022

IMO, if we want to completely address this overflow-check problem, then we may need to do extra work in the JNI layer, beyond the cudf string APIs.

I agree. We could start with something simple, but we are going to run into this problem all over the place. All numbers get larger when they are cast like this, and arrays/maps/etc. have similar issues. We probably need/want a follow-on issue to deal with this generally when casting to a string.

For now we can do some simple things.

The largest number of characters output by a string conversion of a long is 20 (19 digits plus a sign):

scala> Long.MinValue.toString.length
res1: Int = 20

So that means we are guaranteed never to overflow if there are 107,374,182 values or fewer in the column (Int.MaxValue / 20). If we are willing to rely on detecting overflow as a negative size, that bound increases to 214,748,364.
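
A quick sanity check of those two bounds (assuming the worst case of 20 characters per row, as above):

public class BoundsCheck {
  public static void main(String[] args) {
    System.out.println(Integer.MAX_VALUE / 20);  // 107374182: can never overflow
    System.out.println(((1L << 32) - 1) / 20);   // 214748364: any overflow still shows up as a negative 32-bit size
  }
}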

For now, we could not worry about it when the number of rows is below the magic number above (whichever one you feel best about). But if there are more rows, then we might want to write our own operator to check whether the cast would overflow. Generally that means doing a log10 of the absolute value (sort of), because there are corner cases that get in the way. It might be faster to just write a kernel to do it ourselves, or possibly even copy the code from CUDF that calculates the output size and update it to use a long, because that is what we need.
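
A host-side sketch of that guard (hypothetical helper names, not the plugin's actual API). The per-row length uses integer division instead of log10, which sidesteps the floating-point corner cases mentioned above, and the total is accumulated in 64 bits, mirroring the suggestion to redo the size calculation with a long:

public class StringCastGuard {
  // Length of Long.toString(v): digits plus an optional '-' sign.
  static int stringLengthOf(long v) {
    if (v == Long.MIN_VALUE) return 20; // corner case: Math.abs(Long.MIN_VALUE) overflows
    int len = (v < 0) ? 2 : 1;          // sign (if any) plus at least one digit
    long abs = Math.abs(v);
    while (abs >= 10) { abs /= 10; len++; }
    return len;
  }

  // Total output size in 64 bits; cannot wrap for any realistic row count.
  static long castOutputSize(long[] values) {
    long total = 0;
    for (long v : values) total += stringLengthOf(v);
    return total;
  }

  public static void main(String[] args) {
    long[] sample = {0, -1, 42, Long.MIN_VALUE, Long.MAX_VALUE};
    long total = castOutputSize(sample);             // 1 + 2 + 2 + 20 + 19 = 44
    if (total > Integer.MAX_VALUE) {
      throw new IllegalStateException("cast would exceed the string column size limit");
    }
    System.out.println("output size in bytes: " + total);
  }
}

The same per-row function could run as a GPU kernel with a 64-bit reduction; the structure would not change.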

@sameerz added the reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) label and removed the ? - Needs Triage (Need team to review and classify) label Sep 27, 2022
ttnghia commented Dec 9, 2022

Now that rapidsai/cudf#12180 is merged, we can adopt it in libcudf's string conversion APIs to detect overflow during string column creation, so the example above can fail clearly and loudly.

ttnghia commented Feb 13, 2023

It seems that this should already be resolved automatically due to rapidsai/cudf#12180. I'm running a test to check before closing this.

ttnghia commented Feb 13, 2023

Yes, the issue is resolved. We get a more meaningful exception:

[ERROR] castLongToString  Time elapsed: 7.847 s  <<< ERROR!
ai.rapids.cudf.CudfException: CUDF failure at: 
cudf/cpp/include/cudf/strings/detail/strings_children.cuh:80: Size of output exceeds column size limit
	at ai.rapids.cudf.ColumnView.castTo(Native Method)
	at ai.rapids.cudf.ColumnView.castTo(ColumnView.java:1876)
	at ai.rapids.cudf.ColumnVector.castTo(ColumnVector.java:790)
	at com.nvidia.spark.rapids.jni.CastStringsTest.castLongToString(CastStringsTest.java:44)

So, closing this as resolved.

ttnghia closed this as completed Feb 13, 2023