
[BUG] CUDA error when casting large column vector from long to string #6598

Closed
andygrove opened this issue Sep 22, 2022 · 10 comments
Labels: bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)


Describe the bug
I was working on a repro case for #6431 and ran into a CUDA error when casting longs to strings.

ai.rapids.cudf.CudaFatalException: for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
	at ai.rapids.cudf.ColumnView.castTo(Native Method)
	at ai.rapids.cudf.ColumnView.castTo(ColumnView.java:1876)
	at ai.rapids.cudf.ColumnVector.castTo(ColumnVector.java:790)
	at com.nvidia.spark.rapids.jni.CastStringsTest.castLongToString(CastStringsTest.java:43)

Steps/Code to reproduce bug

Add this test to CastStringsTest in the spark-rapids-jni project:

  @Test
  void castLongToString() {
    int n = 1_000_000_000;
    long[] array = new long[n];
    for (int i = 0; i < n; i++) {
      array[i] = i;
    }
    try (ColumnVector cv = ColumnVector.fromLongs(array);
        ColumnVector cv2 = cv.castTo(DType.STRING)) {
      // success
    }
  }

Expected behavior
The cast should not fail in this way; I would understand getting an OOM error instead.

Environment details (please complete the following information)
Desktop.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:17:00.0 Off |                  Off |
| 44%   67C    P2   110W / 260W |    970MiB / 24220MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Additional context
None

@andygrove added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels Sep 22, 2022
@andygrove

I actually get OOM errors instead with smaller column vectors:

java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: CUDA error at: /home/andygrove/git/spark-rapids-jni/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
        at ai.rapids.cudf.ColumnView.castTo(Native Method)

revans2 commented Sep 22, 2022

My guess is that there are no guards/checks for overflow when creating a string column that is too large. We should fix this in CUDF, but I think it is still going to end up being an error until we have a way to split up batches from within an expression (#1501). There is no good solution to this except to make your batch size smaller.

@andygrove

Thanks @revans2. That makes sense. I filed rapidsai/cudf#11743

ttnghia commented Sep 23, 2022

When creating a strings column, the size of each row is computed and then summed by a prefix scan to find the total size of the output column. If you have too many numbers, as above, the output size will exceed INT_MAX, overflow, and wrap around into a negative number. Such a negative size leads to the error reported above. You get an OOM error only when there is no overflow but the output size is valid and very large.
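
A back-of-the-envelope sketch (plain host-side Java, not libcudf code) of the sizes involved in the one-billion-row example above; the row count comes from the test, everything else is illustrative:

public class CastSizeOverflowSketch {
  public static void main(String[] args) {
    final long n = 1_000_000_000L; // rows holding the values 0 .. n-1
    long totalBytes64 = 0;
    // Values with d digits occupy [10^(d-1), 10^d); count them per decade.
    long lo = 0, hi = 10;
    for (int digits = 1; lo < n; digits++) {
      totalBytes64 += (Math.min(hi, n) - lo) * digits;
      lo = hi;
      hi *= 10;
    }
    // Mimic accumulating into cudf's 32-bit size_type.
    int totalBytes32 = (int) totalBytes64;
    System.out.println("true output size:    " + totalBytes64); // 8888888890 bytes, well past INT_MAX
    System.out.println("32-bit wrapped size: " + totalBytes32); // 298954298, a plausible-looking positive value
  }
}

Note that the final 32-bit total here wraps back around to a positive value, even though the running prefix-scan offsets pass through the negative range on the way; some intermediate offsets are garbage either way, which is consistent with the illegal address error above.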


ttnghia commented Sep 23, 2022

@revans2 @andygrove Checking overflow for this case in cudf (i.e., addressing this issue) would be difficult, because we just can't check it accurately with the current cudf implementation. The overflow may produce a negative size (which can be detected), but it may also wrap back around to a positive value (which can't be detected). We can't reliably check for overflow unless we use an extra 64-bit column to compute the output size.

IMO, if we want to completely address this overflow-check problem, then we may need to do extra work in the JNI layer, beyond the cudf string APIs.

revans2 commented Sep 26, 2022

IMO, if we want to completely address this overflow-check problem, then we may need to do extra work in the JNI layer, beyond the cudf string APIs.

I agree. We could start with something simple, but we are going to run into this problem all over the place. All numbers get larger when they are cast like this, and arrays/maps/etc. have similar issues. We probably need/want a follow-on issue to deal with this generally when casting to a string.

For now we can do some simple things.

The largest number of characters output by a string conversion of a long is 20 (19 digits plus a sign):

scala> Long.MinValue.toString.length
res1: Int = 20

So that means we are guaranteed never to overflow if there are 107,374,182 values or fewer in the column (Int.MaxValue / 20). If we are willing to rely on detecting overflow as a negative size, that bound increases to 214,748,364.
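
A quick sanity check of those two bounds (assuming the worst case of 20 characters per row, as above):

public class BoundsCheck {
  public static void main(String[] args) {
    System.out.println(Integer.MAX_VALUE / 20);  // 107374182: can never overflow
    System.out.println(((1L << 32) - 1) / 20);   // 214748364: any overflow still shows up as a negative 32-bit size
  }
}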

For now, we could not worry about it when the number of rows is below the magic number above (whichever one you feel best about). But if there are more rows, then we might want to write our own operator to check whether the cast would overflow. Generally that means doing a log10 of the absolute value (sort of), because there are corner cases that get in the way. It might be faster to just write a kernel to do it ourselves, or possibly even copy the code from CUDF that calculates the output size and update it to use a long, because that is what we need.
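
A host-side sketch of that guard (hypothetical helper names, not the plugin's actual API). The per-row length uses integer division instead of log10, which sidesteps the floating-point corner cases mentioned above, and the total is accumulated in 64 bits, mirroring the suggestion to redo the size calculation with a long:

public class StringCastGuard {
  // Length of Long.toString(v): digits plus an optional '-' sign.
  static int stringLengthOf(long v) {
    if (v == Long.MIN_VALUE) return 20; // corner case: Math.abs(Long.MIN_VALUE) overflows
    int len = (v < 0) ? 2 : 1;          // sign (if any) plus at least one digit
    long abs = Math.abs(v);
    while (abs >= 10) { abs /= 10; len++; }
    return len;
  }

  // Total output size in 64 bits; cannot wrap for any realistic row count.
  static long castOutputSize(long[] values) {
    long total = 0;
    for (long v : values) total += stringLengthOf(v);
    return total;
  }

  public static void main(String[] args) {
    long[] sample = {0, -1, 42, Long.MIN_VALUE, Long.MAX_VALUE};
    long total = castOutputSize(sample);             // 1 + 2 + 2 + 20 + 19 = 44
    if (total > Integer.MAX_VALUE) {
      throw new IllegalStateException("cast would exceed the string column size limit");
    }
    System.out.println("output size in bytes: " + total);
  }
}

The same per-row function could run as a GPU kernel with a 64-bit reduction; the structure would not change.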

@sameerz added the reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) label and removed the ? - Needs Triage (Need team to review and classify) label Sep 27, 2022
ttnghia commented Dec 9, 2022

Now that rapidsai/cudf#12180 is merged, we can adopt it in libcudf's string conversion APIs to detect overflow during string column creation, so the example above can fail clearly and loudly.

ttnghia commented Feb 13, 2023

It seems that this should already be resolved automatically due to rapidsai/cudf#12180. I'm running a test to check before closing this.

ttnghia commented Feb 13, 2023

Yes, the issue is resolved. We get a more meaningful exception:

[ERROR] castLongToString  Time elapsed: 7.847 s  <<< ERROR!
ai.rapids.cudf.CudfException: CUDF failure at: 
cudf/cpp/include/cudf/strings/detail/strings_children.cuh:80: Size of output exceeds column size limit
	at ai.rapids.cudf.ColumnView.castTo(Native Method)
	at ai.rapids.cudf.ColumnView.castTo(ColumnView.java:1876)
	at ai.rapids.cudf.ColumnVector.castTo(ColumnVector.java:790)
	at com.nvidia.spark.rapids.jni.CastStringsTest.castLongToString(CastStringsTest.java:44)

So, closing this as resolved.

ttnghia closed this as completed Feb 13, 2023