[BUG] Like does not work how we would like it to. #6431

Closed

revans2 opened this issue Aug 26, 2022 · 18 comments

Labels: bug (Something isn't working), P0 (Must have for release)

revans2 (Collaborator) commented Aug 26, 2022

Describe the bug
As part of looking into filing #6430, I did some initial tests with LIKE to get an idea of the performance. I ran into one situation where the query got the wrong answer, and another where it just hung.

Steps/Code to reproduce bug

scala> spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("str_id like \"%1_0%\"").count()
22/08/26 21:19:42 WARN GpuOverrides:
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> count(1) will run on GPU
    *Expression <Count> count(1) will run on GPU
  *Expression <Alias> count(1)#5L AS count#6L will run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <HashAggregateExec> will run on GPU
      *Expression <AggregateExpression> partial_count(1) will run on GPU
        *Expression <Count> count(1) will run on GPU
      *Exec <ProjectExec> will run on GPU
        *Exec <FilterExec> will run on GPU
          *Expression <Like> cast(id#0L as string) LIKE %1_0% will run on GPU
            *Expression <Cast> cast(id#0L as string) will run on GPU
          *Exec <RangeExec> will run on GPU

res0: Long = 528675719

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("str_id like \"%1_0%\"").count()
res2: Long = 68412970

68,412,970 != 528,675,719

(This was run on Spark 3.3.0)

I then did some bisecting to figure out where the errors started.

spark.range(100000000L).selectExpr("CAST(id as STRING) as str_id").filter("str_id like \"%1_0%\"").count()

works, but when I tried 500000000L it hung. The hang is not consistent, though.

Expected behavior
We should get the same answer on the GPU and the CPU. It would be nice to see if #6430 fixes this issue, but the hang alone makes me worried that there might be something wrong with the GPU regexp kernel.

revans2 added the 'bug' (Something isn't working) and '? - Needs Triage' (Need team to review and classify) labels on Aug 26, 2022
revans2 (Collaborator, Author) commented Aug 26, 2022

To make it even worse, when I split the processing up into smaller parts I get the same answer each time.

0 - 100000000L, 100000000L - 200000000L, 200000000L - 300000000L, 300000000L - 400000000L, 400000000L - 500000000L.

It looks like there is a problem with regular expressions once the input data gets too large.
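
A minimal sketch of this kind of chunked check, assuming the sub-ranges listed above (the variable names and structure are illustrative, not the actual commands that were run):

// Run the same LIKE filter on each 100M-row sub-range and sum the per-chunk counts.
val chunkCounts = (0L until 500000000L by 100000000L).map { start =>
  spark.range(start, start + 100000000L)
    .selectExpr("CAST(id as STRING) as str_id")
    .filter("str_id like \"%1_0%\"")
    .count()
}
println(s"per-chunk counts: $chunkCounts, total: ${chunkCounts.sum}")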

revans2 (Collaborator, Author) commented Aug 26, 2022

With a single task thread it looks like it gets the right answer too... I am not sure what is happening here.
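
One way to force a single task thread (an assumption about the setup, not something stated in the comment above) is to run spark-shell in local mode with a single thread:

spark-shell --master local[1]   # plus the usual RAPIDS Accelerator jars and configs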

sameerz added the 'P0' (Must have for release) label and removed the '? - Needs Triage' (Need team to review and classify) label on Aug 30, 2022
andygrove self-assigned this on Aug 31, 2022
andygrove (Contributor) commented Aug 31, 2022

I have reproduced this. Here are some observations so far.

This is the equivalent query using regexp directly (which is what LIKE is doing).

spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())

Time taken: 13308 ms                                                            
res15: Long = 146713377

Time taken: 15013 ms                                                            
res16: Long = 68412969

Time taken: 13830 ms                                                            
res17: Long = 68412967

We should be using non-capture groups here. This speeds things up significantly but results are still inconsistent.

spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(?:.|\n)*1(?:.|\n)0(?:.|\n)*')").count())

Time taken: 10644 ms                                                            
res18: Long = 68412970

Time taken: 10539 ms                                                            
res19: Long = 68412962

I am going to try to repro the issue directly with cuDF next.

andygrove (Contributor) commented:

I'm not sure if this info is helpful, but I tried running on a consumer GPU (RTX 3080) and ran into memory allocation issues followed by a segfault.

22/08/31 20:53:26 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 6333333536 bytes. Total RMM allocated is 645835008 bytes.
22/08/31 20:53:26 ERROR Executor: Exception in task 26.0 in stage 15.0 (TID 271)
java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-202-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:120: cudaErrorMemoryAllocation out of memory
	at ai.rapids.cudf.ColumnView.matchesRe(Native Method)
	at ai.rapids.cudf.ColumnView.matchesRe(ColumnView.java:3218)
#
#  SIGSEGV (0xb) at pc=0x00007f3dbc6a282b, pid=126163, tid=0x00007f3d84fef700
#
# JRE version: OpenJDK Runtime Environment (8.0_342-b07) (build 1.8.0_342-8u342-b07-0ubuntu1~20.04-b07)
# Java VM: OpenJDK 64-Bit Server VM (25.342-b07 mixed mode linux-amd64 )
# Problematic frame:
# C  [cudf7643486203306330059.so+0x299982b]  cudf::numeric_scalar<long>::~numeric_scalar()+0x9b

revans2 (Collaborator, Author) commented Sep 1, 2022

We can also switch over to using the new LIKE support that cuDF added, but it still scares me very much that we are running into such serious issues on a not-too-complicated query. There are only something like 10 or 11 nodes in the regexp, which should not take that much memory. Why would we be running out of native memory? Why are we trying to allocate almost 6 GiB of memory for something this small?

andygrove (Contributor) commented:

I can no longer reproduce this:

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4230 ms                                                             
res5: Long = 68412970

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4211 ms                                                             
res6: Long = 68412970

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4207 ms                                                             
res7: Long = 68412970

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4195 ms                                                             
res8: Long = 68412970

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4191 ms                                                             
res9: Long = 68412970

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4194 ms                                                             
res10: Long = 68412970

@revans2 Can you still reproduce this? I don't know whether something changed in my environment or whether something got fixed in cuDF at this point.

andygrove (Contributor) commented:

This last test run was on Spark 3.2.2, not 3.3.0. I will re-test on Spark 3.3.0 as well.

andygrove (Contributor) commented:

I cannot reproduce it with Spark 3.3.0 either.

andygrove (Contributor) commented:

I can reproduce it on my workstation with an RTX 6000, but not on my workstation with an RTX 3080. I will resume trying to reproduce this in a simple Java test.

andygrove (Contributor) commented:

I can also work around the issue by setting spark.rapids.sql.concurrentGpuTasks=1.
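
A minimal sketch of applying that workaround (the config key is from the comment above; setting it this way is just one option, assuming it can be changed at runtime):

spark.conf.set("spark.rapids.sql.concurrentGpuTasks", "1")
// or at submit time: --conf spark.rapids.sql.concurrentGpuTasks=1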

andygrove (Contributor) commented:

I have given up on creating a simple repro case for now and switched to adding diagnostic code to the plugin. This confirms that matchesRe is sometimes producing incorrect results. There does not seem to be any pattern to the errors, so I assume this suggests some form of memory corruption.

Examples from one run:

GpuRLike lhs=166666666 rows (2166666662 bytes) nulls=0
Index=20833484, input=187500150, CPU=true, GPU=false                (0 + 6) / 6]
Index=20833494, input=187500160, CPU=true, GPU=false
Index=20833504, input=187500170, CPU=true, GPU=false
Index=20833514, input=187500180, CPU=true, GPU=false

Examples from another run:

GpuRLike lhs=166666667 rows (2166666675 bytes) nulls=0
Index=0, input=666666666, CPU=false, GPU=true
Index=1, input=666666667, CPU=false, GPU=true
Index=2, input=666666668, CPU=false, GPU=true
Index=3, input=666666669, CPU=false, GPU=true
Index=4, input=666666670, CPU=false, GPU=true
Index=5, input=666666671, CPU=false, GPU=true
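
A simplified, hypothetical version of this kind of CPU-vs-GPU comparison (not the actual plugin diagnostic code) can be done at the Spark level: materialize the matched ids with the plugin enabled and again with it disabled, then diff the two sets. A reduced row count keeps the collect() manageable; in practice it would be scaled back up toward the failing sizes as driver memory allows.

// Run the same predicate on GPU and CPU and diff the matched ids.
val df = spark.range(100000000L).selectExpr("CAST(id as STRING) as str_id")

spark.conf.set("spark.rapids.sql.enabled", "true")
val gpuIds = df.filter("str_id like \"%1_0%\"").collect().map(_.getString(0)).toSet

spark.conf.set("spark.rapids.sql.enabled", "false")
val cpuIds = df.filter("str_id like \"%1_0%\"").collect().map(_.getString(0)).toSet

// Rows matched by only one side are the mismatches to investigate further.
println(s"GPU-only matches: ${gpuIds.diff(cpuIds).size}, CPU-only: ${cpuIds.diff(gpuIds).size}")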

revans2 (Collaborator, Author) commented Sep 26, 2022

@sameerz This has got to be a blocker for 22.10, or we are going to have to disable a massive amount of functionality until we can get this fixed. We use matchesRe in

  • Like
  • Cast
    • string to float
    • string to date
    • string to timestamp
  • CSV reader
  • JSON reader
  • GetTimestamp
  • UnixTimestamp
  • ToUnixTimestamp
  • ToTimestamp

and containsRe in

  • RLike
  • RegExpExtract

And I find it really unlikely that this is only happening in matchesRe and containsRe.

extractRe appears to be used in

  • casting from string to various date/time interval types.
  • RegExpExtract
  • SubstringIndex

andygrove (Contributor) commented:

@revans2 Do you think the diagnostic info here is enough for us to file an issue against cuDF? I was trying to get a simple repro in a Java test, but have not been able to reproduce the same behavior there.

revans2 (Collaborator, Author) commented Sep 26, 2022

@andygrove I think we need to pull them in at this point, just so that they are aware of what is happening. We can let them decide if there is more that they need in order to help us out.

andygrove (Contributor) commented:

Another observation: if I leave spark-shell open for some time (40 minutes in this case) after running the query, I see:

scala> 22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 95)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 18)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 66)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 93)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 20)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 67)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 113)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 111)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 16)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 64)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 62)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 21)

andygrove (Contributor) commented:

Disregard the last message; I forgot that I am running with some debug code.

andygrove (Contributor) commented:

I tested with the cuDF fix at rapidsai/cudf#11797 and it appears to resolve the issue.

andygrove (Contributor) commented:

I re-tested with the latest nightly build of cudf and confirmed that this issue is now resolved.
