[BUG] Like does not work how we would like it to. #6431

Closed

revans2 opened this issue Aug 26, 2022 · 18 comments

Labels: bug (Something isn't working), P0 (Must have for release)

revans2 (Collaborator) commented Aug 26, 2022

Describe the bug
As part of looking into filing #6430, I did some initial tests with LIKE to get an idea of the performance. I ran into one situation where the query got the wrong answer, and another where it just hung.

Steps/Code to reproduce bug

scala> spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("str_id like \"%1_0%\"").count()
22/08/26 21:19:42 WARN GpuOverrides:
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> count(1) will run on GPU
    *Expression <Count> count(1) will run on GPU
  *Expression <Alias> count(1)#5L AS count#6L will run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <HashAggregateExec> will run on GPU
      *Expression <AggregateExpression> partial_count(1) will run on GPU
        *Expression <Count> count(1) will run on GPU
      *Exec <ProjectExec> will run on GPU
        *Exec <FilterExec> will run on GPU
          *Expression <Like> cast(id#0L as string) LIKE %1_0% will run on GPU
            *Expression <Cast> cast(id#0L as string) will run on GPU
          *Exec <RangeExec> will run on GPU

res0: Long = 528675719

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("str_id like \"%1_0%\"").count()
res2: Long = 68412970

68,412,970 != 528,675,719

(This was run on Spark 3.3.0)

I then did some bisecting to figure out where the errors started.

spark.range(100000000L).selectExpr("CAST(id as STRING) as str_id").filter("str_id like \"%1_0%\"").count()

works, but when I tried 500000000L it hung. The hang is not consistent, though.

Expected behavior
We should get the same answer on the GPU and the CPU. It would be nice to see if #6430 fixes this issue, but the hang alone makes me worried that there might be something wrong with the GPU regexp kernel.

revans2 added the 'bug' (Something isn't working) and '? - Needs Triage' (Need team to review and classify) labels on Aug 26, 2022
revans2 (Collaborator, Author) commented Aug 26, 2022

To make it even worse, when I split the processing up into smaller parts I get the same answer each time.

0 - 100000000L, 100000000L - 200000000L, 200000000L - 300000000L, 300000000L - 400000000L, 400000000L - 500000000L.

It looks like there is a problem with regular expressions once the input data gets too large.
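
A minimal sketch of this kind of chunked check, assuming the sub-ranges listed above (the variable names and structure are illustrative, not the actual commands that were run):

// Run the same LIKE filter on each 100M-row sub-range and sum the per-chunk counts.
val chunkCounts = (0L until 500000000L by 100000000L).map { start =>
  spark.range(start, start + 100000000L)
    .selectExpr("CAST(id as STRING) as str_id")
    .filter("str_id like \"%1_0%\"")
    .count()
}
println(s"per-chunk counts: $chunkCounts, total: ${chunkCounts.sum}")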

revans2 (Collaborator, Author) commented Aug 26, 2022

With a single task thread it looks like it gets the right answer too... I am not sure what is happening here.
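
One way to force a single task thread (an assumption about the setup, not something stated in the comment above) is to run spark-shell in local mode with a single thread:

spark-shell --master local[1]   # plus the usual RAPIDS Accelerator jars and configs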

sameerz added the 'P0' (Must have for release) label and removed the '? - Needs Triage' (Need team to review and classify) label on Aug 30, 2022
andygrove self-assigned this on Aug 31, 2022
andygrove (Contributor) commented Aug 31, 2022

I have reproduced this. Here are some observations so far.

This is the equivalent query using regexp directly (which is what LIKE is doing).

spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())

Time taken: 13308 ms                                                            
res15: Long = 146713377

Time taken: 15013 ms                                                            
res16: Long = 68412969

Time taken: 13830 ms                                                            
res17: Long = 68412967

We should be using non-capture groups here. This speeds things up significantly but results are still inconsistent.

spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(?:.|\n)*1(?:.|\n)0(?:.|\n)*')").count())

Time taken: 10644 ms                                                            
res18: Long = 68412970

Time taken: 10539 ms                                                            
res19: Long = 68412962

I am going to try to repro the issue directly with cuDF next.

andygrove (Contributor) commented:

I'm not sure if this info is helpful, but I tried running on a consumer GPU (RTX 3080) and ran into memory allocation issues followed by a segfault.

22/08/31 20:53:26 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 6333333536 bytes. Total RMM allocated is 645835008 bytes.
22/08/31 20:53:26 ERROR Executor: Exception in task 26.0 in stage 15.0 (TID 271)
java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-202-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:120: cudaErrorMemoryAllocation out of memory
	at ai.rapids.cudf.ColumnView.matchesRe(Native Method)
	at ai.rapids.cudf.ColumnView.matchesRe(ColumnView.java:3218)
#
#  SIGSEGV (0xb) at pc=0x00007f3dbc6a282b, pid=126163, tid=0x00007f3d84fef700
#
# JRE version: OpenJDK Runtime Environment (8.0_342-b07) (build 1.8.0_342-8u342-b07-0ubuntu1~20.04-b07)
# Java VM: OpenJDK 64-Bit Server VM (25.342-b07 mixed mode linux-amd64 )
# Problematic frame:
# C  [cudf7643486203306330059.so+0x299982b]  cudf::numeric_scalar<long>::~numeric_scalar()+0x9b

revans2 (Collaborator, Author) commented Sep 1, 2022

We can also switch over to using the new LIKE support that cuDF added, but it still scares me very much that we are running into such serious issues on a not-too-complicated query. There are only something like 10 or 11 nodes in the regexp, which should not take that much memory. Why would we be running out of native memory? Why are we trying to allocate almost 6 GiB of memory for something this small?

andygrove (Contributor) commented:

I can no longer reproduce this:

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4230 ms                                                             
res5: Long = 68412970

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4211 ms                                                             
res6: Long = 68412970

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4207 ms                                                             
res7: Long = 68412970

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4195 ms                                                             
res8: Long = 68412970

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4191 ms                                                             
res9: Long = 68412970

scala> spark.time(spark.range(1000000000L).selectExpr("CAST(id as STRING) as str_id").filter("regexp_like(str_id, '(.|\n)*1(.|\n)0(.|\n)*')").count())
Time taken: 4194 ms                                                             
res10: Long = 68412970

@revans2 Can you still reproduce this? I don't know whether something changed in my environment or whether something got fixed in cuDF at this point.

andygrove (Contributor) commented:

This last test run was on Spark 3.2.2, not 3.3.0. I will re-test on Spark 3.3.0 as well.

andygrove (Contributor) commented:

I cannot reproduce it with Spark 3.3.0 either.

andygrove (Contributor) commented:

I can reproduce it on my workstation with an RTX 6000, but not on my workstation with an RTX 3080. I will resume trying to reproduce this in a simple Java test.

andygrove (Contributor) commented:

I can also work around the issue by setting spark.rapids.sql.concurrentGpuTasks=1.
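
A minimal sketch of applying that workaround (the config key is from the comment above; setting it this way is just one option, assuming it can be changed at runtime):

spark.conf.set("spark.rapids.sql.concurrentGpuTasks", "1")
// or at submit time: --conf spark.rapids.sql.concurrentGpuTasks=1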

andygrove (Contributor) commented:

I have given up on creating a simple repro case for now and switched to adding diagnostic code to the plugin. This confirms that matchesRe is sometimes producing incorrect results. There does not seem to be any pattern to the errors, so I assume this suggests some form of memory corruption.

Examples from one run:

GpuRLike lhs=166666666 rows (2166666662 bytes) nulls=0
Index=20833484, input=187500150, CPU=true, GPU=false                (0 + 6) / 6]
Index=20833494, input=187500160, CPU=true, GPU=false
Index=20833504, input=187500170, CPU=true, GPU=false
Index=20833514, input=187500180, CPU=true, GPU=false

Examples from another run:

GpuRLike lhs=166666667 rows (2166666675 bytes) nulls=0
Index=0, input=666666666, CPU=false, GPU=true
Index=1, input=666666667, CPU=false, GPU=true
Index=2, input=666666668, CPU=false, GPU=true
Index=3, input=666666669, CPU=false, GPU=true
Index=4, input=666666670, CPU=false, GPU=true
Index=5, input=666666671, CPU=false, GPU=true
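
A simplified, hypothetical version of this kind of CPU-vs-GPU comparison (not the actual plugin diagnostic code) can be done at the Spark level: materialize the matched ids with the plugin enabled and again with it disabled, then diff the two sets. A reduced row count keeps the collect() manageable; in practice it would be scaled back up toward the failing sizes as driver memory allows.

// Run the same predicate on GPU and CPU and diff the matched ids.
val df = spark.range(100000000L).selectExpr("CAST(id as STRING) as str_id")

spark.conf.set("spark.rapids.sql.enabled", "true")
val gpuIds = df.filter("str_id like \"%1_0%\"").collect().map(_.getString(0)).toSet

spark.conf.set("spark.rapids.sql.enabled", "false")
val cpuIds = df.filter("str_id like \"%1_0%\"").collect().map(_.getString(0)).toSet

// Rows matched by only one side are the mismatches to investigate further.
println(s"GPU-only matches: ${gpuIds.diff(cpuIds).size}, CPU-only: ${cpuIds.diff(gpuIds).size}")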

revans2 (Collaborator, Author) commented Sep 26, 2022

@sameerz This has got to be a blocker for 22.10, or we are going to have to disable a massive amount of functionality until we can get this fixed. We use matchesRe in

  • Like
  • Cast
    • string to float
    • string to date
    • string to timestamp
  • CSV reader
  • JSON reader
  • GetTimestamp
  • UnixTimestamp
  • ToUnixTimestamp
  • ToTimestamp

and containsRe in

  • RLike
  • RegExpExtract

And I find it really unlikely that this is only happening in matchesRe and containsRe.

extractRe appears to be used in

  • casting from string to various date/time interval types.
  • RegExpExtract
  • SubstringIndex

andygrove (Contributor) commented:

@revans2 Do you think the diagnostic info here is enough for us to file an issue against cuDF? I was trying to get a simple repro in a Java test, but have not been able to reproduce the same behavior there.

revans2 (Collaborator, Author) commented Sep 26, 2022

@andygrove I think we need to pull them in at this point, just so that they are aware of what is happening. We can let them decide if there is more that they need in order to help us out.

andygrove (Contributor) commented:

Another observation: if I leave spark-shell open for some time (40 minutes in this case) after running the query, I see:

scala> 22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 95)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 18)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 66)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 93)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 20)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 67)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 113)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 111)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 16)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 64)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 62)
22/09/26 21:00:41 ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 21)

andygrove (Contributor) commented:

Disregard the last message; I forgot that I am running with some debug code.

andygrove (Contributor) commented:

I tested with the cuDF fix at rapidsai/cudf#11797 and it appears to resolve the issue.

andygrove (Contributor) commented:

I re-tested with the latest nightly build of cudf and confirmed that this issue is now resolved.
