[BUG] Like does not work how we would like it to. #6431
Comments
To make it even worse, when I split the processing up into smaller parts I get the same answer each time: 0 - 100000000L, 100000000L - 200000000L, 200000000L - 300000000L, 300000000L - 400000000L, 400000000L - 500000000L. It looks like there is a problem with regular expressions when the input data gets too large.
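The chunked check described above relies on an invariant: counting matches over disjoint ranges and summing should reproduce the whole-range count. A toy sketch of that invariant (pure Python with the `re` module standing in for the Spark query; the pattern and data are made up for illustration):

```python
import re

# Hypothetical pattern and data, standing in for the real query's input.
pattern = re.compile(r"a.*b")
data = [f"a{i}b" if i % 3 == 0 else f"x{i}" for i in range(1000)]

def count_matches(rows):
    """Count rows matching the pattern (the stand-in for the LIKE filter)."""
    return sum(1 for r in rows if pattern.match(r))

total = count_matches(data)

# Split the input into disjoint chunks; the per-chunk counts must sum
# to the whole-input count. In the bug report, the GPU results broke
# this invariant on large inputs.
chunks = [data[i:i + 250] for i in range(0, 1000, 250)]
assert sum(count_matches(c) for c in chunks) == total
print(total)
```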
With a single task thread, it looks like it gets the right answer too... Not sure what is happening here.
I have reproduced this. Here are some observations so far. This is the equivalent query using regexp directly (which is what LIKE is doing).
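As background on what "LIKE is doing" here, a sketch (in Python, not the plugin's actual Scala code) of how a SQL LIKE pattern translates to an anchored regular expression: `%` becomes `.*`, `_` becomes `.`, and everything else is escaped as a literal. This ignores LIKE escape-character handling and is illustrative only:

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into an anchored regex.

    '%' matches any sequence of characters, '_' matches a single
    character; everything else is treated literally. (Illustrative
    only -- not the spark-rapids implementation, and LIKE escape
    sequences are not handled.)
    """
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "^" + "".join(out) + "$"

regex = like_to_regex("a%b_c")
print(regex)                                    # ^a.*b.c$
print(bool(re.match(regex, "a-anything-bXc")))  # True
print(bool(re.match(regex, "abc")))             # False
```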
We should be using non-capture groups here. This speeds things up significantly but results are still inconsistent.
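For reference, the distinction the comment is drawing, sketched with Python's `re` module: a capture group `(...)` obliges the engine to record match positions for every group on every match, while a non-capture group `(?:...)` only groups for repetition or alternation. For a pure match/filter the results are identical, which is why the swap is safe; the speedup comes from not materializing capture state (the exact cuDF internals are outside this sketch):

```python
import re

# A capture group records the matched substring; a non-capture group
# (?:...) only groups, recording nothing.
capturing     = re.compile(r"(ab)+c")
non_capturing = re.compile(r"(?:ab)+c")

m1 = capturing.fullmatch("ababc")
m2 = non_capturing.fullmatch("ababc")
print(m1.groups())  # ('ab',)  -- last repetition was captured
print(m2.groups())  # ()       -- nothing captured, same match result
```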
I am going to try and repro the issue directly with cuDF next.
I'm not sure if this info is helpful, but I tried running on a consumer GPU (RTX 3080) and ran into memory allocation issues and then a segfault.
We can also switch over to using the new LIKE support that cuDF added, but it still scares me very much that we are running into very serious issues on a not-too-complicated query. There are only something like 10 or 11 nodes in the regexp; that should not take much memory. Why would we be running out of native memory? Why are we trying to allocate almost 6 GiB of memory for something this small?
I can no longer reproduce this:
@revans2 Can you still reproduce this? I don't know if something changed in my environment or whether something got fixed in cuDF at this point.
This last test run was on Spark 3.2.2, not 3.3.0. I will re-test on Spark 3.3.0 as well.
Cannot reproduce with Spark 3.3.0 either |
I can reproduce it on my workstation with an RTX 6000 but not on my workstation with an RTX 3080. I will resume trying to reproduce this in a simple Java test.
I can also work around the issue by setting |
I have given up on creating a simple repro case for now and switched to adding diagnostic code to the plugin, and this confirms the problem. Examples from one run:
Examples from another run:
@sameerz This has got to be a blocker for 22.10, or we are going to have to disable a massive amount of functionality until we can get this fixed. We use
and
And I find it really unlikely that this is only happening in
@revans2 Do you think the diagnostic info here is enough for us to file an issue against cuDF? I was trying to get a simple repro in a Java test but have not been able to make that reproduce the same behavior.
@andygrove I think we need to pull them in at this point, just so that they are aware of what is happening. We can let them decide if there is more that they need to help us out.
Another observation. If I leave spark-shell open for some time (40 minutes in this case) after running the query, I see:
Disregard the last message - I forgot that I am running with some debug code.
I tested with the cuDF fix at rapidsai/cudf#11797 and it appears to resolve the issue. |
I re-tested with the latest nightly build of cudf and confirmed that this issue is now resolved. |
Describe the bug
As a part of looking into filing #6430 I did some initial tests with LIKE to get an idea of the performance. I ran into a situation where the query got the wrong answer, and another where it just hung.
Steps/Code to reproduce bug
68,412,970 != 528,675,719
(This was run on Spark 3.3.0)
I then tried to do some bisecting to figure out where the errors started.
works, but when I tried 500000000L it hung; the hang is not consistent.
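The bisection described above can be sketched as a standard binary search for the smallest failing size. This is a hypothetical harness, with `check` standing in for running the query at a given row count and comparing the CPU and GPU answers:

```python
def bisect_failure(lo: int, hi: int, check) -> int:
    """Find the smallest size in (lo, hi] for which check() fails.

    Assumes the standard bisection invariant: check(lo) passes and
    check(hi) fails. `check` is a hypothetical stand-in for running
    the query at a given row count and comparing CPU vs GPU results.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if check(mid):
            lo = mid   # still correct at mid; failure starts above
        else:
            hi = mid   # already failing at mid
    return hi

# Example with a synthetic failure threshold (made up for illustration):
threshold = 400_000_000
first_bad = bisect_failure(0, 500_000_000, lambda n: n < threshold)
print(first_bad)  # 400000000
```

Note the invariant means a flaky hang like the one reported here would break the search; in practice each probe would need a timeout and retries.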
Expected behavior
We get the same answer on the GPU and the CPU. It would be nice to see if #6430 fixes this issue, but it also scares me that there might be something wrong with the GPU regexp kernel, just because of the hang.