[BUG] a simple query returns wrong results #6435
Comments
@ilaychen Could you send the details to spark-rapids-support [email protected]?
Hi, I found the case where this issue happens. You can generate a CSV file of ~1k rows with this kind of data:
Read it within a spark-rapids session with:
and count the dataframe; you'll see that the number of rows in the actual file differs from the number of rows in the dataframe.
@ilaychen I duplicated your sample data to 2000+ CSV rows (without header) and used the latest 22.10 snapshot jar to test it.
It matches the sample CSV file:
Could you share below by email to us([email protected])?
Thanks @ilaychen for sharing the sample data and I can reproduce the issue now.
CPU run:
GPU run:
My pleasure! @viadea
The schema that is mentioned above still applies 😃
I filed rapidsai/cudf#11948 in CUDF for the issue as it is their bug. I'll also try to take a look at their code to see what I can come up with.
It looks like there are two issues happening here. The first is that CUDF returns the wrong number of rows: its row-separator logic gets confused when it sees the "" in the data. The second is that CUDF does not support escape characters (#129); it only supports the pandas default of doubling quotes to escape a quote inside a quoted field.
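The two quoting conventions mentioned above can be illustrated with Python's stdlib csv module (used here only as an illustration; CUDF's parser is separate code): the default dialect escapes a quote by doubling it, while backslash-style escaping is only parsed correctly when the reader is configured with `doublequote=False` and an explicit `escapechar`.

```python
import csv
import io

# Pandas/CSV default: a quote inside a quoted field is written as "" .
doubled = '1,"say ""hi""",x\n'
row = next(csv.reader(io.StringIO(doubled)))
print(row[1])  # say "hi"

# Backslash escaping: the same logical value, but the reader must be
# told to use an escape character instead of doubled quotes.
backslashed = '1,"say \\"hi\\"",x\n'
row2 = next(csv.reader(io.StringIO(backslashed),
                       doublequote=False, escapechar='\\'))
print(row2[1])  # say "hi"
```

A parser that only implements the doubled-quote convention will misread data written with backslash escapes, which matches the second issue described in the comment above.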
The CUDF issue rapidsai/cudf#11948 is filed to fix it.
Describe the bug
I'm running a simple query on both spark-rapids and Spark 3.1.3 (on GCP's Dataproc clusters), and I'm getting different results.
The query I'm running (on a ~15TB dataset) is:
spark.sql("SELECT cntry_code, COUNT(cntry_code) as c from locations_ GROUP BY cntry_code sort by c DESC")
The results for pure Spark 3.1.3 are:
The results for spark-rapids are:
The data I'm reading is in CSV format. I'll try to figure out a way to share the dataset if that's important.
Steps/Code to reproduce bug
Start two Dataproc clusters: the first one as mentioned here, the second one running pure Spark 3.1.3.
In both clusters, perform the same Spark SQL query, such as:
spark.sql("SELECT cntry_code, COUNT(cntry_code) as c from locations_ GROUP BY cntry_code sort by c DESC")
and check the results.
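One way to sanity-check either cluster's answer independently of Spark is to compute the same GROUP BY count with a quote-aware stdlib parser. A sketch, assuming the data is local CSV with a `cntry_code` column (the inline sample below is hypothetical, standing in for the real `locations_` data):

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the real locations_ dataset.
sample = io.StringIO(
    'cntry_code\n'
    'US\nIL\nUS\nFR\nUS\nIL\n'
)

# Equivalent of:
#   SELECT cntry_code, COUNT(cntry_code) AS c
#   FROM locations_ GROUP BY cntry_code SORT BY c DESC
reader = csv.DictReader(sample)
counts = Counter(row['cntry_code'] for row in reader)

for code, c in counts.most_common():
    print(code, c)
```

If this reference count agrees with the CPU run but not the GPU run, that points at the GPU CSV reader rather than the aggregation itself.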
Expected behavior
The outputs should be exactly the same.
Environment details (please complete the following information)
For spark-rapids:
For Spark 3.1.3: