[BUG] a simple query returns wrong results #6435
Comments
@ilaychen Could you send the details to spark-rapids-support [email protected]?
Hi, I found the case where this issue happens. You can generate a CSV file of ~1k rows with this kind of data:
Read it within a spark-rapids session with:
and count the dataframe; you'll see that the number of rows in the actual file differs from the number of rows in the dataframe.
@ilaychen I duplicated your sample data to 2000+ CSV rows (without header) and used the latest 22.10 snapshot jar to test it.
It matches the sample CSV file:
Could you share below by email to us([email protected])?
Thanks @ilaychen for sharing the sample data and I can reproduce the issue now.
CPU run:
GPU run:
My pleasure! @viadea
The schema that is mentioned above still applies 😃
I filed rapidsai/cudf#11948 in CUDF for the issue as it is their bug. I'll also try to take a look at their code to see what I can come up with.
It looks like there are two issues happening here. The first is that CUDF returns the wrong number of rows: its row-separator logic gets confused when it sees the "" in the data. The second is that CUDF does not support escape characters (#129); it only supports the pandas default of doubling quotes to escape a quote inside a quoted field.
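The two quoting conventions mentioned above can be illustrated with Python's stdlib csv module (used here only as an illustration; CUDF's parser is separate code): the default dialect escapes a quote by doubling it, while backslash-style escaping is only parsed correctly when the reader is configured with `doublequote=False` and an explicit `escapechar`.

```python
import csv
import io

# Pandas/CSV default: a quote inside a quoted field is written as "" .
doubled = '1,"say ""hi""",x\n'
row = next(csv.reader(io.StringIO(doubled)))
print(row[1])  # say "hi"

# Backslash escaping: the same logical value, but the reader must be
# told to use an escape character instead of doubled quotes.
backslashed = '1,"say \\"hi\\"",x\n'
row2 = next(csv.reader(io.StringIO(backslashed),
                       doublequote=False, escapechar='\\'))
print(row2[1])  # say "hi"
```

A parser that only implements the doubled-quote convention will misread data written with backslash escapes, which matches the second issue described in the comment above.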
The CUDF issue rapidsai/cudf#11948 is filed to fix it.
Describe the bug
I'm running a simple query on both spark-rapids and Spark 3.1.3 (on GCP's Dataproc clusters), and I'm getting different results.
The query I'm running (on a ~15TB dataset) is:
spark.sql("SELECT cntry_code, COUNT(cntry_code) as c from locations_ GROUP BY cntry_code sort by c DESC")
The results for pure Spark 3.1.3 are:
The results for spark-rapids are:
The data I'm reading is in CSV format. I'll try to figure out a way to share the dataset if that's important.
Steps/Code to reproduce bug
Start two Dataproc clusters: the first one as mentioned here, the second one running pure Spark 3.1.3.
In both clusters, perform the same Spark SQL query, such as:
spark.sql("SELECT cntry_code, COUNT(cntry_code) as c from locations_ GROUP BY cntry_code sort by c DESC")
and check the results.
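One way to sanity-check either cluster's answer independently of Spark is to compute the same GROUP BY count with a quote-aware stdlib parser. A sketch, assuming the data is local CSV with a `cntry_code` column (the inline sample below is hypothetical, standing in for the real `locations_` data):

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the real locations_ dataset.
sample = io.StringIO(
    'cntry_code\n'
    'US\nIL\nUS\nFR\nUS\nIL\n'
)

# Equivalent of:
#   SELECT cntry_code, COUNT(cntry_code) AS c
#   FROM locations_ GROUP BY cntry_code SORT BY c DESC
reader = csv.DictReader(sample)
counts = Counter(row['cntry_code'] for row in reader)

for code, c in counts.most_common():
    print(code, c)
```

If this reference count agrees with the CPU run but not the GPU run, that points at the GPU CSV reader rather than the aggregation itself.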
Expected behavior
The outputs should be exactly the same.
Environment details (please complete the following information)
For spark-rapids:
For Spark 3.1.3: