
[BUG] Incorrect values when parsing dates from timestamps stored in CSV files #1091

Open

andygrove opened this issue Nov 10, 2020 · 2 comments

Labels: bug (Something isn't working), P2 (Not required for release), SQL (part of the SQL/Dataframe plugin)

@andygrove (Contributor) commented:

Describe the bug
If I specify a schema with a DateType column when reading a CSV file that contains timestamps, I get corrupt data when the plugin is enabled.

Steps/Code to reproduce bug

Create a CSV file at tests/src/test/resources/timestamps.csv:

2019-01-03T12:34:56.123456,1
2019-01-03T12:34:56.123456,1
2019-01-03T12:34:56.123456,1
2019-01-05T12:34:56.123456,2
2019-01-05T12:34:56.123456,3
2019-01-06T12:34:56.123456,6

Add this method to tests/src/test/scala/com/nvidia/spark/rapids/SparkQueryCompareTestSuite.scala:

  def timestampsAsDatesCsvDf = {
    fromCsvDf("timestamps.csv", StructType(Array(
      StructField("dates", DateType, false),
      StructField("ints", IntegerType, false)
    )))(_)
  }

Add these tests to tests/src/test/scala/com/nvidia/spark/rapids/CsvScanSuite.scala:

  testSparkResultsAreEqual(
    "Test CSV parse dates",
    datesCsvDf,
    conf = new SparkConf()) {
    df => df.withColumn("next_day", date_add(col("dates"), lit(1)))
  }

  testSparkResultsAreEqual(
    "Test CSV parse timestamps as dates",
    timestampsAsDatesCsvDf,
    conf = new SparkConf()) {
    df => df.withColumn("next_day", date_add(col("dates"), lit(1)))
  }
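
For reference, the snippets above assume the usual Spark imports are already present in these test suites; if they are not, the types and functions used come from the following packages:

  import org.apache.spark.SparkConf
  import org.apache.spark.sql.types.{DateType, IntegerType, StructField, StructType}
  import org.apache.spark.sql.functions.{col, date_add, lit}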

The first test passes but the second test fails with:

CPU: WrappedArray([2019-01-03,1,2019-01-04], [2019-01-03,1,2019-01-04], [2019-01-03,1,2019-01-04], [2019-01-05,2,2019-01-06], [2019-01-05,3,2019-01-06], [2019-01-06,6,2019-01-07])

GPU: WrappedArray([0718-10-22,1,0718-10-23], [0718-10-22,1,0718-10-23], [0718-10-22,1,0718-10-23], [2280-05-20,2,2280-05-21], [2280-05-20,3,2280-05-21], [3779-09-03,6,3779-09-04])

Expected behavior
Both tests should pass: the output should be the same with or without the plugin enabled.

Environment details
Running the tests in an IDE.

Additional context
N/A

@andygrove added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Nov 10, 2020
@andygrove added this to the Nov 9 - Nov 20 milestone on Nov 10, 2020
@andygrove self-assigned this on Nov 10, 2020
@sameerz added the P2 (Not required for release) and SQL (part of the SQL/Dataframe plugin) labels and removed the ? - Needs Triage label on Nov 10, 2020
@sameerz (Collaborator) commented on Nov 11, 2020:

Please update the documentation at https://github.com/NVIDIA/spark-rapids/blob/branch-0.3/docs/compatibility.md#csv-dates when this is fixed.

@andygrove removed this from the Nov 9 - Nov 20 milestone on Nov 11, 2020
@andygrove removed their assignment on Dec 14, 2020
@revans2 mentioned this issue on Apr 1, 2021 (38 tasks)
@revans2 (Collaborator) commented on Aug 15, 2023:

I think that this is working now.

scala> spark.read.schema(StructType(Seq(StructField("dates", DateType, false), StructField("ints", IntegerType)))).csv("./test.csv").collect.foreach(System.out.println)
23/08/15 19:39:03 WARN GpuOverrides: 
*Exec <FileSourceScanExec> will run on GPU

[2019-01-03,1]                                                                  
[2019-01-03,1]
[2019-01-03,1]
[2019-01-05,2]
[2019-01-05,3]
[2019-01-06,6]

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> spark.read.schema(StructType(Seq(StructField("dates", DateType, false), StructField("ints", IntegerType)))).csv("./test.csv").collect.foreach(System.out.println)
[2019-01-03,1]
[2019-01-03,1]
[2019-01-03,1]
[2019-01-05,2]
[2019-01-05,3]
[2019-01-06,6]

Could we get someone to add a test to verify that it continues to work?
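
A regression test along the lines of the original repro could cover this. A minimal sketch, assuming the timestamps.csv fixture and the timestampsAsDatesCsvDf helper from the issue description are added to the test suites (the test name and placement in CsvScanSuite are suggestions, not existing code):

  // Hypothetical regression test: CPU and GPU should now agree on both
  // the parsed dates and the derived next_day column.
  testSparkResultsAreEqual(
    "Test CSV parse timestamps as dates",
    timestampsAsDatesCsvDf,
    conf = new SparkConf()) {
    df => df.withColumn("next_day", date_add(col("dates"), lit(1)))
  }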

@tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue on Nov 30, 2023