Comet can produce different results to Spark when averaging a decimal #1354

Closed
andygrove opened this issue Jan 30, 2025 · 6 comments · Fixed by #1372

andygrove commented Jan 30, 2025

Describe the bug

Given the following SQL, where c1 is a tinyint and c7 is a decimal(14,6):

SELECT c1, Avg(c7) FROM t1 GROUP BY c1 ORDER BY c1

Some results are different between Spark and Comet, perhaps due to a decimal promotion or rounding difference.

!== Correct Answer - 256 ==                 == Spark Answer - 256 ==
 struct<c1:tinyint,avg(c7):decimal(14,6)>   struct<c1:tinyint,avg(c7):decimal(14,6)>
 [68,0.595938]                              [68,0.595938]
![69,0.520313]                              [69,0.520312]
 [70,0.498929]                              [70,0.498929]

Steps to reproduce

  test("avg decimal") {
    withTempDir { dir =>
      val path = new Path(dir.toURI.toString, "test.parquet")
      val filename = path.toString
      val random = new Random(42)
      withSQLConf(CometConf.COMET_ENABLED.key -> "false") {
        ParquetGenerator.makeParquetFile(
          random,
          spark,
          filename,
          10000,
          DataGenOptions(
            allowNull = true,
            generateNegativeZero = true,
            generateArray = false,
            generateStruct = false,
            generateMap = false))
      }
      val table = spark.read.parquet(filename).coalesce(1)
      table.createOrReplaceTempView("t1")
      checkSparkAnswer("SELECT c1, Avg(c7) FROM t1 GROUP BY c1 ORDER BY c1")
    }
  }

Expected behavior

No response

Additional context

No response

andygrove added the bug label Jan 30, 2025
andygrove added this to the 0.6.0 milestone Jan 30, 2025
andygrove commented:

The issue may be due to formatting differences when casting decimal to string.

andygrove commented:

I think that Spark uses half-up rounding and Arrow truncates.
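
For illustration only (the exact intermediate value here is an assumption, not taken from the failing row), the two rounding modes diverge whenever the unrounded average has a 5 in the seventh decimal place:

  import java.math.{BigDecimal => JBigDecimal, RoundingMode}

  val avg = new JBigDecimal("0.5203125")
  // half-up, as Spark's decimal rounding does: 0.520313
  println(avg.setScale(6, RoundingMode.HALF_UP))
  // truncation (round toward zero), as Arrow appears to do: 0.520312
  println(avg.setScale(6, RoundingMode.DOWN))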

andygrove self-assigned this Feb 5, 2025
andygrove commented:

It turns out that Comet isn't using its AvgDecimal expression; instead it is using the Avg expression, operating on an UnscaledValue, which represents the decimal's underlying long value.
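
As a rough sketch of what that means: UnscaledValue(c7) exposes each decimal(14,6) as its backing long, so Avg runs over plain longs and the decimal scale is only reapplied afterwards. Using Spark's Decimal directly (the value here is illustrative):

  import org.apache.spark.sql.types.Decimal

  // a decimal(14,6) is held as an unscaled long plus the scale 6
  val d = Decimal(BigDecimal("0.520313"), 14, 6)
  println(d.toUnscaledLong) // 520313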

andygrove commented:

The issue is in the cast of the averaged float64 value 0.5153125: Spark rounds up to 0.515313, while Comet rounds down (truncates) to 0.515312.
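
A quick sanity check of the Spark side (assuming the cast in question is effectively double to decimal(14,6); this only demonstrates Spark's half-up behaviour, not Comet's):

  // Spark rounds half-up when casting double to decimal: 0.515313
  spark.sql("SELECT CAST(CAST(0.5153125 AS DOUBLE) AS DECIMAL(14,6)) AS v").show()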

andygrove commented:

We should port the logic from org.apache.spark.sql.types.Decimal#changePrecision to resolve this.
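
A minimal sketch of the behaviour to port, using Spark's Decimal with the value from the previous comment:

  import org.apache.spark.sql.types.Decimal

  // changePrecision applies half-up rounding when reducing scale and
  // returns whether the result still fits the target precision
  val d = Decimal(0.5153125)
  val fits = d.changePrecision(14, 6)
  println((fits, d)) // expected: (true, 0.515313)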

andygrove commented:

I filed an issue for the root cause: #1371
