Comet can produce different results to Spark when averaging a decimal #1354

Closed
andygrove opened this issue Jan 30, 2025 · 6 comments · Fixed by #1372

andygrove commented Jan 30, 2025

Describe the bug

Given the following SQL, where c1 is a tinyint and c7 is a decimal(14,6):

SELECT c1, Avg(c7) FROM t1 GROUP BY c1 ORDER BY c1

Some results are different between Spark and Comet, perhaps due to a decimal promotion or rounding difference.

!== Correct Answer - 256 ==                 == Spark Answer - 256 ==
 struct<c1:tinyint,avg(c7):decimal(14,6)>   struct<c1:tinyint,avg(c7):decimal(14,6)>
 [68,0.595938]                              [68,0.595938]
![69,0.520313]                              [69,0.520312]
 [70,0.498929]                              [70,0.498929]

Steps to reproduce

  test("avg decimal") {
    withTempDir { dir =>
      val path = new Path(dir.toURI.toString, "test.parquet")
      val filename = path.toString
      val random = new Random(42)
      withSQLConf(CometConf.COMET_ENABLED.key -> "false") {
        ParquetGenerator.makeParquetFile(
          random,
          spark,
          filename,
          10000,
          DataGenOptions(
            allowNull = true,
            generateNegativeZero = true,
            generateArray = false,
            generateStruct = false,
            generateMap = false))
      }
      val table = spark.read.parquet(filename).coalesce(1)
      table.createOrReplaceTempView("t1")
      checkSparkAnswer("SELECT c1, Avg(c7) FROM t1 GROUP BY c1 ORDER BY c1")
    }
  }

Expected behavior

No response

Additional context

No response

andygrove added the bug label Jan 30, 2025
andygrove added this to the 0.6.0 milestone Jan 30, 2025
andygrove commented:

The issue may be due to formatting differences when casting decimal to string.

andygrove commented:

I think that Spark uses half-up rounding and Arrow truncates.
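
For illustration only (the exact intermediate value here is an assumption, not taken from the failing row), the two rounding modes diverge whenever the unrounded average has a 5 in the seventh decimal place:

  import java.math.{BigDecimal => JBigDecimal, RoundingMode}

  val avg = new JBigDecimal("0.5203125")
  // half-up, as Spark's decimal rounding does: 0.520313
  println(avg.setScale(6, RoundingMode.HALF_UP))
  // truncation (round toward zero), as Arrow appears to do: 0.520312
  println(avg.setScale(6, RoundingMode.DOWN))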

andygrove self-assigned this Feb 5, 2025
andygrove commented:

It turns out that Comet isn't using its AvgDecimal expression; instead it is using the Avg expression, operating on an UnscaledValue, which represents the decimal's underlying long value.
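
As a rough sketch of what that means: UnscaledValue(c7) exposes each decimal(14,6) as its backing long, so Avg runs over plain longs and the decimal scale is only reapplied afterwards. Using Spark's Decimal directly (the value here is illustrative):

  import org.apache.spark.sql.types.Decimal

  // a decimal(14,6) is held as an unscaled long plus the scale 6
  val d = Decimal(BigDecimal("0.520313"), 14, 6)
  println(d.toUnscaledLong) // 520313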

andygrove commented:

The issue is in the cast of the averaged float64 value 0.5153125: Spark rounds up to 0.515313, while Comet rounds down (truncates) to 0.515312.
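
A quick sanity check of the Spark side (assuming the cast in question is effectively double to decimal(14,6); this only demonstrates Spark's half-up behaviour, not Comet's):

  // Spark rounds half-up when casting double to decimal: 0.515313
  spark.sql("SELECT CAST(CAST(0.5153125 AS DOUBLE) AS DECIMAL(14,6)) AS v").show()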

andygrove commented:

We should port the logic from org.apache.spark.sql.types.Decimal#changePrecision to resolve this.
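
A minimal sketch of the behaviour to port, using Spark's Decimal with the value from the previous comment:

  import org.apache.spark.sql.types.Decimal

  // changePrecision applies half-up rounding when reducing scale and
  // returns whether the result still fits the target precision
  val d = Decimal(0.5153125)
  val fits = d.changePrecision(14, 6)
  println((fits, d)) // expected: (true, 0.515313)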

andygrove commented:

I filed an issue for the root cause: #1371
