[BUG] IGNORE ORDER, WITH DECIMALS: [Window] [MIXED WINDOW SPECS] FAILED in spark 3.0.3+ #2964

pxLi · 2021-07-19T08:56:49Z

The window function unit test fails on Spark 3.0.3+ (3.1.x is not affected).
To reproduce:

mvn test -Pspark303tests,snapshot-shims
or
mvn test -Pspark304tests,snapshot-shims

The error:

[2021-07-19T08:05:38.204Z] - IGNORE ORDER, WITH DECIMALS: [Window] [MIXED WINDOW SPECS]  *** FAILED ***

[2021-07-19T08:05:38.204Z]   canonicalizationMatchesCpu=false != canonicalizationMatchesGpu=true

[2021-07-19T08:05:38.204Z]   CPU plan: *(10) Project [none#4L, none#5, none#2, none#3L, none#6, none#7L]

[2021-07-19T08:05:38.204Z]   +- !Window [max(none#1L) windowspecdefinition(none#0L, none#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, currentrow$(), unboundedfollowing$())) AS #0L], [none#0L], [none#1L ASC NULLS FIRST]

[2021-07-19T08:05:38.204Z]      +- *(9) !Sort [none#0L ASC NULLS FIRST, none#1L ASC NULLS FIRST], false, 0

[2021-07-19T08:05:38.204Z]         +- *(9) !Project [none#0L, none#2L, none#3, none#4L, none#5L, none#6, none#7]

[2021-07-19T08:05:38.204Z]            +- !Window [row_number() windowspecdefinition(none#0L, none#1 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS #0], [none#0L], [none#1 DESC NULLS LAST]

[2021-07-19T08:05:38.204Z]               +- *(8) !Sort [none#0L ASC NULLS FIRST, none#1 DESC NULLS LAST], false, 0

[2021-07-19T08:05:38.204Z]                  +- *(8) Project [none#1L, none#3, none#4L, none#5, none#6L, none#7L, none#8]

[2021-07-19T08:05:38.204Z]                     +- !Window [min(none#0) windowspecdefinition(none#1L, none#2 DESC NULLS LAST, specifiedwindowframe(RangeFrame, -2 days, 3 days)) AS #0], [none#1L], [none#2 DESC NULLS LAST]

[2021-07-19T08:05:38.204Z]                        +- *(7) !Sort [none#1L ASC NULLS FIRST, none#2 DESC NULLS LAST], false, 0

[2021-07-19T08:05:38.204Z]                           +- *(7) !Project [none#0, none#1L, none#3, none#4, none#5L, none#6, none#7L, none#8L]

[2021-07-19T08:05:38.204Z]                              +- Window [sum(cast(none#0 as bigint)) windowspecdefinition(none#1L, none#2 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -2 days, currentrow$())) AS #0L], [none#1L], [none#2 ASC NULLS FIRST]

[2021-07-19T08:05:38.204Z]                                 +- *(6) Sort [none#1L ASC NULLS FIRST, none#2 ASC NULLS FIRST], false, 0

[2021-07-19T08:05:38.204Z]                                    +- Exchange hashpartitioning(none#1L, 2), true, [id=#90159]

[2021-07-19T08:05:38.204Z]                                       +- *(5) Project [none#0, none#1L, none#2, none#3, none#6, none#7L, none#8, none#9L]

[2021-07-19T08:05:38.204Z]                                          +- !Window [count(1) windowspecdefinition(none#4, none#5 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS #0L], [none#4], [none#5 DESC NULLS LAST]

[2021-07-19T08:05:38.204Z]                                             +- *(4) !Sort [none#4 ASC NULLS FIRST, none#5 DESC NULLS LAST], false, 0

[2021-07-19T08:05:38.204Z]                                                +- *(4) !Project [none#0, none#1L, none#2, none#3, none#4, none#6, none#7, none#8L, none#9]

[2021-07-19T08:05:38.204Z]                                                   +- !Window [max(none#0) windowspecdefinition(none#4, none#5 ASC NULLS FIRST, specifiedwindowframe(RowFrame, -5, 5)) AS #0], [none#4], [none#5 ASC NULLS FIRST]

[2021-07-19T08:05:38.204Z]                                                      +- *(3) !Sort [none#4 ASC NULLS FIRST, none#5 ASC NULLS FIRST], false, 0

[2021-07-19T08:05:38.204Z]                                                         +- Exchange hashpartitioning(none#4, 2), true, [id=#90139]

[2021-07-19T08:05:38.205Z]                                                            +- *(2) Project [none#3, none#0L, cast(none#2L as timestamp) AS #0, cast(none#2L as timestamp) AS #1, none#1, cast(none#2L as timestamp) AS #2, cast(none#2L as timestamp) AS #3, cast(none#2L as timestamp) AS #4, none#2L]

[2021-07-19T08:05:38.205Z]                                                               +- Exchange RoundRobinPartitioning(1), false, [id=#90133]

[2021-07-19T08:05:38.205Z]                                                                  +- *(1) ColumnarToRow

[2021-07-19T08:05:38.205Z]                                                                     +- FileScan orc [none#0L,none#1,none#2L,none#3] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/home/jenkins/agent/workspace/jenkins-rapids_nightly-dev-github-266/tests/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<uid:bigint,uname:string,datelong:bigint,dollars:int>

[2021-07-19T08:05:38.205Z]   

[2021-07-19T08:05:38.205Z]   GPU plan: GpuColumnarToRow false

[2021-07-19T08:05:38.205Z]   +- !GpuProject [none#5L, none#8, none#3, none#6L, none#7, none#4L]

[2021-07-19T08:05:38.205Z]      +- !GpuWindow [none#0, none#1L, none#2, none#3, none#4L, none#5L, none#6L, none#7, gpumin(none#0) gpuwindowspecdefinition(none#1L, none#2 DESC NULLS LAST, gpuspecifiedwindowframe(RangeFrame, -2 days, 3 days)) AS #0], [none#1L], [none#2 DESC NULLS LAST]

[2021-07-19T08:05:38.205Z]         +- GpuCoalesceBatches batchedbykey(none#1L ASC NULLS FIRST)

[2021-07-19T08:05:38.205Z]            +- GpuSort [none#1L ASC NULLS FIRST, none#2 DESC NULLS LAST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]               +- !GpuProject [none#0, none#1L, none#2, none#4, none#5L, none#6L, none#7L, none#8]

[2021-07-19T08:05:38.205Z]                  +- !GpuRunningWindow [none#0, none#1L, none#2, none#3, none#4, none#5L, none#6L, none#7L, gpurownumber() gpuwindowspecdefinition(none#1L, none#3 DESC NULLS LAST, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(currentrow$()))) AS #0], [none#1L], [none#3 DESC NULLS LAST]

[2021-07-19T08:05:38.205Z]                     +- !GpuSort [none#1L ASC NULLS FIRST, none#3 DESC NULLS LAST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]                        +- GpuShuffleCoalesce 2147483647

[2021-07-19T08:05:38.205Z]                           +- GpuColumnarExchange gpuhashpartitioning(none#1L, 2), true, [id=#90664]

[2021-07-19T08:05:38.205Z]                              +- !GpuProject [none#0, none#1L, none#2, none#5, none#6, none#7L, none#8L, none#9L]

[2021-07-19T08:05:38.205Z]                                 +- !GpuWindow [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#7L, none#8L, gpucount(1) gpuwindowspecdefinition(none#3, none#4 DESC NULLS LAST, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(unboundedfollowing$()))) AS #0L], [none#3], [none#4 DESC NULLS LAST]

[2021-07-19T08:05:38.205Z]                                    +- GpuCoalesceBatches batchedbykey(none#3 ASC NULLS FIRST)

[2021-07-19T08:05:38.205Z]                                       +- GpuSort [none#3 ASC NULLS FIRST, none#4 DESC NULLS LAST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]                                          +- GpuShuffleCoalesce 2147483647

[2021-07-19T08:05:38.205Z]                                             +- GpuColumnarExchange gpuhashpartitioning(none#3, 2), true, [id=#90653]

[2021-07-19T08:05:38.205Z]                                                +- !GpuProject [none#0, none#1L, none#3, none#4, none#5, none#6, none#7, none#8L, none#9L]

[2021-07-19T08:05:38.205Z]                                                   +- !GpuWindow [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#7, none#8L, gpusum(none#9L, LongType) gpuwindowspecdefinition(none#1L, none#2 ASC NULLS FIRST, gpuspecifiedwindowframe(RangeFrame, -2 days, gpuspecialframeboundary(currentrow$()))) AS #0L], [none#1L], [none#2 ASC NULLS FIRST]

[2021-07-19T08:05:38.205Z]                                                      +- GpuCoalesceBatches batchedbykey(none#1L ASC NULLS FIRST)

[2021-07-19T08:05:38.205Z]                                                         +- !GpuProject [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#7, none#8L, cast(none#0 as bigint) AS #0L]

[2021-07-19T08:05:38.205Z]                                                            +- GpuSort [none#1L ASC NULLS FIRST, none#2 ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]                                                               +- !GpuProject [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#8, none#9L]

[2021-07-19T08:05:38.205Z]                                                                  +- !GpuWindow [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#7L, none#8, gpumax(none#7L) gpuwindowspecdefinition(none#1L, none#7L ASC NULLS FIRST, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(currentrow$()), gpuspecialframeboundary(unboundedfollowing$()))) AS #0L], [none#1L], [none#7L ASC NULLS FIRST]

[2021-07-19T08:05:38.205Z]                                                                     +- GpuCoalesceBatches batchedbykey(none#1L ASC NULLS FIRST)

[2021-07-19T08:05:38.205Z]                                                                        +- GpuSort [none#1L ASC NULLS FIRST, none#7L ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]                                                                           +- GpuShuffleCoalesce 2147483647

[2021-07-19T08:05:38.205Z]                                                                              +- GpuColumnarExchange gpuhashpartitioning(none#1L, 2), true, [id=#90632]

[2021-07-19T08:05:38.205Z]                                                                                 +- !GpuProject [none#0, none#1L, none#2, none#3, none#4, none#6, none#7, none#8L, none#9]

[2021-07-19T08:05:38.205Z]                                                                                    +- !GpuWindow [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#7, none#8L, gpumax(none#0) gpuwindowspecdefinition(none#4, none#5 ASC NULLS FIRST, gpuspecifiedwindowframe(RowFrame, -5, 5)) AS #0], [none#4], [none#5 ASC NULLS FIRST]

[2021-07-19T08:05:38.205Z]                                                                                       +- GpuCoalesceBatches batchedbykey(none#4 ASC NULLS FIRST)

[2021-07-19T08:05:38.205Z]                                                                                          +- !GpuSort [none#4 ASC NULLS FIRST, none#5 ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]                                                                                             +- GpuShuffleCoalesce 2147483647

[2021-07-19T08:05:38.205Z]                                                                                                +- GpuColumnarExchange gpuhashpartitioning(none#4, 2), true, [id=#90621]

[2021-07-19T08:05:38.205Z]                                                                                                   +- GpuProject [none#3, none#0L, cast(none#2L as timestamp) AS #0, cast(none#2L as timestamp) AS #1, none#1, cast(none#2L as timestamp) AS #2, cast(none#2L as timestamp) AS #3, cast(none#2L as timestamp) AS #4, none#2L]

[2021-07-19T08:05:38.205Z]                                                                                                      +- GpuShuffleCoalesce 2147483647

[2021-07-19T08:05:38.205Z]                                                                                                         +- GpuColumnarExchange gpuroundrobinpartitioning(1), false, [id=#90616]

[2021-07-19T08:05:38.205Z]                                                                                                            +- GpuFileGpuScan orc [none#0L,none#1,none#2L,none#3] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/home/jenkins/agent/workspace/jenkins-rapids_nightly-dev-github-266/tests/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<uid:bigint,uname:string,datelong:bigint,dollars:int> (SparkQueryCompareTestSuite.scala:367)


jlowe · 2021-07-20T22:37:36Z

I spent quite a bit of time today trying to track this down. The first step was to run the window test suite, but surprisingly it doesn't fail when running just those tests. It does fail when running all the tests. I was able to track the start of the failures to #2919, which is very odd since nothing in that PR has anything to do with canonicalization. However, the tests regularly fail after that PR and succeed before it.

Interestingly, if I revert just the test cases from that PR, canonicalization testing reliably passes again. So somehow state from previous tests is leaking into these tests. I've also seen odd things like CPU canonicalization failing but GPU working, or vice versa, on other window tests. It doesn't always fail in the same way, but it consistently fails in some way after the tests in #2919.

jlowe · 2021-07-21T15:52:59Z

The problem lies in the Spark 3.0.x logical optimizer. I verified that it is not deterministic in the order in which it processes windows that use the same range. This even bears out in the test failure output above; note that it's the CPU that is failing to canonicalize in this case:

[2021-07-19T08:05:38.204Z]   canonicalizationMatchesCpu=false != canonicalizationMatchesGpu=true

It turns out we are almost always failing to canonicalize these range queries on Spark 3.0.x, but the test wasn't failing before because both the CPU and GPU were failing to canonicalize, and the test compares the CPU canonicalization result against the GPU result to determine whether it fails. false == false, so it passed. Somehow #2919 was perturbing the logical optimizer so that it sometimes came up with the same plan twice in a row for these tests, and that's why it would fail. I verified that Spark 3.1+ always produces the same logical plan for a particular window query even if it has the same ranges.
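To make the masking effect concrete, here is a minimal sketch in Python (not the plugin's Scala test code). It assumes the check plans the same query twice on each side and compares the two canonicalized plans; the function names and the tuple-of-strings plan model are hypothetical, made up purely for illustration.

```python
def canonicalization_matches(plan_a, plan_b):
    """True when two plannings of the same query yield the same plan."""
    return plan_a == plan_b


def check_passes(cpu_plans, gpu_plans):
    """Pass only when CPU and GPU agree on whether canonicalization matched.

    This compares the two booleans; it never requires either side to
    actually canonicalize successfully.
    """
    cpu_matches = canonicalization_matches(*cpu_plans)
    gpu_matches = canonicalization_matches(*gpu_plans)
    return cpu_matches == gpu_matches


# Hypothetical window orderings picked by a nondeterministic optimizer.
order_1 = ("sum_range", "min_range")
order_2 = ("min_range", "sum_range")

# Usual Spark 3.0.x case: both sides see two different orderings,
# so the check reduces to False == False and passes.
both_differ = check_passes((order_1, order_2), (order_1, order_2))

# Case from this issue: the GPU side happens to get the same plan twice,
# so the check reduces to False == True and fails.
cpu_only_differs = check_passes((order_1, order_2), (order_1, order_1))

# Spark 3.1+ behavior: the ordering is stable on both sides (True == True).
both_stable = check_passes((order_1, order_1), (order_1, order_1))
```

The point of the sketch is that a passing check says nothing about whether canonicalization works, only that both sides behaved the same way on that run.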

pxLi added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels Jul 19, 2021

Salonijain27 removed the ? - Needs Triage (Need team to review and classify) label Jul 20, 2021

Salonijain27 added this to the July 19 - July 30 milestone Jul 20, 2021

jlowe self-assigned this Jul 20, 2021

jlowe mentioned this issue Jul 21, 2021

Avoid comparing window range canonicalized plans on Spark 3.0.x #2984

Merged

pxLi closed this as completed in #2984 Jul 22, 2021
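
The title of #2984 suggests the fix skips the canonicalized-plan comparison for range-window queries on Spark 3.0.x. The PR's actual code is not shown in this thread, so the Python sketch below is only a guess at the shape of such a guard; `should_compare_canonicalized`, `parse_major_minor`, and their parameters are hypothetical names.

```python
def parse_major_minor(spark_version):
    """Extract (major, minor) from a version string such as '3.0.3'."""
    major, minor = spark_version.split(".")[:2]
    return int(major), int(minor)


def should_compare_canonicalized(spark_version, has_range_window):
    """Hypothetical guard: skip the comparison where plans are unstable.

    Assumption: only range-window queries on Spark versions before 3.1
    are affected, as described in the comments above.
    """
    if has_range_window and parse_major_minor(spark_version) < (3, 1):
        return False
    return True
```

With this guard, a range-window query on 3.0.3 would skip the canonicalization comparison, while the same query on 3.1.x, or a query without range windows on 3.0.x, would still be checked.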
