[BUG] IGNORE ORDER, WITH DECIMALS: [Window] [MIXED WINDOW SPECS] FAILED in spark 3.0.3+ #2964

Closed
pxLi opened this issue Jul 19, 2021 · 2 comments · Fixed by #2984
Labels
bug Something isn't working

Comments

@pxLi
Collaborator

pxLi commented Jul 19, 2021

The window function unit test fails on Spark 3.0.3+ (3.1.x is not affected).
To reproduce:

mvn test -Pspark303tests,snapshot-shims
or
mvn test -Pspark304tests,snapshot-shims

The error:

[2021-07-19T08:05:38.204Z] - IGNORE ORDER, WITH DECIMALS: [Window] [MIXED WINDOW SPECS]  *** FAILED ***

[2021-07-19T08:05:38.204Z]   canonicalizationMatchesCpu=false != canonicalizationMatchesGpu=true

[2021-07-19T08:05:38.204Z]   CPU plan: *(10) Project [none#4L, none#5, none#2, none#3L, none#6, none#7L]

[2021-07-19T08:05:38.204Z]   +- !Window [max(none#1L) windowspecdefinition(none#0L, none#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, currentrow$(), unboundedfollowing$())) AS #0L], [none#0L], [none#1L ASC NULLS FIRST]

[2021-07-19T08:05:38.204Z]      +- *(9) !Sort [none#0L ASC NULLS FIRST, none#1L ASC NULLS FIRST], false, 0

[2021-07-19T08:05:38.204Z]         +- *(9) !Project [none#0L, none#2L, none#3, none#4L, none#5L, none#6, none#7]

[2021-07-19T08:05:38.204Z]            +- !Window [row_number() windowspecdefinition(none#0L, none#1 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS #0], [none#0L], [none#1 DESC NULLS LAST]

[2021-07-19T08:05:38.204Z]               +- *(8) !Sort [none#0L ASC NULLS FIRST, none#1 DESC NULLS LAST], false, 0

[2021-07-19T08:05:38.204Z]                  +- *(8) Project [none#1L, none#3, none#4L, none#5, none#6L, none#7L, none#8]

[2021-07-19T08:05:38.204Z]                     +- !Window [min(none#0) windowspecdefinition(none#1L, none#2 DESC NULLS LAST, specifiedwindowframe(RangeFrame, -2 days, 3 days)) AS #0], [none#1L], [none#2 DESC NULLS LAST]

[2021-07-19T08:05:38.204Z]                        +- *(7) !Sort [none#1L ASC NULLS FIRST, none#2 DESC NULLS LAST], false, 0

[2021-07-19T08:05:38.204Z]                           +- *(7) !Project [none#0, none#1L, none#3, none#4, none#5L, none#6, none#7L, none#8L]

[2021-07-19T08:05:38.204Z]                              +- Window [sum(cast(none#0 as bigint)) windowspecdefinition(none#1L, none#2 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -2 days, currentrow$())) AS #0L], [none#1L], [none#2 ASC NULLS FIRST]

[2021-07-19T08:05:38.204Z]                                 +- *(6) Sort [none#1L ASC NULLS FIRST, none#2 ASC NULLS FIRST], false, 0

[2021-07-19T08:05:38.204Z]                                    +- Exchange hashpartitioning(none#1L, 2), true, [id=#90159]

[2021-07-19T08:05:38.204Z]                                       +- *(5) Project [none#0, none#1L, none#2, none#3, none#6, none#7L, none#8, none#9L]

[2021-07-19T08:05:38.204Z]                                          +- !Window [count(1) windowspecdefinition(none#4, none#5 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS #0L], [none#4], [none#5 DESC NULLS LAST]

[2021-07-19T08:05:38.204Z]                                             +- *(4) !Sort [none#4 ASC NULLS FIRST, none#5 DESC NULLS LAST], false, 0

[2021-07-19T08:05:38.204Z]                                                +- *(4) !Project [none#0, none#1L, none#2, none#3, none#4, none#6, none#7, none#8L, none#9]

[2021-07-19T08:05:38.204Z]                                                   +- !Window [max(none#0) windowspecdefinition(none#4, none#5 ASC NULLS FIRST, specifiedwindowframe(RowFrame, -5, 5)) AS #0], [none#4], [none#5 ASC NULLS FIRST]

[2021-07-19T08:05:38.204Z]                                                      +- *(3) !Sort [none#4 ASC NULLS FIRST, none#5 ASC NULLS FIRST], false, 0

[2021-07-19T08:05:38.204Z]                                                         +- Exchange hashpartitioning(none#4, 2), true, [id=#90139]

[2021-07-19T08:05:38.205Z]                                                            +- *(2) Project [none#3, none#0L, cast(none#2L as timestamp) AS #0, cast(none#2L as timestamp) AS #1, none#1, cast(none#2L as timestamp) AS #2, cast(none#2L as timestamp) AS #3, cast(none#2L as timestamp) AS #4, none#2L]

[2021-07-19T08:05:38.205Z]                                                               +- Exchange RoundRobinPartitioning(1), false, [id=#90133]

[2021-07-19T08:05:38.205Z]                                                                  +- *(1) ColumnarToRow

[2021-07-19T08:05:38.205Z]                                                                     +- FileScan orc [none#0L,none#1,none#2L,none#3] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/home/jenkins/agent/workspace/jenkins-rapids_nightly-dev-github-266/tests/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<uid:bigint,uname:string,datelong:bigint,dollars:int>

[2021-07-19T08:05:38.205Z]   

[2021-07-19T08:05:38.205Z]   GPU plan: GpuColumnarToRow false

[2021-07-19T08:05:38.205Z]   +- !GpuProject [none#5L, none#8, none#3, none#6L, none#7, none#4L]

[2021-07-19T08:05:38.205Z]      +- !GpuWindow [none#0, none#1L, none#2, none#3, none#4L, none#5L, none#6L, none#7, gpumin(none#0) gpuwindowspecdefinition(none#1L, none#2 DESC NULLS LAST, gpuspecifiedwindowframe(RangeFrame, -2 days, 3 days)) AS #0], [none#1L], [none#2 DESC NULLS LAST]

[2021-07-19T08:05:38.205Z]         +- GpuCoalesceBatches batchedbykey(none#1L ASC NULLS FIRST)

[2021-07-19T08:05:38.205Z]            +- GpuSort [none#1L ASC NULLS FIRST, none#2 DESC NULLS LAST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]               +- !GpuProject [none#0, none#1L, none#2, none#4, none#5L, none#6L, none#7L, none#8]

[2021-07-19T08:05:38.205Z]                  +- !GpuRunningWindow [none#0, none#1L, none#2, none#3, none#4, none#5L, none#6L, none#7L, gpurownumber() gpuwindowspecdefinition(none#1L, none#3 DESC NULLS LAST, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(currentrow$()))) AS #0], [none#1L], [none#3 DESC NULLS LAST]

[2021-07-19T08:05:38.205Z]                     +- !GpuSort [none#1L ASC NULLS FIRST, none#3 DESC NULLS LAST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]                        +- GpuShuffleCoalesce 2147483647

[2021-07-19T08:05:38.205Z]                           +- GpuColumnarExchange gpuhashpartitioning(none#1L, 2), true, [id=#90664]

[2021-07-19T08:05:38.205Z]                              +- !GpuProject [none#0, none#1L, none#2, none#5, none#6, none#7L, none#8L, none#9L]

[2021-07-19T08:05:38.205Z]                                 +- !GpuWindow [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#7L, none#8L, gpucount(1) gpuwindowspecdefinition(none#3, none#4 DESC NULLS LAST, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(unboundedfollowing$()))) AS #0L], [none#3], [none#4 DESC NULLS LAST]

[2021-07-19T08:05:38.205Z]                                    +- GpuCoalesceBatches batchedbykey(none#3 ASC NULLS FIRST)

[2021-07-19T08:05:38.205Z]                                       +- GpuSort [none#3 ASC NULLS FIRST, none#4 DESC NULLS LAST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]                                          +- GpuShuffleCoalesce 2147483647

[2021-07-19T08:05:38.205Z]                                             +- GpuColumnarExchange gpuhashpartitioning(none#3, 2), true, [id=#90653]

[2021-07-19T08:05:38.205Z]                                                +- !GpuProject [none#0, none#1L, none#3, none#4, none#5, none#6, none#7, none#8L, none#9L]

[2021-07-19T08:05:38.205Z]                                                   +- !GpuWindow [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#7, none#8L, gpusum(none#9L, LongType) gpuwindowspecdefinition(none#1L, none#2 ASC NULLS FIRST, gpuspecifiedwindowframe(RangeFrame, -2 days, gpuspecialframeboundary(currentrow$()))) AS #0L], [none#1L], [none#2 ASC NULLS FIRST]

[2021-07-19T08:05:38.205Z]                                                      +- GpuCoalesceBatches batchedbykey(none#1L ASC NULLS FIRST)

[2021-07-19T08:05:38.205Z]                                                         +- !GpuProject [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#7, none#8L, cast(none#0 as bigint) AS #0L]

[2021-07-19T08:05:38.205Z]                                                            +- GpuSort [none#1L ASC NULLS FIRST, none#2 ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]                                                               +- !GpuProject [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#8, none#9L]

[2021-07-19T08:05:38.205Z]                                                                  +- !GpuWindow [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#7L, none#8, gpumax(none#7L) gpuwindowspecdefinition(none#1L, none#7L ASC NULLS FIRST, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(currentrow$()), gpuspecialframeboundary(unboundedfollowing$()))) AS #0L], [none#1L], [none#7L ASC NULLS FIRST]

[2021-07-19T08:05:38.205Z]                                                                     +- GpuCoalesceBatches batchedbykey(none#1L ASC NULLS FIRST)

[2021-07-19T08:05:38.205Z]                                                                        +- GpuSort [none#1L ASC NULLS FIRST, none#7L ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]                                                                           +- GpuShuffleCoalesce 2147483647

[2021-07-19T08:05:38.205Z]                                                                              +- GpuColumnarExchange gpuhashpartitioning(none#1L, 2), true, [id=#90632]

[2021-07-19T08:05:38.205Z]                                                                                 +- !GpuProject [none#0, none#1L, none#2, none#3, none#4, none#6, none#7, none#8L, none#9]

[2021-07-19T08:05:38.205Z]                                                                                    +- !GpuWindow [none#0, none#1L, none#2, none#3, none#4, none#5, none#6, none#7, none#8L, gpumax(none#0) gpuwindowspecdefinition(none#4, none#5 ASC NULLS FIRST, gpuspecifiedwindowframe(RowFrame, -5, 5)) AS #0], [none#4], [none#5 ASC NULLS FIRST]

[2021-07-19T08:05:38.205Z]                                                                                       +- GpuCoalesceBatches batchedbykey(none#4 ASC NULLS FIRST)

[2021-07-19T08:05:38.205Z]                                                                                          +- !GpuSort [none#4 ASC NULLS FIRST, none#5 ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@5d676a02

[2021-07-19T08:05:38.205Z]                                                                                             +- GpuShuffleCoalesce 2147483647

[2021-07-19T08:05:38.205Z]                                                                                                +- GpuColumnarExchange gpuhashpartitioning(none#4, 2), true, [id=#90621]

[2021-07-19T08:05:38.205Z]                                                                                                   +- GpuProject [none#3, none#0L, cast(none#2L as timestamp) AS #0, cast(none#2L as timestamp) AS #1, none#1, cast(none#2L as timestamp) AS #2, cast(none#2L as timestamp) AS #3, cast(none#2L as timestamp) AS #4, none#2L]

[2021-07-19T08:05:38.205Z]                                                                                                      +- GpuShuffleCoalesce 2147483647

[2021-07-19T08:05:38.205Z]                                                                                                         +- GpuColumnarExchange gpuroundrobinpartitioning(1), false, [id=#90616]

[2021-07-19T08:05:38.205Z]                                                                                                            +- GpuFileGpuScan orc [none#0L,none#1,none#2L,none#3] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/home/jenkins/agent/workspace/jenkins-rapids_nightly-dev-github-266/tests/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<uid:bigint,uname:string,datelong:bigint,dollars:int> (SparkQueryCompareTestSuite.scala:367)
pxLi added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels on Jul 19, 2021
Salonijain27 removed the "? - Needs Triage" (Need team to review and classify) label on Jul 20, 2021
Salonijain27 added this to the July 19 - July 30 milestone on Jul 20, 2021
jlowe self-assigned this on Jul 20, 2021
@jlowe
Contributor

jlowe commented Jul 20, 2021

I spent quite a bit of time today trying to track this down. The first step was to run the window test suite on its own, but surprisingly it doesn't fail when running just those tests; it only fails when running all the tests. I tracked the start of the failures to #2919, which is very odd since nothing in that PR has anything to do with canonicalization. However, the tests regularly fail after that PR and succeed before it.

Interestingly, if I revert just the test cases from that PR, canonicalization testing reliably passes again, so somehow state from previous tests is leaking into these tests. I've also seen odd things on other window tests, like CPU canonicalization failing while GPU canonicalization works, or vice versa. It doesn't always fail in the same way, but it does consistently seem to fail in some way after the tests in #2919.

@jlowe
Contributor

jlowe commented Jul 21, 2021

The problem lies in the Spark 3.0.x logical optimizer. I verified that it is not deterministic in the order in which it processes windows that use the same range. This is even borne out in the test failure output above; note that it's the CPU that is failing to canonicalize in this case:

[2021-07-19T08:05:38.204Z]   canonicalizationMatchesCpu=false != canonicalizationMatchesGpu=true
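
To illustrate why a nondeterministic operator order defeats canonicalization, here is a toy model in Python. It is only a sketch, not Spark's actual implementation: the `canonicalize` function, the tuple-based plan representation, and the attribute names are all hypothetical. It assumes canonicalization renumbers attribute IDs but keeps the order of operators, so two plannings of the same query compare equal only if the windows come out in the same order.

```python
def canonicalize(plan):
    """Toy canonicalization (hypothetical, not Spark's algorithm).

    Renumbers attribute ids by order of first appearance, so that
    run-specific ids do not matter, but preserves operator order.
    """
    ids = {}
    canonical = []
    for func, attr in plan:
        ids.setdefault(attr, len(ids))
        canonical.append((func, f"none#{ids[attr]}"))
    return tuple(canonical)


# Two plannings of the same query; attribute ids differ between runs.
run1 = [("min", "dollars#10"), ("sum", "dollars#11")]
run2_same_order = [("min", "dollars#20"), ("sum", "dollars#21")]
# Same windows, but emitted in a different order by the "optimizer".
run2_swapped = [("sum", "dollars#21"), ("min", "dollars#20")]

print(canonicalize(run1) == canonicalize(run2_same_order))  # same order: match
print(canonicalize(run1) == canonicalize(run2_swapped))     # swapped: no match
```

In this model, differing IDs alone are harmless, but swapping two otherwise equivalent windows changes the canonical form.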

It turns out we are almost always failing to canonicalize these range queries on Spark 3.0.x, but the test wasn't failing before because both the CPU and the GPU were failing to canonicalize, and the test compares the two comparison results to decide whether it fails: false == false, so it passed. Somehow #2919 was perturbing the logical optimizer so that it sometimes comes up with the same plan twice in a row for these tests, and that's why the test would fail. I verified that Spark 3.1+ always produces the same logical plan for a particular window query, even if it has the same ranges.
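
The masking effect can be sketched in a few lines of Python. This is a simplified model of the check described in this comment, not the test suite's actual code; the function name `canonicalization_check_passes` is hypothetical, and the two flags mirror `canonicalizationMatchesCpu` and `canonicalizationMatchesGpu` from the log.

```python
def canonicalization_check_passes(cpu_matches: bool, gpu_matches: bool) -> bool:
    """Model of the check: the test only requires CPU and GPU to agree.

    Each flag says whether two plannings of the same query canonicalized
    to the same plan on that side. The test does not require either flag
    to be True, only that both have the same value.
    """
    return cpu_matches == gpu_matches


# Enumerate all four combinations to show which ones the test flags.
for cpu in (False, True):
    for gpu in (False, True):
        verdict = "passes" if canonicalization_check_passes(cpu, gpu) else "FAILS"
        print(f"cpu={cpu!s:5} gpu={gpu!s:5} -> test {verdict}")
```

Under this model, the case where both sides fail to canonicalize (`False`, `False`) passes silently, while the reported case (`False`, `True`) is flagged.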

3 participants