
[BUG] Investigate q32 and q67 for decimals potential regression #4290

Closed
abellina opened this issue Dec 4, 2021 · 7 comments
Assignees
Labels
bug Something isn't working P1 Nice to have for release performance A performance related task/issue

Comments

@abellina
Collaborator

abellina commented Dec 4, 2021

This week every decimal query was faster except for q32 and q67:

q32: 1590 0.76x (5029 ms - 6619 ms)
q67: 13475 0.90x (122125 ms - 135600 ms)
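The figures above appear to read as: absolute slowdown in ms, speedup ratio (previous week over current week), and the two wall-clock times. A quick sanity check of that reading (assumption: previous-week time listed first, current-week time second):

```python
# Sanity-check the reported regression numbers for q32 and q67.
# Assumption: the format is "<delta ms> <ratio>x (<previous ms> - <current ms>)".
queries = {
    "q32": (5029, 6619),        # previous ms, current ms
    "q67": (122125, 135600),
}

for name, (prev, curr) in queries.items():
    delta = curr - prev         # absolute slowdown in ms
    ratio = prev / curr         # "speedup" ratio; < 1.0 means a regression
    print(f"{name}: {delta} {ratio:.2f}x ({prev} ms - {curr} ms)")
```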

This was on spark2a with Spark 3.1.1 and decimals enabled, using a cuDF jar containing rapidsai/cudf@69e6dbb plus rapids-4-spark_2.12-21.12.0-20211201.153412-67.jar, compared against the previous week, which saw the performance regression for q9, q28, and others.

We need to run these queries in a loop and determine whether these diffs are noise or a real, sustained problem.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify performance A performance related task/issue P0 Must have for release labels Dec 4, 2021
@abellina abellina changed the title [BUG] Investigate q32 and q67 for decimals [BUG] Investigate q32 and q67 for decimals potential regression Dec 4, 2021
@jbrennan333
Contributor

As another data point, I have been noticing inconsistency in queryTimes for q67 since as early as 11/30 (this is without decimals). I just ran it several times in a row on spark2a with cudf/rapids built from the branch-22.02 head yesterday:

tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q67-1638644612126.json:  "queryTimes" : [ 59744 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q67-1638644837370.json:  "queryTimes" : [ 43945 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q67-1638644924155.json:  "queryTimes" : [ 57696 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q67-1638645090748.json:  "queryTimes" : [ 60557 ],

Q32 also shows similar inconsistency:

tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q32-1638645630879.json:  "queryTimes" : [ 5266 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q32-1638645673465.json:  "queryTimes" : [ 4773 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q32-1638645712862.json:  "queryTimes" : [ 4380 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q32-1638645781881.json:  "queryTimes" : [ 5083 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q32-1638645823618.json:  "queryTimes" : [ 4502 ],
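Taken together, the runs above swing noticeably. A small script to quantify the spread of the quoted queryTimes (values copied directly from the runs above):

```python
from statistics import mean, pstdev

# queryTimes (ms) copied from the runs above
q67 = [59744, 43945, 57696, 60557]
q32 = [5266, 4773, 4380, 5083, 4502]

for name, times in (("q67", q67), ("q32", q32)):
    spread = max(times) - min(times)
    cv = pstdev(times) / mean(times)    # coefficient of variation
    print(f"{name}: min-max spread {spread} ms, CV {cv:.1%}")
```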

@Salonijain27 Salonijain27 added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify P0 Must have for release labels Dec 7, 2021
@jlowe
Member

jlowe commented Jan 13, 2022

Are there any updates or is this no longer relevant?

@abellina
Collaborator Author

This is still relevant, especially for q67. q32 is very small: it spends most of its kernel time in the Parquet scan (~90% of kernel time), but that is a single scanned batch per executor thread, so in its case we are probably seeing overheads that cause the variation.

q67 is falling back to the sort aggregate given the DECIMAL128 data type, and the sort aggregate is taking ~90%+ of the time:

[Screenshot: profile from 2022-01-15 09-23-44]

Why is it variable? I am not 100% sure, but I would imagine some scheduling effect is at play, either at the semaphore level or within Spark. That is just speculation.

@jbrennan333 jbrennan333 self-assigned this Jan 19, 2022
@jbrennan333
Contributor

For q67 with decimals, this is the aggregation that takes so much time:

(28) GpuHashAggregate
Input [11]: [ss_quantity#139, ss_sales_price#142, i_category#864, i_class#865, i_brand#866, i_product_name#867, d_year#868, d_qoy#869, d_moy#870, s_store_id#871, spark_grouping_id#863L]
Keys [9]: [i_category#864, i_class#865, i_brand#866, i_product_name#867, d_year#868, d_qoy#869, d_moy#870, s_store_id#871, spark_grouping_id#863L]
Functions [1]: [partial_gpusum(gpucoalesce((ss_sales_price#142 * cast(ss_quantity#139 as decimal(10,0))), 0.00), DecimalType(28,2), false, false)]
Aggregate Attributes [2]: [sum#912, isEmpty#913]
Results [11]: [i_category#864, i_class#865, i_brand#866, i_product_name#867, d_year#868, d_qoy#869, d_moy#870, s_store_id#871, spark_grouping_id#863L, sum#914, isEmpty#915]
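Logically, that partial aggregation is SUM(COALESCE(ss_sales_price * CAST(ss_quantity AS DECIMAL(10,0)), 0.00)) grouped by the nine keys. A minimal pure-Python model of the semantics, with made-up sample rows and only one grouping key for illustration:

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical rows: (i_category, ss_sales_price, ss_quantity); None models SQL NULL.
rows = [
    ("Books", Decimal("19.99"), 3),
    ("Books", None, 5),              # NULL product -> COALESCE substitutes 0.00
    ("Music", Decimal("7.50"), 2),
]

sums = defaultdict(lambda: Decimal("0.00"))
for category, price, qty in rows:
    product = price * qty if price is not None and qty is not None else None
    sums[category] += product if product is not None else Decimal("0.00")

print(dict(sums))
```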

As an experiment to see whether the kernel becomes more efficient if we feed it fewer, larger batches, I tried putting a coalesce-batches step between the GpuExpand and the GpuHashAggregate.

Original:
[Screenshot: q67-spark2a-orig]

With the coalesce:

[Screenshot: q67-spark2a-coalesce]

@jbrennan333
Contributor

GpuHashAggregate is falling back to a sort because there is no atomic operation for DECIMAL128. @revans2 suggested that for SUMs we might be able to work around this limitation by doing the sum in four 32-bit parts and then putting the results back together. The operation would use more memory.
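A host-side sketch of that idea (a model only, not the actual CUDA implementation): view each 128-bit value as four 32-bit limbs, accumulate each limb independently into a wide counter (the point at which hardware atomics become usable), then recombine with shifts at the end:

```python
LIMBS, LIMB_BITS = 4, 32
MASK128 = (1 << 128) - 1

def sum128_by_limbs(values):
    """Sum signed 128-bit ints by accumulating four 32-bit limbs separately."""
    limb_sums = [0] * LIMBS             # each would be a wide atomic counter on GPU
    for v in values:
        u = v & MASK128                 # two's-complement view of the value
        for i in range(LIMBS):
            limb_sums[i] += (u >> (LIMB_BITS * i)) & 0xFFFFFFFF
    # Recombine: shift each limb sum into place; carries resolve naturally.
    total = sum(s << (LIMB_BITS * i) for i, s in enumerate(limb_sums)) & MASK128
    # Reinterpret as a signed 128-bit result.
    return total - (1 << 128) if total >= (1 << 127) else total

values = [123_456_789_012_345_678_901_234_567_890, -987_654_321, 42]
assert sum128_by_limbs(values) == sum(values)
```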

@jlowe jlowe assigned jlowe and unassigned jbrennan333 Jan 28, 2022
@jlowe
Member

jlowe commented Feb 8, 2022

#4688 should address the queries with DECIMAL128 sum aggregations causing a fallback to a sort-based aggregation. I don't know if there are queries falling back due to DECIMAL128 average aggregations, which is tracked by #4722.

@jlowe
Member

jlowe commented Feb 11, 2022

Closing since the regression in Q67 was addressed by #4688 and Q32 diffs appear to be in the noise.

@jlowe jlowe closed this as completed Feb 11, 2022
4 participants