
[BUG] Investigate q32 and q67 for decimals potential regression #4290

Closed
abellina opened this issue Dec 4, 2021 · 7 comments
Assignees
Labels
bug Something isn't working P1 Nice to have for release performance A performance related task/issue

Comments

@abellina
Collaborator

abellina commented Dec 4, 2021

This week every decimal query was faster except for q32 and q67:

q32: 1590 0.76x (5029 ms - 6619 ms)
q67: 13475 0.90x (122125 ms - 135600 ms)
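The figures above appear to read as: absolute slowdown in ms, speedup ratio (previous week over current week), and the two wall-clock times. A quick sanity check of that reading (assumption: previous-week time listed first, current-week time second):

```python
# Sanity-check the reported regression numbers for q32 and q67.
# Assumption: the format is "<delta ms> <ratio>x (<previous ms> - <current ms>)".
queries = {
    "q32": (5029, 6619),        # previous ms, current ms
    "q67": (122125, 135600),
}

for name, (prev, curr) in queries.items():
    delta = curr - prev         # absolute slowdown in ms
    ratio = prev / curr         # "speedup" ratio; < 1.0 means a regression
    print(f"{name}: {delta} {ratio:.2f}x ({prev} ms - {curr} ms)")
```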

This was on spark2a with Spark 3.1.1 and decimals enabled, using a cuDF jar containing rapidsai/cudf@69e6dbb plus rapids-4-spark_2.12-21.12.0-20211201.153412-67.jar, compared against the previous week, which saw the performance regression for q9, q28, and others.

We need to run these queries in a loop and determine whether these diffs are noise or a real, sustained problem.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify performance A performance related task/issue P0 Must have for release labels Dec 4, 2021
@abellina abellina changed the title [BUG] Investigate q32 and q67 for decimals [BUG] Investigate q32 and q67 for decimals potential regression Dec 4, 2021
@jbrennan333
Contributor

As another data point, I have been noticing inconsistency in queryTimes for q67 since as early as 11/30 (this is without decimals). I just ran it several times in a row on spark2a with cudf/rapids built from the branch-22.02 head yesterday:

tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q67-1638644612126.json:  "queryTimes" : [ 59744 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q67-1638644837370.json:  "queryTimes" : [ 43945 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q67-1638644924155.json:  "queryTimes" : [ 57696 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q67-1638645090748.json:  "queryTimes" : [ 60557 ],

Q32 also shows similar inconsistency:

tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q32-1638645630879.json:  "queryTimes" : [ 5266 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q32-1638645673465.json:  "queryTimes" : [ 4773 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q32-1638645712862.json:  "queryTimes" : [ 4380 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q32-1638645781881.json:  "queryTimes" : [ 5083 ],
tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q32-1638645823618.json:  "queryTimes" : [ 4502 ],
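Taken together, the runs above swing noticeably. A small script to quantify the spread of the quoted queryTimes (values copied directly from the runs above):

```python
from statistics import mean, pstdev

# queryTimes (ms) copied from the runs above
q67 = [59744, 43945, 57696, 60557]
q32 = [5266, 4773, 4380, 5083, 4502]

for name, times in (("q67", q67), ("q32", q32)):
    spread = max(times) - min(times)
    cv = pstdev(times) / mean(times)    # coefficient of variation
    print(f"{name}: min-max spread {spread} ms, CV {cv:.1%}")
```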

@Salonijain27 Salonijain27 added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify P0 Must have for release labels Dec 7, 2021
@jlowe
Member

jlowe commented Jan 13, 2022

Are there any updates or is this no longer relevant?

@abellina
Collaborator Author

This is still relevant, especially for q67. q32 is very small: it spends most of its kernel time in the Parquet scan (~90% of kernel time), but that is a single scanned batch per executor thread, so in its case we are probably seeing overheads that cause the variation.

q67 is falling back to the sort aggregate given the DECIMAL128 data type, and the sort aggregate is taking ~90%+ of the time:

[Screenshot: profile from 2022-01-15 09-23-44]

Why is it variable? I am not 100% sure, but I would imagine some scheduling effect is at play, either at the semaphore level or within Spark. That is just speculation.

@jbrennan333 jbrennan333 self-assigned this Jan 19, 2022
@jbrennan333
Contributor

For q67 with decimals, this is the aggregation that takes so much time:

(28) GpuHashAggregate
Input [11]: [ss_quantity#139, ss_sales_price#142, i_category#864, i_class#865, i_brand#866, i_product_name#867, d_year#868, d_qoy#869, d_moy#870, s_store_id#871, spark_grouping_id#863L]
Keys [9]: [i_category#864, i_class#865, i_brand#866, i_product_name#867, d_year#868, d_qoy#869, d_moy#870, s_store_id#871, spark_grouping_id#863L]
Functions [1]: [partial_gpusum(gpucoalesce((ss_sales_price#142 * cast(ss_quantity#139 as decimal(10,0))), 0.00), DecimalType(28,2), false, false)]
Aggregate Attributes [2]: [sum#912, isEmpty#913]
Results [11]: [i_category#864, i_class#865, i_brand#866, i_product_name#867, d_year#868, d_qoy#869, d_moy#870, s_store_id#871, spark_grouping_id#863L, sum#914, isEmpty#915]
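Logically, that partial aggregation is SUM(COALESCE(ss_sales_price * CAST(ss_quantity AS DECIMAL(10,0)), 0.00)) grouped by the nine keys. A minimal pure-Python model of the semantics, with made-up sample rows and only one grouping key for illustration:

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical rows: (i_category, ss_sales_price, ss_quantity); None models SQL NULL.
rows = [
    ("Books", Decimal("19.99"), 3),
    ("Books", None, 5),              # NULL product -> COALESCE substitutes 0.00
    ("Music", Decimal("7.50"), 2),
]

sums = defaultdict(lambda: Decimal("0.00"))
for category, price, qty in rows:
    product = price * qty if price is not None and qty is not None else None
    sums[category] += product if product is not None else Decimal("0.00")

print(dict(sums))
```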

As an experiment to see whether the kernel becomes more efficient if we feed it fewer, larger batches, I tried putting a coalesce-batches step between the GpuExpand and the GpuHashAggregate.

Original:
[Screenshot: q67-spark2a-orig]

With the coalesce:

[Screenshot: q67-spark2a-coalesce]

@jbrennan333
Contributor

GpuHashAggregate is falling back to a sort because there is no atomic operation for DECIMAL128. @revans2 suggested that for SUMs we might be able to work around this limitation by doing the sum in four 32-bit parts and then putting the results back together. The operation would use more memory.
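A host-side sketch of that idea (a model only, not the actual CUDA implementation): view each 128-bit value as four 32-bit limbs, accumulate each limb independently into a wide counter (the point at which hardware atomics become usable), then recombine with shifts at the end:

```python
LIMBS, LIMB_BITS = 4, 32
MASK128 = (1 << 128) - 1

def sum128_by_limbs(values):
    """Sum signed 128-bit ints by accumulating four 32-bit limbs separately."""
    limb_sums = [0] * LIMBS             # each would be a wide atomic counter on GPU
    for v in values:
        u = v & MASK128                 # two's-complement view of the value
        for i in range(LIMBS):
            limb_sums[i] += (u >> (LIMB_BITS * i)) & 0xFFFFFFFF
    # Recombine: shift each limb sum into place; carries resolve naturally.
    total = sum(s << (LIMB_BITS * i) for i, s in enumerate(limb_sums)) & MASK128
    # Reinterpret as a signed 128-bit result.
    return total - (1 << 128) if total >= (1 << 127) else total

values = [123_456_789_012_345_678_901_234_567_890, -987_654_321, 42]
assert sum128_by_limbs(values) == sum(values)
```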

@jlowe jlowe assigned jlowe and unassigned jbrennan333 Jan 28, 2022
@jlowe
Member

jlowe commented Feb 8, 2022

#4688 should address the queries with DECIMAL128 sum aggregations causing a fallback to a sort-based aggregation. I don't know if there are queries falling back due to DECIMAL128 average aggregations, which is tracked by #4722.

@jlowe
Member

jlowe commented Feb 11, 2022

Closing since the regression in Q67 was addressed by #4688 and Q32 diffs appear to be in the noise.

@jlowe jlowe closed this as completed Feb 11, 2022
4 participants