[BUG] Investigate q32 and q67 for decimals potential regression #4290
As another point of information, I have been noticing inconsistency in queryTimes for q67 as early as 11/30 (this is without decimals):
Q32 also shows similar inconsistency:
Are there any updates, or is this no longer relevant?
This is still relevant, especially for q67. q32 is very small: it spends most of its kernel time (~90%) in the Parquet scan, but that is a single scanned batch per executor thread, so in this case we are probably seeing overheads cause the variation. q67 is falling back to the sort aggregate because of the decimal-128 data type, and the sort aggregate takes 90%+ of the time. Why is it variable? I am not 100% sure, but I would imagine some scheduling effect, either at the semaphore level or within Spark, is at play. That is just speculation.
For q67 with decimals, this is the aggregation that takes the most time:
As an experiment to see whether the kernel becomes more efficient when we feed it fewer, larger batches, I tried putting a coalesce-batches step between the GpuExpand and the GpuHashAggregate. With the coalesce:
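A rough sketch of what such a coalesce step does (hypothetical Python for illustration, not the actual GpuCoalesceBatches implementation): small upstream batches are greedily merged until each merged batch reaches a target row count, so the downstream aggregate sees fewer, larger inputs and pays less fixed per-batch overhead.

```python
def coalesce_batches(batches, target_rows):
    """Greedily merge small batches until each merged batch has at least
    target_rows rows; a trailing partial batch is emitted as-is.

    Hypothetical sketch of a coalesce step placed between an expand and
    a hash aggregate; names and behavior are illustrative only.
    """
    out, current = [], []
    for batch in batches:
        current.extend(batch)
        if len(current) >= target_rows:
            out.append(current)
            current = []
    if current:
        out.append(current)
    return out
```

The trade-off is extra copying and higher peak memory per batch in exchange for fewer kernel launches per aggregated row.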
GpuHashAggregate is falling back to a sort because there is no atomic operation for decimal-128. @revans2 suggested that for SUMs we might be able to get around this limitation by doing the sum in four 32-bit parts and then putting the results back together. The operation would use more memory.
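The limb-decomposition idea can be sketched as follows (a hypothetical Python model, not the actual cuDF kernel): split each 128-bit value into four 32-bit limbs, sum each limb column independently into a wider accumulator (which is where 32-bit atomics would apply on the GPU), and propagate carries once at the end.

```python
MASK32 = (1 << 32) - 1

def split_limbs(x: int) -> list[int]:
    """Split a non-negative 128-bit integer into four 32-bit limbs,
    least significant limb first."""
    return [(x >> (32 * i)) & MASK32 for i in range(4)]

def sum_decimal128(values: list[int]) -> int:
    """Sum non-negative 128-bit integers limb-by-limb.

    Each of the four accumulators only ever receives 32-bit addends, so
    per-element updates could be done with 32/64-bit atomics; the single
    carry-propagation pass happens after all values are accumulated.
    Sign handling and >128-bit overflow are ignored in this sketch.
    """
    acc = [0, 0, 0, 0]  # one wide accumulator per limb position
    for v in values:
        for i, limb in enumerate(split_limbs(v)):
            acc[i] += limb
    # Fold each accumulator's overflow into the next limb position.
    result, carry = 0, 0
    for i in range(4):
        total = acc[i] + carry
        result |= (total & MASK32) << (32 * i)
        carry = total >> 32
    return result
```

The extra memory cost mentioned above comes from keeping four wide accumulators per group instead of one 128-bit value.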
Closing since the regression in Q67 was addressed by #4688 and Q32 diffs appear to be in the noise.
This week every decimal query was faster except for q32 and q67:
q32: 1590 0.76x (5029 ms - 6619 ms)
q67: 13475 0.90x (122125 ms - 135600 ms)
This was in spark2a, Spark 3.1.1, decimals, with the cuDF jar containing rapidsai/cudf@69e6dbb + rapids-4-spark_2.12-21.12.0-20211201.153412-67.jar, comparing against the previous week, which saw the performance regression for q9, q28 and others.
We need to run these queries in a loop and determine whether these diffs are noise or a real, sustained regression.
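That noise check can be sketched as follows (hypothetical helper; the timings, the `is_noise` name, and the 2-sigma threshold are illustrative assumptions, not project tooling): collect per-run timings from the loop, then treat a week-over-week diff as noise if it falls within the run-to-run spread.

```python
import statistics

def is_noise(timings_ms, observed_diff_ms, k=2.0):
    """Treat a week-over-week diff as noise if its magnitude is within
    k standard deviations of the run-to-run variation observed when
    the query is executed repeatedly. k=2.0 is an arbitrary default."""
    stdev = statistics.stdev(timings_ms)
    return abs(observed_diff_ms) <= k * stdev
```

For example, if five looped runs of a query land within a couple of milliseconds of each other, a diff of tens of milliseconds would flag as a real change rather than noise.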