Optimize groupby::scan
#9754
Conversation
Codecov Report
@@            Coverage Diff             @@
##          branch-22.02    #9754    +/-  ##
============================================
- Coverage        10.49%   10.42%   -0.07%
============================================
  Files              119      119
  Lines            20305    20471     +166
============================================
+ Hits              2130     2134       +4
- Misses           18175    18337     +162
Continue to review full report at Codecov.
This PR has been labeled
This would be nice to have working, as we use this in a number of areas.
Looks good, just have an idea to potentially make this even simpler.
Is there a benchmark that shows the perf benefit?
I think the original plan, short-circuiting in the sort_helper, was better. The helper is also used in rolling, so otherwise we'd have to fix this again there. Would that be possible, or would it require a bigger refactor?
@devavret It's definitely possible and actually involves fewer code changes than the current refactor. The question is really whether we want to materialize
You can memoize both the view and the owning column. Anyway, I now think your method would be cleaner.
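For illustration, here is a minimal, self-contained sketch of the two options discussed in this thread: short-circuiting on presorted input versus memoizing the materialized copy. A std::vector stands in for a cudf column, and every name below is hypothetical rather than the actual sort_helper internals:

```cpp
#include <algorithm>
#include <optional>
#include <span>
#include <vector>

// Hypothetical helper: hands out a view of the input when it is
// presorted, otherwise gathers once and memoizes the owning copy.
struct sort_helper_sketch {
  bool presorted;                          // known when the helper is built
  std::vector<int> const* input;           // non-owning view of the input values
  std::optional<std::vector<int>> cached;  // owning copy, materialized at most once

  std::span<int const> grouped_values()
  {
    if (presorted) { return *input; }            // short circuit: no copy made
    if (!cached) { cached = gathered(*input); }  // memoize the owning column
    return *cached;                              // later calls reuse the cache
  }

  // Stand-in for gathering values into group order.
  static std::vector<int> gathered(std::vector<int> v)
  {
    std::sort(v.begin(), v.end());
    return v;
  }
};
```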
@vuule I've just added the benchmark results. This PR brings roughly a 1.6x speedup for presorted scans.
BENCHMARK_DEFINE_F(Groupby, BasicSumScan)(::benchmark::State& state) { BM_basic_sum_scan(state); }
Using nvbench is preferred over googlebench for any new benchmarks. We want to move all our benchmarks to nvbench. If it's not much effort, please use nvbench here.
nvbench requires target kernels to take an explicit stream (see here), but the CUDA stream is not exposed in the public groupby::scan API. What I did in JOIN_NVBENCH was to expose stream arguments to the hash join APIs. Do we want to do the same for groupby::scan? @jrhemstad Any comments?
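For context, a minimal generic nvbench sketch (not the actual cudf benchmark, and independent of groupby) showing why the explicit stream matters: nvbench hands the measurement stream to the body via nvbench::launch, so whatever is being timed must accept it:

```cpp
#include <nvbench/nvbench.cuh>

// Build by linking against the nvbench::main target (provides main()).
// nvbench times work submitted on the stream it supplies through
// `launch`; an API that runs on an internal default stream (as the
// public groupby::scan currently does) cannot be measured this way
// without exposing a stream parameter.
void stream_example(nvbench::state& state)
{
  int* d_out{};
  cudaMalloc(&d_out, sizeof(int));
  state.exec([d_out](nvbench::launch& launch) {
    // Stand-in for the benchmarked call; note the explicit stream.
    cudaMemsetAsync(d_out, 0, sizeof(int), launch.get_stream());
  });
  cudaFree(d_out);
}
NVBENCH_BENCH(stream_example);
```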
I'd rather do this: NVIDIA/nvbench#13
OK. I will leave the new scan benchmark as gbench for this PR and look into NVIDIA/nvbench#13.
@gpucibot merge
Closes #8522

This PR gets rid of redundant rearranging in groupby::scan when the input values are presorted. Instead of a short circuit in sort_helper, it adds an early exit in the scan functor to avoid materializing sorted_values/grouped_values, thus reducing the memory footprint. This optimization brings a 1.6x speedup for presorted scan operations.
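To make the described early exit concrete, here is a small host-side C++ sketch (a toy model, not the cudf implementation): a grouped cumulative sum that only rearranges the input when it is not already in group order:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Grouped sum scan: for presorted keys, skip the sort/gather step
// entirely; this mirrors the materialization the PR avoids.
std::vector<int> group_sum_scan(std::vector<int> keys,
                                std::vector<int> vals,
                                bool presorted)
{
  if (!presorted) {
    // General path: rearrange (keys, vals) into group order,
    // which allocates and fills two intermediate buffers.
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    std::vector<int> k(keys.size()), v(vals.size());
    for (std::size_t i = 0; i < order.size(); ++i) {
      k[i] = keys[order[i]];
      v[i] = vals[order[i]];
    }
    keys = std::move(k);
    vals = std::move(v);
  }
  // Scan within each run of equal keys.
  for (std::size_t i = 1; i < vals.size(); ++i) {
    if (keys[i] == keys[i - 1]) { vals[i] += vals[i - 1]; }
  }
  return vals;
}
```

For example, with presorted = true, keys {1, 1, 2, 2} and vals {1, 2, 3, 4} yield {1, 3, 3, 7} without any intermediate copies.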