Grouped Aggregate in row format #2375
Conversation
I think that would help. Are we replacing HashAggregate completely with a new row-based aggregate, or do we want to support both? Does hash aggregate still have advantages for some use cases? Maybe we can have a config setting for which one to use?

Sorry to mix two things into one PR. I would divide this into separate PRs, one for each of these ideas.

I think the choice between row-based accumulator states vs
I am starting to check this out -- I'll try to finish today but I may run out of time.

@yjshen do you have any benchmark numbers you can share?
I have some superficial comments on the API design -- I hope to dig more into the implementation later today.
This is looking very cool @yjshen
I reviewed this code, and all in all I think it is great. Nice work @yjshen 🏆
I would like to see the following two things prior to approving this PR:
- See some sort of performance benchmarks showing this is faster than master
- Try this change against the IOx test suite
I will plan to try against IOx tomorrow.
Concern about Code Duplication
I am somewhat concerned that this PR ends up with a parallel implementation of GroupByHash as well as some of the aggregates.
This approach is fairly nice because it is backwards compatible, and thus this PR allows us to make incremental progress 👍
However, I worry that we now have two slightly different implementations which will diverge over time, and we will be stuck with these two forms forever if we don't have the discipline to complete the transition. This would make the codebase more complicated and harder to work with over time.
Perhaps I can make this less concerning by enumerating what work remains to entirely switch to RowAggregate (and remove AggregateStream entirely).
```rust
None => {
    this.finished = true;
    let timer = this.baseline_metrics.elapsed_compute().timer();
    let result = create_batch_from_map(
```
This code effectively makes one massive output record batch -- I think that is also what `GroupedHashAggregateStream` does, but it would be better in my opinion to stream the output (aka respect the `batch_size` configuration). Maybe we can file a ticket to do so.
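The chunking idea can be sketched independently of DataFusion's `RecordBatch` types; `emit_in_batches` and the `Vec`-of-rows representation below are illustrative stand-ins, not the actual API:

```rust
// Hypothetical sketch of respecting batch_size on output: instead of one
// massive batch, slice the aggregated rows into batch_size-sized chunks.
fn emit_in_batches<T: Clone>(rows: &[T], batch_size: usize) -> Vec<Vec<T>> {
    rows.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}

fn main() {
    let rows: Vec<u32> = (0..10).collect();
    let batches = emit_in_batches(&rows, 4);
    // 10 rows with batch_size = 4 gives batches of 4, 4, and 2 rows
    assert_eq!(batches.len(), 3);
    assert_eq!(batches[2], vec![8, 9]);
}
```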
Yes, I plan to do this in #1570 as my next step.
```diff
@@ -338,6 +455,42 @@ impl Accumulator for SumAccumulator {
     }
 }

+#[derive(Debug)]
+struct SumAccumulatorV2 {
+    index: usize,
```
I think it would help to document what this is an index into -- aka document what the parameters are.
Added a new method `state_index(&self) -> usize` and explained its meaning in the `RowAccumulator` doc.
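The shape of that documentation can be sketched as follows; the trait and struct names here are illustrative stand-ins, not DataFusion's exact API:

```rust
// Illustrative sketch of documenting what `index` points into: the
// accumulator's slot within the row-format state shared by a group.
trait RowAccumulatorSketch {
    /// Returns the index of the first field this accumulator owns inside
    /// the row-format state buffer shared by all accumulators of a group.
    fn state_index(&self) -> usize;
}

#[derive(Debug)]
struct SumAccumulatorSketch {
    /// Offset of this accumulator's state within the group's row.
    index: usize,
}

impl RowAccumulatorSketch for SumAccumulatorSketch {
    fn state_index(&self) -> usize {
        self.index
    }
}

fn main() {
    let acc = SumAccumulatorSketch { index: 2 };
    assert_eq!(acc.state_index(), 2);
}
```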
```rust
    s: &ScalarValue,
) -> Result<()> {
    match (dt, s) {
        // float64 coerces everything to f64
```
🤔 I almost wonder how valuable supporting all these types is -- like I wonder if we could use `u64` or `i64` accumulators for all integer types and `f64` for floats and reduce the code. I don't think this PR is making things any better or worse, but it just seems like these type `match` statements are so common and repetitive.
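The widening idea can be sketched like this; the `Value`/`Widened` enums are illustrative, not DataFusion types, and the point is that the accumulator logic shrinks to two arms:

```rust
// Sketch: widen every integer input to i64 and every float to f64 once,
// so the sum logic needs only two arms instead of one per primitive type.
enum Value { I8(i8), I16(i16), I32(i32), I64(i64), F32(f32), F64(f64) }

enum Widened { Int(i64), Float(f64) }

fn widen(v: &Value) -> Widened {
    match v {
        Value::I8(x) => Widened::Int(*x as i64),
        Value::I16(x) => Widened::Int(*x as i64),
        Value::I32(x) => Widened::Int(*x as i64),
        Value::I64(x) => Widened::Int(*x),
        Value::F32(x) => Widened::Float(*x as f64),
        Value::F64(x) => Widened::Float(*x),
    }
}

// The accumulator now only distinguishes Int vs Float.
fn add(acc: &mut Widened, v: &Value) {
    match (acc, widen(v)) {
        (Widened::Int(s), Widened::Int(x)) => *s += x,
        (Widened::Float(s), Widened::Float(x)) => *s += x,
        _ => panic!("type mismatch between accumulator and input"),
    }
}

fn main() {
    let mut acc = Widened::Int(0);
    for v in [Value::I8(1), Value::I16(2), Value::I32(3)] {
        add(&mut acc, &v);
    }
    match acc {
        Widened::Int(s) => assert_eq!(s, 6),
        _ => unreachable!(),
    }
}
```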
Agree. We should clean this up, probably by checking type coercions.
```rust
    accessor: &mut RowAccessor,
) -> Result<()> {
    let values = &values[0];
    add_to_row(&self.datatype, self.index, accessor, &sum_batch(values)?)?;
```
I wonder if it is needed to go through `sum_batch` here (which turns the sum into a `ScalarValue`) -- perhaps we could call the appropriate sum kernel followed by a direct update.
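A minimal sketch of that "sum kernel + direct update" path, with `RowAccessorSketch` and a plain f64 sum standing in for DataFusion's `RowAccessor` and the arrow sum kernel:

```rust
// Compute the batch sum natively and add it straight into the row state,
// skipping the intermediate ScalarValue round trip.
struct RowAccessorSketch<'a> {
    fields: &'a mut [f64],
}

impl<'a> RowAccessorSketch<'a> {
    fn add_f64(&mut self, index: usize, delta: f64) {
        self.fields[index] += delta;
    }
}

fn update_sum(values: &[f64], index: usize, accessor: &mut RowAccessorSketch) {
    // Stand-in for the sum kernel: a native f64 sum over the batch.
    let batch_sum: f64 = values.iter().sum();
    accessor.add_f64(index, batch_sum); // direct update, no ScalarValue
}

fn main() {
    let mut state = [0.0_f64; 3];
    let mut accessor = RowAccessorSketch { fields: &mut state };
    update_sum(&[1.0, 2.0, 3.0], 1, &mut accessor);
    assert_eq!(state[1], 6.0);
}
```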
I initially used `sum_batch` here mainly to reduce code duplication with that of `SumAccumulator`; besides, there's a decimal `sum_batch` that isn't included in the compute kernel yet.
Also, possibly related #2447
Thanks @alamb, for the detailed review ❤️. I'll try to answer or fix them today.

Micro benchmark: aggregate_query_sql

Existing aggregate_query_sql with a newly added case:

```rust
c.bench_function("aggregate_query_group_by_u64_multiple_keys", |b| {
    b.iter(|| {
        query(
            ctx.clone(),
            "SELECT u64_wide, utf8, MIN(f64), AVG(f64), COUNT(f64) \
             FROM t GROUP BY u64_wide, utf8",
        )
    })
});
```

I'm using a compound group by key with many distinct values in the newly added case. The results are:

The master branch

This PR

Improved: the newly introduced case with many distinct groups. Regressed: group by with fewer groups.

Edit: after moving the physical plan creation out of the bench timing, there's still no big difference in the flamegraph; AggregateStreamV1/V2::next only shows fewer than 10 samples and ~1% of all samples.

TPC-H query 1 (aggregate with four distinct states)

```shell
cargo run --release --features "mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path /home/yijie/sort_test/tpch-parquet --format parquet --query 1 --batch-size 4096
```

The master branch

This PR

No difference in performance is observed, which is expected since there are few groups and the work is mainly in-cache state calculation.

TPC-H q1 modified, with more groups:

The master branch

This PR:

There are noticeable performance improvements as the number of groups grows.
Co-authored-by: Andrew Lamb <[email protected]>
```diff
@@ -144,7 +173,8 @@ fn sum_decimal_batch(
 }

 // sums the array and returns a ScalarValue of its corresponding type.
-pub(crate) fn sum_batch(values: &ArrayRef) -> Result<ScalarValue> {
+pub(crate) fn sum_batch(values: &ArrayRef, sum_type: &DataType) -> Result<ScalarValue> {
+    let values = &cast(values, sum_type)?;
```
This is a partial fix for #2455. We should cast the input array to the sum result datatype first to alleviate the possibility of overflow. Further, we should have a wrapping sum kernel as well as a try_sum kernel to produce wrapped results or nulls in the case of overflow.
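Both ideas can be sketched in plain Rust; the function names below are illustrative, not kernel names from arrow:

```rust
// Cast (widen) before summing to avoid intermediate overflow, and a
// try_sum that yields None on overflow instead of wrapping.
fn sum_widened(values: &[i32]) -> i64 {
    // i32::MAX + i32::MAX would overflow i32; widening first keeps it exact.
    values.iter().map(|&v| v as i64).sum()
}

fn try_sum(values: &[i64]) -> Option<i64> {
    values.iter().try_fold(0i64, |acc, &v| acc.checked_add(v))
}

fn main() {
    assert_eq!(sum_widened(&[i32::MAX, i32::MAX]), 2 * (i32::MAX as i64));
    assert_eq!(try_sum(&[1, 2, 3]), Some(6));
    assert_eq!(try_sum(&[i64::MAX, 1]), None);
}
```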
I also wonder if we could internally consider summing smaller integers using u128 and then detecting overflow at the end. 🤔
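A sketch of that "sum in a wider type, detect overflow at the end" idea (using i128 for i64 inputs; the function name is illustrative):

```rust
// Accumulate i64 inputs in i128, then check the total fits back into i64.
fn sum_check_at_end(values: &[i64]) -> Option<i64> {
    let total: i128 = values.iter().map(|&v| v as i128).sum();
    i64::try_from(total).ok()
}

fn main() {
    assert_eq!(sum_check_at_end(&[i64::MAX, -1]), Some(i64::MAX - 1));
    assert_eq!(sum_check_at_end(&[i64::MAX, 1]), None);
}
```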
Those are very nice benchmark improvements @yjshen 👍
I ran the IOx test suite against this branch (https://github.com/influxdata/influxdb_iox/pull/4531) and it seems to have worked great
@alamb @andygrove I revisited our current row implementation and listed all the TODO items I could think of in #1861, and in the process I think we can eliminate these code duplications and continuously improve performance.
Which issue does this PR close?

Closes #2452.
Partly fixes #2455.

Rationale for this change

Using the row format (`Vec<u8>`) in grouped aggregate has several benefits over the current `Vec<ScalarValue>` states.

What changes are included in this PR?

- `Vec<u8>` row fields update.
- `AggregateExec` to use row-based group aggregate when applicable.

Are there any user-facing changes?

No.
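The row-format idea in this PR can be illustrated with a toy layout: fixed-width aggregate states packed at known offsets in one contiguous buffer instead of one `ScalarValue` allocation per state. The all-f64, little-endian layout below is illustrative only, not DataFusion's actual row format:

```rust
// Toy packed row: n f64 fields stored contiguously in a Vec<u8>.
struct RowSketch {
    data: Vec<u8>,
}

impl RowSketch {
    fn new(n_f64_fields: usize) -> Self {
        RowSketch { data: vec![0u8; n_f64_fields * 8] }
    }

    fn get_f64(&self, i: usize) -> f64 {
        let bytes: [u8; 8] = self.data[i * 8..(i + 1) * 8].try_into().unwrap();
        f64::from_le_bytes(bytes)
    }

    fn set_f64(&mut self, i: usize, v: f64) {
        self.data[i * 8..(i + 1) * 8].copy_from_slice(&v.to_le_bytes());
    }
}

fn main() {
    // One group's state for, say, MIN(f64) and SUM(f64) in a single buffer.
    let mut row = RowSketch::new(2);
    row.set_f64(0, 1.5);
    row.set_f64(1, 40.0);
    assert_eq!(row.get_f64(0), 1.5);
    assert_eq!(row.get_f64(1), 40.0);
}
```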