-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract GroupValues (#6969) #7016
Conversation
} | ||
|
||
// account for memory growth in scratch space | ||
*allocated += self.scratch_space.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't see a reason to keep track of the delta and not just use try_resize at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related response: #6932 (comment)
I am running benchmarks on this PR |
My benchmark runs confirm what @tustvold suspected that there is no major performance change for this PR
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me -- thank you @tustvold
I am very excited to see what this looks like with a special case for single columns (I expect it to be screaming fast)
@@ -60,6 +60,151 @@ pub(crate) enum ExecutionState { | |||
|
|||
use super::AggregateExec; | |||
|
|||
/// An interning store for group keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// An interning store for group keys | |
/// Stores group key values and their mapping to group_index | |
/// | |
/// This is a trait to allow special casing certain kinds of keys | |
/// like single column primitive arrays |
fn flush(&mut self) -> Result<Vec<ArrayRef>>; | ||
} | ||
|
||
/// A [`GroupValues`] making use of [`Rows`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// A [`GroupValues`] making use of [`Rows`] | |
/// A [`GroupValues`] which stores group values using the arrow_row format [`Rows`] |
/// | ||
/// keys: u64 hashes of the GroupValue | ||
/// values: (hash, group_index) | ||
map: RawTable<(u64, usize)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am surprised to see the map
in this structure -- I was expecting only the group values.
Is your idea to store the group keys inside the map, as shown to be so effective by @yahoNanJing in apache/arrow-rs#4524 (comment) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also was thinking this structure will allow for an easy insertion of the FixedWidthRowFormat if that turns out to be better as well. So I think this PR is a step in the right direction regardless of which approach we choose to take
@@ -60,6 +60,151 @@ pub(crate) enum ExecutionState { | |||
|
|||
use super::AggregateExec; | |||
|
|||
/// An interning store for group keys | |||
trait GroupValues: Send { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually I would love to see this in its own module aggregates/group_values.rs
but that can be a follow on PR for sure
batch_hashes.resize(n_rows, 0); | ||
create_hashes(cols, &self.random_state, batch_hashes)?; | ||
|
||
for (row, &hash) in batch_hashes.iter().enumerate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change will likely conflict with #6932 (review) but I think it should be straightforward to update
/// └────────────┘ ││ └────────┘ ││ ││ └────────┘ ││ | ||
/// │└────────────┘│ │└────────────┘│ | ||
/// └──────────────┘ └──────────────┘ | ||
/// ┌────────────┐ ┌──────────────┐ ┌──────────────┐ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we choose to go with this approach, I can make some pictures for RowGroupValues as well
} | ||
|
||
// account for memory growth in scratch space | ||
*allocated += self.scratch_space.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related response: #6932 (comment)
Which issue does this PR close?
Part of #6969
Rationale for this change
Extracts the group key storage into a type-erased trait object so that it can then be specialized for different column types.
I have not been able to run the benchmarks on this yet, but I don't expect it to have a major material impact
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?