Don't store hashes in GroupOrdering #7029

Merged
tustvold merged 3 commits into apache:main on Jul 19, 2023

Conversation

tustvold

Which issue does this PR close?

Closes #.

Rationale for this change

Storing hashes in GroupOrdering was causing merge conflicts for #7016 and is not actually necessary.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jul 19, 2023

@alamb alamb left a comment


Looks great to me

FYI @mustafasrepo and @ozankabak -- this should effectively improve the speed of streamed / bounded group by

for (idx, &hash) in hashes.iter().enumerate() {
self.map.insert(hash, (hash, idx), |(hash, _)| *hash);
self.group_ordering.remove_groups(n);
// SAFETY: self.map outlives iterator and is not modified concurrently

Comment on lines 628 to 634
unsafe {
    for bucket in self.map.iter() {
        match bucket.as_ref().1.checked_sub(n) {
            None => self.map.erase(bucket),
            Some(sub) => bucket.as_mut().1 = sub,
        }
    }

I think this is both wonderfully elegant as well as cryptic. How about some comments (this is so I don't have to refigure this out the next time I see this code):

Suggested change
unsafe {
    for bucket in self.map.iter() {
        match bucket.as_ref().1.checked_sub(n) {
            None => self.map.erase(bucket),
            Some(sub) => bucket.as_mut().1 = sub,
        }
    }

unsafe {
    for bucket in self.map.iter() {
        // decrement group index by n
        match bucket.as_ref().1.checked_sub(n) {
            // group index was < n, so remove from table
            None => self.map.erase(bucket),
            // group index was >= n, shift value down
            Some(sub) => bucket.as_mut().1 = sub,
        }
    }

I double checked https://docs.rs/hashbrown/latest/hashbrown/raw/struct.RawIter.html

You must not free the hash table while iterating (including via growing/shrinking).
It is fine to erase a bucket that has been yielded by the iterator.
Erasing a bucket that has not yet been yielded by the iterator may still result in the iterator yielding that bucket (unless reflect_remove is called).
It is unspecified whether an element inserted after the iterator was created will be yielded by that iterator (unless reflect_insert is called).
The order in which the iterator yields bucket is unspecified and may change in the future.

Which seems to be followed 👍
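
For readers who want to experiment with the shift-or-remove logic outside DataFusion, here is a minimal safe sketch. It is not the code in this PR: it swaps hashbrown's raw table for std's HashMap and uses retain instead of unsafe bucket iteration. The function name remove_first_n_groups and the map layout (hash -> group index) are assumptions made for illustration only.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the loop under review: shift every stored
/// group index down by `n`, dropping entries whose index was below `n`.
/// The map layout (hash -> group index) is an assumption for this sketch.
fn remove_first_n_groups(map: &mut HashMap<u64, usize>, n: usize) {
    map.retain(|_hash, group_idx| match group_idx.checked_sub(n) {
        // group index was < n, so drop the entry from the table
        None => false,
        // group index was >= n, keep the entry with its shifted index
        Some(shifted) => {
            *group_idx = shifted;
            true
        }
    });
}
```

With four groups at indexes 0 through 3 and n = 2, the entries for indexes 0 and 1 are removed and the remaining two end up at indexes 0 and 1. Because retain only removes entries as it visits them, it sidesteps the iterator rules quoted above, at the cost of not matching the raw-table API the PR actually uses.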

@tustvold tustvold merged commit a3db191 into apache:main Jul 19, 2023