-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade to hashbrown 0.15.1: migrate from hashbrown::raw::RawTable
to hashbrown::hash_table::HashTable
#13433
Comments
hashbrown::raw::RawTable
to hashbrown::hash_table::HashTable
hashbrown::raw::RawTable
to hashbrown::hash_table::HashTable
Thanks @crepererum for the steady progress on this |
TBH the last two batches are rather hard:
|
/// Calculates the size of the `PruningJoinHashMap` in bytes. | |
/// | |
/// # Returns | |
/// The size of the hash map in bytes. | |
pub(crate) fn size(&self) -> usize { | |
self.map.allocation_info().1.size() + self.next.capacity() * size_of::<u64>() | |
} |
hashbrown
0.15 offers that, but then we have to roll the RawTable
->HashTable
change together with the upgrade, not as separate steps.
ArrowHashTable
This is a bit of a wild interface and I dunno if it's worth the 15% uplift TBH:
datafusion/datafusion/physical-plan/src/aggregates/topk/hash_table.rs
Lines 63 to 86 in 47569b2
/// An interface to hide the generic type signature of TopKHashTable behind arrow arrays | |
pub trait ArrowHashTable { | |
fn set_batch(&mut self, ids: ArrayRef); | |
fn len(&self) -> usize; | |
// JUSTIFICATION | |
// Benefit: ~15% speedup + required to index into RawTable from binary heap | |
// Soundness: the caller must provide valid indexes | |
unsafe fn update_heap_idx(&mut self, mapper: &[(usize, usize)]); | |
// JUSTIFICATION | |
// Benefit: ~15% speedup + required to index into RawTable from binary heap | |
// Soundness: the caller must provide a valid index | |
unsafe fn heap_idx_at(&self, map_idx: usize) -> usize; | |
unsafe fn take_all(&mut self, indexes: Vec<usize>) -> ArrayRef; | |
// JUSTIFICATION | |
// Benefit: ~15% speedup + required to index into RawTable from binary heap | |
// Soundness: the caller must provide valid indexes | |
unsafe fn find_or_insert( | |
&mut self, | |
row_idx: usize, | |
replace_idx: usize, | |
map: &mut Vec<(usize, usize)>, | |
) -> (usize, bool); | |
} |
The issue here is that HashTable
doesn't offer an unsafe "directly address the bucket via index" kind of interface because it is fundamentally rather risky. I wonder if we should somewhat redesign this part. As far as I understand, this uses the following concepts:
- data heap: a "heap" (which I guess is just a vector) to store the actual payload data
- mutable slot: a slot that stores a mutable index to the data heap. I guess this is used to update the data pointer whenever a new/better value was found. The slots are referenced by a simple
usize
index. - key lookup: A way to lookup of key->mutable slot.
The current solution fuses 2 & 3 into a single RawTable
w/ a LOT of unsafe. I think we could deconstruct that into a HashTable
(for 3) + Vec
(for 2). This can be done BEFORE the hashbrown
0.15 update.
I think this is something @avantgardnerio and @thinkharderdev may know more about. Perhaps they can offer some suggestions / thoughts about improving the performance or some benchmarks that we can use to see how much performance would be affected if we switched to the newer hashbrown version |
The main reason that 15% was so important was to keep it on par with the unoptimized version. If CPU is equal, then the optimizer rule was easy: just always use this rule. If we are okay taking a performance hit, or if @crepererum 's idea keeps performance the same, then there's no reason to keep the unsafe code. |
The basic idea is just a hash map where the values are sorted so you can use as a combo aggregation by key + topk heap. @avantgardnerio and I went through a couple iterations on the design and landed on this because it doesn't regress performance when the grouping key is low cardinality. |
My idea was to ideally not regress significantly, at least not without discussing it first. |
What
Migrate from
hashbrown::raw::RawTable
tohashbrown::hash_table::HashTable
.Why
RawTable
andraw_entry
are removed in hashbrown 0.15.1. See #13256 . Personally I think it's the right thing to do becauseRawTable
was a bit of a weird API.How
First, we need to get some memory accounting for
HashTable
. This can be done in a similar way todatafusion/datafusion/common/src/utils/proxy.rs
Lines 110 to 111 in e25f5e7
Then, migrate every
RawTable
toHashTable
. SinceRawTable
is used all over the place, I think this should be split into multiple PRs. Luckily,HashTable
is already available in hashbrown 0.14.x, so we can do this iteratively before upgrading to hashbrown 0.15.x.Progress
hashbrown
RawTable
uses toHashTable
#13514hashbrown
RawTable
uses toHashTable
(round 2) #13524hashbrown
RawTable
uses toHashTable
(round 3) #13658The text was updated successfully, but these errors were encountered: