-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH]: replace get_*
methods on Arrow blocks with get_range()
#2934
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @codetheweb and the rest of your teammates on Graphite |
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
b3d0b4f
to
e8b32ab
Compare
7c5996d
to
353d385
Compare
e8b32ab
to
78f69e7
Compare
f47cbdb
to
03d6324
Compare
) -> Result<Vec<(K, V)>, Box<dyn ChromaError>> | ||
where | ||
PrefixRange: RangeBounds<&'prefix str> + Clone, | ||
KeyRange: RangeBounds<K> + Clone, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be possible to remove the Clone bound by allowing borrowed ranges but I'm not entirely sure how. Methods like block.get_range()
should accept both borrowed and owned ranges for ease of use. (E.x. I don't want to require users to do block.get_range(&(0..10))
)
a6287c4
to
1bc3f50
Compare
03d6324
to
00058c2
Compare
Looks good to me |
2652357
to
db61125
Compare
4930695
to
7a6c762
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad we can maintain one less binary search impl in the codebase now
let prefix = prefix_array.value(mid); | ||
let key = K::get(self.data.column(1), mid); | ||
let cmp = f((prefix, key)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometime key may not be required to derive an order (for example, when searching for start/end of prefix range). Since this is going to be the hot path in the system, I'm wondering if it's worth it to make key evaluation lazy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that but didn't see a clean way to do it. Do you have ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can ask the comparator to take in a key handle:
- Before:
F: FnMut((&'me str, K)) -> Ordering
- After:
F: FnMut(&'me str, C) -> Ordering, C: Fn() -> &'me K
And we pass in || K::get(...)
to the comparator.
Alternatively, ask the caller to pass in two comparators, one for prefix and the other for key. The second comparator is invoked only when the first yields Ordering::Equal
(basically using a then_with
)
Maybe there are better ways to do this, but this is what I have in mind right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this turns out to be somewhat difficult:
F: FnMut(&'me str, C) -> Ordering, C: Fn() -> &'me K
:
Can't work without Boxing the key fetch callback.
Alternatively, ask the caller to pass in two comparators, one for prefix and the other for key. The second comparator is invoked only when the first yields Ordering::Equal (basically using a then_with)
Makes life annoying for callers because the key cmp function parameter must be optional. But if None
is provided, Rust complains that that the parameter type is unknown.
Are you ok leaving as-is for now and revisiting if it shows up in flamegraphs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm totally fine with the current impl unless there is evidence showing that fetching the key will significantly slow us down
/// Finds the partition point of the prefix and key. | ||
/// Returns the index of the first element that matches the target prefix and key. If no element matches, returns the index at which the target prefix and key could be inserted to maintain sorted order. | ||
#[inline] | ||
fn get_key_prefix_partition_point<'me, K: ArrowReadableKey<'me>>( | ||
&'me self, | ||
prefix: &str, | ||
key: Option<&K>, | ||
) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking if we should decompose this into two functions for better code clarity:
fn binary_search_prefix_key<K>(&self, prefix: &str, key: &K) -> Result<usize, usize>
whereOk(i)
means the(prefix, key)
combination is found at indexi
andErr(i)
means it is not found but can be inserted ati
to maintain orderfn find_smallest_index_with_prefix(&self, prefix: &str) -> Result<usize, usize>
, whereOk(i)
means the prefix exists and starts ati
andErr(i)
means it is not found but can be inserted ati
with a key value to maintain order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn find_smallest_index_with_prefix(&self, prefix: &str) -> Result<usize, usize>, where Ok(i) means the prefix exists and starts at i and Err(i) means it is not found but can be inserted at i with a key value to maintain order
I think this has to return an Option, you can't infer an insert location from only a prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume in the case the where the prefix does not exist, you can insert the prefix with any key value at that location without disturbing the order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But an Option
should be sufficient too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦 yes you're right
I'll leave as an Option for now so find_smallest and find_largest are similar
6781f63
to
3d7c746
Compare
d72cd6e
to
c665051
Compare
#[inline] | ||
fn binary_search_index<'me, K: ArrowReadableKey<'me>>( | ||
fn binary_search_by<'me, K: ArrowReadableKey<'me>, F>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Curious, why does f have to be FnMut
as opposed to Fn
. Can't it work on immutable references?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that's based on the standard library implementation. Probably that leaves more flexibility for what can be passed in as a comparator function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks much cleaner. Thank you for cleaning it up!
3d7c746
to
7aca536
Compare
c665051
to
a219e15
Compare
a219e15
to
a6a50e8
Compare
Description of changes
Replaces specialized methods like get_gt and get_lt with a single get_range() method that behaves similarly to the std BTreeMap::range() method. This reduces complexity/repetition and also enables queries that are bounded in both directions.
Test plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?