[ENH]: replace get_* methods on Arrow blocks with get_range() #2934

Merged
merged 7 commits into from
Nov 4, 2024

Conversation

codetheweb
Contributor

@codetheweb codetheweb commented Oct 10, 2024

Description of changes

Replaces specialized methods like get_gt and get_lt with a single get_range() method that behaves similarly to the std BTreeMap::range() method. This reduces complexity/repetition and also enables queries that are bounded in both directions.
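For readers unfamiliar with the std API this is modeled on, here is a minimal sketch of how `BTreeMap::range()` covers "greater than", "less than", and doubly bounded queries with one method. The helpers `keys_in` and `sample` are hypothetical names used only for illustration; they are not part of this PR.

```rust
use std::collections::BTreeMap;
use std::ops::RangeBounds;

/// Hypothetical helper: collects the keys of `map` that fall inside `range`.
/// Any range syntax (`a..b`, `..=b`, a `(Bound, Bound)` tuple, ...) is accepted.
fn keys_in<R: RangeBounds<u32>>(map: &BTreeMap<u32, &'static str>, range: R) -> Vec<u32> {
    map.range(range).map(|(k, _)| *k).collect()
}

/// Hypothetical sample data, sorted by key as a BTreeMap always is.
fn sample() -> BTreeMap<u32, &'static str> {
    BTreeMap::from([(1, "a"), (3, "b"), (5, "c"), (7, "d")])
}
```

With this shape, a strictly-greater-than query is `keys_in(&sample(), (Bound::Excluded(3), Bound::Unbounded))`, a less-than query is `keys_in(&sample(), ..5)`, and a query bounded in both directions is `keys_in(&sample(), 3..=5)`.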

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?


Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

) -> Result<Vec<(K, V)>, Box<dyn ChromaError>>
where
PrefixRange: RangeBounds<&'prefix str> + Clone,
KeyRange: RangeBounds<K> + Clone,
Contributor Author

I think it should be possible to remove the Clone bound by allowing borrowed ranges, but I'm not entirely sure how. Methods like block.get_range() should accept both borrowed and owned ranges for ease of use. (E.g., I don't want to require users to do block.get_range(&(0..10)).)
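One possible direction, sketched under assumptions: because the `RangeBounds` accessors (`start_bound`, `end_bound`) take `&self`, a public method can take the range by value while internal helpers only borrow it, so the range never needs to be cloned. The function names and the `u32` key type below are hypothetical and much simpler than the real block code.

```rust
use std::ops::{Bound, RangeBounds};

/// Hypothetical helper: index of the first key inside the range.
/// It borrows the range, so calling it does not require `R: Clone`.
fn lower_index<R: RangeBounds<u32>>(keys: &[u32], range: &R) -> usize {
    match range.start_bound() {
        Bound::Included(s) => keys.partition_point(|k| k < s),
        Bound::Excluded(s) => keys.partition_point(|k| k <= s),
        Bound::Unbounded => 0,
    }
}

/// Hypothetical helper: index one past the last key inside the range.
fn upper_index<R: RangeBounds<u32>>(keys: &[u32], range: &R) -> usize {
    match range.end_bound() {
        Bound::Included(e) => keys.partition_point(|k| k <= e),
        Bound::Excluded(e) => keys.partition_point(|k| k < e),
        Bound::Unbounded => keys.len(),
    }
}

/// Takes the range by value so callers can write `get_range(keys, 2..5)`,
/// but only lends it to the helpers. `keys` must be sorted ascending.
fn get_range<R: RangeBounds<u32>>(keys: &[u32], range: R) -> &[u32] {
    let lo = lower_index(keys, &range);
    // Guard against inverted ranges by never letting the end precede the start.
    let hi = upper_index(keys, &range).max(lo);
    &keys[lo..hi]
}
```

Whether this applies to the actual signature depends on how the prefix and key ranges are threaded through the block and sparse index code, which this sketch does not model.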

@codetheweb codetheweb marked this pull request as ready for review October 16, 2024 18:43
@codetheweb codetheweb force-pushed the feat-sparse-index-range-api branch from a6287c4 to 1bc3f50 Compare October 16, 2024 19:03
@codetheweb codetheweb force-pushed the feat-arrow-block-range-api branch from 03d6324 to 00058c2 Compare October 16, 2024 19:03
@Sicheng-Pan
Contributor

Looks good to me

@codetheweb codetheweb requested a review from HammadB October 18, 2024 17:08
@codetheweb codetheweb force-pushed the feat-arrow-block-range-api branch from 2652357 to db61125 Compare October 18, 2024 17:16
@codetheweb codetheweb force-pushed the feat-arrow-block-range-api branch from 4930695 to 7a6c762 Compare October 24, 2024 22:14
Contributor

@Sicheng-Pan Sicheng-Pan left a comment

Glad we can maintain one less binary search impl in the codebase now

Comment on lines +172 to +174
let prefix = prefix_array.value(mid);
let key = K::get(self.data.column(1), mid);
let cmp = f((prefix, key));
Contributor

Sometimes the key is not needed to derive an ordering (for example, when searching for the start or end of a prefix range). Since this is going to be on the hot path in the system, I'm wondering whether it's worth making key evaluation lazy?

Contributor Author

I thought about that but didn't see a clean way to do it. Do you have ideas?

Contributor

Maybe we can ask the comparator to take in a key handle:

  • Before: F: FnMut((&'me str, K)) -> Ordering
  • After: F: FnMut(&'me str, C) -> Ordering, C: Fn() -> &'me K

And we pass in || K::get(...) to the comparator.

Alternatively, ask the caller to pass in two comparators, one for prefix and the other for key. The second comparator is invoked only when the first yields Ordering::Equal (basically using a then_with)

Maybe there are better ways to do this, but this is what I have in mind right now.
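To make the second suggestion concrete, here is a rough sketch of a binary search that takes two comparators and only reads the key column when the prefix comparison yields `Ordering::Equal`. It is an illustration under assumptions, not the PR's code: plain slices stand in for the Arrow columns, the key type is fixed to `u32`, and `binary_search_lazy` is a hypothetical name.

```rust
use std::cmp::Ordering;

/// Binary search over rows sorted by (prefix, key).
/// Each comparator returns the ordering of the probed row relative to the target,
/// matching the convention of `slice::binary_search_by`.
fn binary_search_lazy<P, Q>(
    prefixes: &[&str],
    keys: &[u32],
    mut cmp_prefix: P,
    mut cmp_key: Q,
) -> Result<usize, usize>
where
    P: FnMut(&str) -> Ordering,
    Q: FnMut(u32) -> Ordering,
{
    assert_eq!(prefixes.len(), keys.len());
    let (mut lo, mut hi) = (0, prefixes.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        // `then_with` only runs its closure on Equal, so `keys[mid]` is
        // fetched lazily: rows with a different prefix never touch the key.
        let ord = cmp_prefix(prefixes[mid]).then_with(|| cmp_key(keys[mid]));
        match ord {
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
            Ordering::Equal => return Ok(mid),
        }
    }
    Err(lo)
}
```

A caller searching for `("b", 5)` would pass `|p| p.cmp("b")` and `|k| k.cmp(&5)`; a prefix-only search still has to supply some key comparator, which is the ergonomic cost of this shape.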

Contributor Author

yeah, that works

Contributor Author

this turns out to be somewhat difficult:

F: FnMut(&'me str, C) -> Ordering, C: Fn() -> &'me K:

Can't work without Boxing the key fetch callback.

Alternatively, ask the caller to pass in two comparators, one for prefix and the other for key. The second comparator is invoked only when the first yields Ordering::Equal (basically using a then_with)

Makes life annoying for callers because the key cmp function parameter must be optional. But if None is provided, Rust complains that the parameter type is unknown.


Are you ok leaving as-is for now and revisiting if it shows up in flamegraphs?
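The inference problem with an optional comparator can be reproduced in a small, self-contained sketch. The function `compare` and its parameters are hypothetical stand-ins, not the PR's API; the point is only that a bare `None` gives the compiler nothing from which to infer the generic parameter `Q`, so the caller has to name a type explicitly.

```rust
use std::cmp::Ordering;

/// Hypothetical comparator helper with an optional key comparator.
fn compare<P, Q>(prefix: &str, key: u32, cmp_prefix: P, cmp_key: Option<Q>) -> Ordering
where
    P: Fn(&str) -> Ordering,
    Q: Fn(u32) -> Ordering,
{
    let by_prefix = cmp_prefix(prefix);
    match cmp_key {
        // The key comparator only runs when the prefixes compare Equal.
        Some(f) => by_prefix.then_with(|| f(key)),
        None => by_prefix,
    }
}

/// Prefix-only comparison.
fn prefix_only(prefix: &str, target: &str) -> Ordering {
    // Writing a bare `None` here does not compile, because `Q` cannot be inferred:
    //     compare(prefix, 0, |p| p.cmp(target), None)
    // The caller must spell out some function type, even though it is never called.
    compare(prefix, 0, |p| p.cmp(target), None::<fn(u32) -> Ordering>)
}
```

The turbofish on `None` works, but it leaks an implementation detail into every prefix-only call site, which matches the annoyance described above.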

Contributor

I'm totally fine with the current impl unless there is evidence showing that fetching the key will significantly slow us down

rust/blockstore/src/arrow/block/types.rs (outdated)
Comment on lines 239 to 279
/// Finds the partition point of the prefix and key.
/// Returns the index of the first element that matches the target prefix and key. If no element matches, returns the index at which the target prefix and key could be inserted to maintain sorted order.
#[inline]
fn get_key_prefix_partition_point<'me, K: ArrowReadableKey<'me>>(
&'me self,
prefix: &str,
key: Option<&K>,
) -> usize {
Contributor

I'm wondering whether we should decompose this into two functions for better code clarity:

  • fn binary_search_prefix_key<K>(&self, prefix: &str, key: &K) -> Result<usize, usize> where Ok(i) means the (prefix, key) combination is found at index i and Err(i) means it is not found but can be inserted at i to maintain order
  • fn find_smallest_index_with_prefix(&self, prefix: &str) -> Result<usize, usize>, where Ok(i) means the prefix exists and starts at i and Err(i) means it is not found but can be inserted at i with a key value to maintain order
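The two proposed functions could look roughly like the sketch below. It follows the signatures as written in the bullets (returning `Result<usize, usize>`), but it is only an illustration: a slice of `(&str, u32)` tuples sorted by (prefix, key) stands in for the Arrow block, and the functions are free functions rather than methods.

```rust
/// Ok(i): the (prefix, key) pair is at index i.
/// Err(i): not found; inserting at i keeps the rows sorted.
/// Assumes each (prefix, key) pair appears at most once.
fn binary_search_prefix_key(rows: &[(&str, u32)], prefix: &str, key: u32) -> Result<usize, usize> {
    rows.binary_search_by(|(p, k)| (*p).cmp(prefix).then_with(|| k.cmp(&key)))
}

/// Ok(i): the prefix exists and its first row is at index i.
/// Err(i): the prefix is absent; a row with this prefix (and any key)
/// can be inserted at i without disturbing the order.
fn find_smallest_index_with_prefix(rows: &[(&str, u32)], prefix: &str) -> Result<usize, usize> {
    // First index whose prefix is not smaller than the target prefix.
    let i = rows.partition_point(|(p, _)| *p < prefix);
    match rows.get(i) {
        Some((p, _)) if *p == prefix => Ok(i),
        _ => Err(i),
    }
}
```

Swapping the return type of the second function to `Option<usize>` only requires mapping `Err(_)` to `None`.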

Contributor Author

fn find_smallest_index_with_prefix(&self, prefix: &str) -> Result<usize, usize>, where Ok(i) means the prefix exists and starts at i and Err(i) means it is not found but can be inserted at i with a key value to maintain order

I think this has to return an Option; you can't infer an insert location from only a prefix.

Contributor

I assume that in the case where the prefix does not exist, you can insert the prefix with any key value at that location without disturbing the order

Contributor

But an Option should be sufficient too

Contributor Author

🤦 yes you're right
I'll leave as an Option for now so find_smallest and find_largest are similar

rust/blockstore/src/arrow/block/types.rs
@codetheweb codetheweb force-pushed the feat-sparse-index-range-api branch from 6781f63 to 3d7c746 Compare October 25, 2024 23:33
@codetheweb codetheweb force-pushed the feat-arrow-block-range-api branch from d72cd6e to c665051 Compare October 25, 2024 23:33
#[inline]
fn binary_search_index<'me, K: ArrowReadableKey<'me>>(
fn binary_search_by<'me, K: ArrowReadableKey<'me>, F>(
Contributor

Nice! Curious: why does f have to be FnMut as opposed to Fn? Can't it work on immutable references?

Contributor

I believe that's based on the standard library implementation. Probably that leaves more flexibility for what can be passed in as a comparator function.
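As a small illustration of that flexibility: the standard library's `slice::binary_search_by` takes an `FnMut` comparator, so a caller can pass a closure that mutates captured state, such as a comparison counter. Every `Fn` closure is also `FnMut`, so the looser bound costs callers nothing. The function name `search_counting` is hypothetical.

```rust
/// Searches a sorted slice and also reports how many comparisons were made.
fn search_counting(sorted: &[u32], target: u32) -> (Result<usize, usize>, usize) {
    let mut comparisons = 0;
    let result = sorted.binary_search_by(|probe| {
        // Mutating a captured variable makes this closure FnMut, not Fn.
        comparisons += 1;
        probe.cmp(&target)
    });
    (result, comparisons)
}
```

If the bound were `Fn`, this closure would be rejected and the counter would need interior mutability (for example a `Cell`).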

Contributor

@sanketkedia sanketkedia left a comment

this looks much cleaner. Thank you for cleaning it up!

@codetheweb codetheweb force-pushed the feat-sparse-index-range-api branch from 3d7c746 to 7aca536 Compare October 29, 2024 16:54
@codetheweb codetheweb force-pushed the feat-arrow-block-range-api branch from c665051 to a219e15 Compare October 29, 2024 16:54
Contributor Author

codetheweb commented Nov 4, 2024

Merge activity

  • Nov 4, 1:28 PM EST: A user started a stack merge that includes this pull request via Graphite.
  • Nov 4, 1:32 PM EST: Graphite rebased this pull request as part of a merge.
  • Nov 4, 1:33 PM EST: A user merged this pull request with Graphite.

@codetheweb codetheweb changed the base branch from feat-sparse-index-range-api to graphite-base/2934 November 4, 2024 18:29
@codetheweb codetheweb changed the base branch from graphite-base/2934 to main November 4, 2024 18:30
@codetheweb codetheweb force-pushed the feat-arrow-block-range-api branch from a219e15 to a6a50e8 Compare November 4, 2024 18:31
@codetheweb codetheweb merged commit 3cd54de into main Nov 4, 2024
68 of 69 checks passed
4 participants