[ENH]: replace `get_*` methods on Arrow blocks with `get_range()` #2934

codetheweb · 2024-10-10T23:47:32Z

Description of changes

Replaces specialized methods like get_gt and get_lt with a single get_range() method that behaves similarly to the std BTreeMap::range() method. This reduces complexity/repetition and also enables queries that are bounded in both directions.
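For readers unfamiliar with the `BTreeMap::range()` semantics referenced above, here is a minimal sketch on a plain `BTreeMap` (not the Arrow block itself). The `get_range` and `sample` functions below are hypothetical stand-ins for illustration; the real method signature lives in the diff.

```rust
use std::collections::BTreeMap;
use std::ops::RangeBounds;

/// Collects every entry whose key falls inside `range`.
/// One generic function covers "greater than", "less than",
/// and queries bounded on both sides.
fn get_range<R>(map: &BTreeMap<u32, &'static str>, range: R) -> Vec<(u32, &'static str)>
where
    R: RangeBounds<u32>,
{
    map.range(range).map(|(k, v)| (*k, *v)).collect()
}

/// Small fixture used to illustrate the calls below.
fn sample() -> BTreeMap<u32, &'static str> {
    BTreeMap::from([(1, "a"), (2, "b"), (3, "c"), (4, "d")])
}
```

With this shape, `get_range(&sample(), 2..)` plays the role of a "greater than or equal" query, `get_range(&sample(), ..3)` a "less than" query, and `get_range(&sample(), 2..=3)` is bounded in both directions. A strict "greater than" needs an explicit `(Bound::Excluded(2), Bound::Unbounded)` tuple.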

Test plan

How are these changes tested?

Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

codetheweb · 2024-10-16T18:42:58Z

rust/blockstore/src/arrow/blockfile.rs

+    ) -> Result<Vec<(K, V)>, Box<dyn ChromaError>>
+    where
+        PrefixRange: RangeBounds<&'prefix str> + Clone,
+        KeyRange: RangeBounds<K> + Clone,


I think it should be possible to remove the Clone bound by allowing borrowed ranges, but I'm not entirely sure how. Methods like block.get_range() should accept both borrowed and owned ranges for ease of use (e.g., I don't want to require users to write block.get_range(&(0..10))).
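One possible direction, sketched on a sorted `u32` slice rather than on the blockfile: keep the public method taking the range by value, and have internal helpers borrow it. Everything here (`start_index`, `end_index`, and this `get_range`) is a hypothetical illustration, and it is an assumption that the same trick carries over to the nested prefix/key ranges in this PR.

```rust
use std::ops::{Bound, RangeBounds};

/// Index of the first element that satisfies the range's lower bound.
fn start_index<R: RangeBounds<u32>>(sorted: &[u32], range: &R) -> usize {
    match range.start_bound() {
        Bound::Included(s) => sorted.partition_point(|x| x < s),
        Bound::Excluded(s) => sorted.partition_point(|x| x <= s),
        Bound::Unbounded => 0,
    }
}

/// Index one past the last element that satisfies the range's upper bound.
fn end_index<R: RangeBounds<u32>>(sorted: &[u32], range: &R) -> usize {
    match range.end_bound() {
        Bound::Included(e) => sorted.partition_point(|x| x <= e),
        Bound::Excluded(e) => sorted.partition_point(|x| x < e),
        Bound::Unbounded => sorted.len(),
    }
}

/// Takes the range by value so callers can write `get_range(data, 0..10)`.
/// The helpers only borrow it, so no `Clone` bound is needed here.
fn get_range<R: RangeBounds<u32>>(sorted: &[u32], range: R) -> &[u32] {
    let start = start_index(sorted, &range);
    let end = end_index(sorted, &range);
    // Clamp so an inverted or empty range yields an empty slice instead of panicking.
    &sorted[start..end.max(start)]
}
```

The caller still passes an owned range, and the range is inspected twice without being cloned.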

rust/blockstore/src/arrow/block/types.rs

Sicheng-Pan · 2024-10-16T22:03:59Z

Looks good to me

rust/blockstore/src/arrow/block/types.rs

Sicheng-Pan

Glad we can maintain one less binary search impl in the codebase now

Sicheng-Pan · 2024-10-25T20:48:03Z

rust/blockstore/src/arrow/block/types.rs

+            let prefix = prefix_array.value(mid);
+            let key = K::get(self.data.column(1), mid);
+            let cmp = f((prefix, key));


Sometimes the key is not needed to derive an ordering (for example, when searching for the start/end of a prefix range). Since this is going to be the hot path in the system, I'm wondering if it's worth making key evaluation lazy?

I thought about that but didn't see a clean way to do it. Do you have ideas?

Maybe we can ask the comparator to take in a key handle:

Before: F: FnMut((&'me str, K)) -> Ordering

After: F: FnMut(&'me str, C) -> Ordering, C: Fn() -> &'me K

And we pass in || K::get(...) to the comparator.

Alternatively, ask the caller to pass in two comparators, one for prefix and the other for key. The second comparator is invoked only when the first yields Ordering::Equal (basically using a then_with)

Maybe there are better ways to do this, but this is what I have in mind right now.

yeah, that works

this turns out to be somewhat difficult:

F: FnMut(&'me str, C) -> Ordering, C: Fn() -> &'me K:

This can't work without boxing the key-fetch callback.

Alternatively, ask the caller to pass in two comparators, one for prefix and the other for key. The second comparator is invoked only when the first yields Ordering::Equal (basically using a then_with)

This makes life annoying for callers because the key comparator parameter must be optional. But if None is provided, Rust complains that the parameter type is unknown.

Are you ok leaving as-is for now and revisiting if it shows up in flamegraphs?

I'm totally fine with the current impl unless there is evidence showing that fetching the key will significantly slow us down
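For reference, the two-comparator idea from this thread can be sketched as follows. This is a simplified illustration over index-based comparators, not the code in this PR: `binary_search_lazy` and `demo` are hypothetical names, and plain tuples stand in for the Arrow columns. Both comparators are required here, so the sketch does not address the optional-parameter inference problem described above.

```rust
use std::cmp::Ordering;

/// Binary search over `len` rows sorted by (prefix, key).
/// `cmp_prefix(i)` compares row i's prefix with the target prefix;
/// `cmp_key(i)` compares row i's key with the target key.
/// `cmp_key` runs only when the prefixes compare equal.
fn binary_search_lazy<P, K>(len: usize, mut cmp_prefix: P, mut cmp_key: K) -> Result<usize, usize>
where
    P: FnMut(usize) -> Ordering,
    K: FnMut(usize) -> Ordering,
{
    let (mut lo, mut hi) = (0, len);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        match cmp_prefix(mid).then_with(|| cmp_key(mid)) {
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
            Ordering::Equal => return Ok(mid),
        }
    }
    Err(lo)
}

/// Searches `rows` for (prefix, key) and reports how many keys were read.
fn demo(rows: &[(&str, u32)], prefix: &str, key: u32) -> (Result<usize, usize>, usize) {
    let mut key_reads = 0;
    let found = binary_search_lazy(
        rows.len(),
        |i| rows[i].0.cmp(prefix),
        |i| {
            key_reads += 1; // stands in for the cost of fetching the key
            rows[i].1.cmp(&key)
        },
    );
    (found, key_reads)
}
```

In this sketch, probes that land on a different prefix never touch the key, so a search for an absent prefix reads no keys at all.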

rust/blockstore/src/arrow/block/types.rs

Sicheng-Pan · 2024-10-25T21:07:05Z

rust/blockstore/src/arrow/block/types.rs

+    /// Finds the partition point of the prefix and key.
+    /// Returns the index of the first element that matches the target prefix and key. If no element matches, returns the index at which the target prefix and key could be inserted to maintain sorted order.
+    #[inline]
+    fn get_key_prefix_partition_point<'me, K: ArrowReadableKey<'me>>(
+        &'me self,
+        prefix: &str,
+        key: Option<&K>,
+    ) -> usize {


I'm wondering whether we should decompose this into two functions for better code clarity:

fn binary_search_prefix_key<K>(&self, prefix: &str, key: &K) -> Result<usize, usize> where Ok(i) means the (prefix, key) combination is found at index i and Err(i) means it is not found but can be inserted at i to maintain order

fn find_smallest_index_with_prefix(&self, prefix: &str) -> Result<usize, usize>, where Ok(i) means the prefix exists and starts at i and Err(i) means it is not found but can be inserted at i with a key value to maintain order

fn find_smallest_index_with_prefix(&self, prefix: &str) -> Result<usize, usize>, where Ok(i) means the prefix exists and starts at i and Err(i) means it is not found but can be inserted at i with a key value to maintain order

I think this has to return an Option, you can't infer an insert location from only a prefix

I assume that in the case where the prefix does not exist, you can insert the prefix with any key value at that location without disturbing the order

But an Option should be sufficient too

🤦 yes you're right
I'll leave as an Option for now so find_smallest and find_largest are similar
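A rough sketch of the decomposition discussed in this thread, written against a plain sorted slice of `(prefix, key)` tuples rather than the Arrow arrays. The function names follow the suggestion above, but the bodies and the `find_largest_index_with_prefix` counterpart are assumptions made for illustration.

```rust
/// `rows` must be sorted by (prefix, key).
/// Ok(i): the pair is at index i. Err(i): not found, but inserting at i keeps the order.
fn binary_search_prefix_key(rows: &[(&str, u32)], prefix: &str, key: u32) -> Result<usize, usize> {
    rows.binary_search_by(|(p, k)| (*p).cmp(prefix).then_with(|| k.cmp(&key)))
}

/// Index of the first row with `prefix`, or None if the prefix is absent.
fn find_smallest_index_with_prefix(rows: &[(&str, u32)], prefix: &str) -> Option<usize> {
    // First index whose prefix is not smaller than the target.
    let i = rows.partition_point(|(p, _)| *p < prefix);
    if i < rows.len() && rows[i].0 == prefix {
        Some(i)
    } else {
        None
    }
}

/// Index of the last row with `prefix`, or None if the prefix is absent.
fn find_largest_index_with_prefix(rows: &[(&str, u32)], prefix: &str) -> Option<usize> {
    // First index whose prefix is strictly greater than the target.
    let i = rows.partition_point(|(p, _)| *p <= prefix);
    if i > 0 && rows[i - 1].0 == prefix {
        Some(i - 1)
    } else {
        None
    }
}
```

Returning `Option` from both prefix helpers keeps them symmetric, which matches the reasoning given above for leaving it as an `Option`.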

rust/blockstore/src/arrow/block/types.rs

sanketkedia · 2024-10-29T06:32:26Z

rust/blockstore/src/arrow/block/types.rs

    #[inline]
-    fn binary_search_index<'me, K: ArrowReadableKey<'me>>(
+    fn binary_search_by<'me, K: ArrowReadableKey<'me>, F>(


Nice! Curious: why does f have to be FnMut as opposed to Fn? Can't it work on immutable references?

I believe that's based on the standard library implementation. Probably that leaves more flexibility for what can be passed in as a comparator function.
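To illustrate the flexibility point: the standard library's `slice::binary_search_by` takes an `FnMut` comparator, and every `Fn` closure is also `FnMut`, so the looser bound accepts strictly more closures. A small sketch (the `search_counting` helper is hypothetical) with a comparator that mutates captured state:

```rust
/// Searches `sorted` for `target` and reports how many comparisons ran.
fn search_counting(sorted: &[u32], target: u32) -> (Result<usize, usize>, usize) {
    let mut comparisons = 0;
    let result = sorted.binary_search_by(|probe| {
        // Mutating captured state makes this closure FnMut rather than Fn.
        comparisons += 1;
        probe.cmp(&target)
    });
    (result, comparisons)
}
```

If the bound were `Fn`, this closure would be rejected, while a closure that only reads its captures compiles under either bound.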

sanketkedia

this looks much cleaner. Thank you for cleaning it up!

codetheweb · 2024-11-04T18:28:38Z

Merge activity

Nov 4, 1:28 PM EST: A user started a stack merge that includes this pull request via Graphite.
Nov 4, 1:32 PM EST: Graphite rebased this pull request as part of a merge.
Nov 4, 1:33 PM EST: A user merged this pull request with Graphite.

This was referenced Oct 10, 2024

[ENH]: replace get_block_ids_* with get_block_ids_range() in SparseIndex #2921

Merged

[ENH]: replace get_* methods on memory blockfile impl with get_range() #2935

Merged

[ENH]: replace .get_* methods on blockfile API with .get_range() #2936

Merged

codetheweb force-pushed the feat-sparse-index-range-api branch from b3d0b4f to e8b32ab Compare October 11, 2024 22:48

codetheweb force-pushed the feat-arrow-block-range-api branch from 7c5996d to 353d385 Compare October 11, 2024 22:48

codetheweb force-pushed the feat-sparse-index-range-api branch from e8b32ab to 78f69e7 Compare October 16, 2024 17:57

codetheweb force-pushed the feat-arrow-block-range-api branch 2 times, most recently from f47cbdb to 03d6324 Compare October 16, 2024 18:28

codetheweb marked this pull request as ready for review October 16, 2024 18:43

codetheweb requested a review from Sicheng-Pan October 16, 2024 18:44

codetheweb force-pushed the feat-sparse-index-range-api branch from a6287c4 to 1bc3f50 Compare October 16, 2024 19:03

codetheweb force-pushed the feat-arrow-block-range-api branch from 03d6324 to 00058c2 Compare October 16, 2024 19:03
