-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Add new value type for spann posting lists #3022
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
Please tag your PR title with one of: [ENH | BUG | DOC | TST | BLD | PERF | TYP | CLN | CHORE]. See https://docs.trychroma.com/contributing#contributing-code-and-ideas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
7c118ff
to
5cd3ed1
Compare
pub struct SpannPostingList<'referred_data> { | ||
pub doc_offset_ids: &'referred_data [u32], | ||
pub doc_versions: &'referred_data [u32], | ||
// Flattened out embeddings for all documents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that we assume dimension is bookkeeped elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well the dimension is assumed to be doc_embeddings.len() / doc_offset_ids.len()
lol. Curious, do you think of that as brittle and error prone?
@@ -250,7 +250,7 @@ impl<V: ArrowWriteableValue<SizeTracker = SingleColumnSizeTracker>> SingleColumn | |||
) | |||
.into_inner(); | |||
|
|||
let mut value_builder = V::get_arrow_builder(inner.size_tracker); | |||
let mut value_builder = V::get_arrow_builder(inner.size_tracker.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This clone could be hairy! What if code makes assumptions about it. Does this really need to take ownership?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah can we change this to not take ownership - I know thats out of scope of your PR but this is bug-prone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this interface is very restrictive. It assumes that only the size tracker is needed to construct the builder. You could require more than just the size information for that. For instance, I needed the embedding dimension (which is conveniently a part of size tracker as an optional field since some value types don't possess that).
Even below, For V::finish(), I needed the embedding dimension so I had to add the size_tracker param again.
In general, It's hard to predict ahead of time what future value types added will need to construct their corresponding arrow arrays. Having a broad API V::into_arrow(delta) that returns the arrow array and the field from the delta is sufficient IMO.
Curious on thoughts @HammadB @codetheweb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would largely be fixed with the generic tuple API we've talked about before? e.g. then we wouldn't need specialization like this (I suppose if you wanted to have Vec<f32> -> FixedSizeList
as an optimization you would still need a separate wrapper trait)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have some comments that would be good to address before merging.
Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
None