-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set distance function and other index params #228
Conversation
@levand would you mind taking a look at this? Particularly interested in:
|
@jeffchuber This makes sense to me, but it's slightly misleading, because the HNSW params are not an attribute of the collection, they are an attribute of the index (and a collection could conceivably have an index of a different type, or multiple indexes.) In the future we'll definitely want to introduce a more flexible way for the user to express their intent of how a collection should be indexed rather than making it a top-level param to Under the hood, we'll likely just store these index configurations in a segment's metadata anyway. Would you be ok surfacing that to the user? api.create_collection("my_collection", metadata={
'hnsw:space': 'cosine',
'hnsw:ef': 100,
'hnsw:m': 20})
}) The disadvantage here of course is that we lose type checking in the Python client. So if we want to preserve that, for better DX, then I'd suggest the following minor tweak to your code, and make the signature of IndexType = Literal["hnsw", "other"]
IndexParams = Union[HnswIndexParams, OtherIndexParams]
index_params: Optional[dict[IndexType, IndexParams]] = None, That way we have types for users, but aren't baking HNSW into the API as a special/only thing. And we'd still just store them as namespaced metadata keys under the hood. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See question about what happens to index params after modification.
Also, what's up with the massive diff in the tests file and all the commented out tests? Seems wrong?
chromadb/api/__init__.py
Outdated
@@ -118,6 +126,7 @@ def _modify( | |||
current_name: str, | |||
new_name: Optional[str] = None, | |||
new_metadata: Optional[Dict] = None, | |||
new_index_params: Optional[HnswIndexParams] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does modification mean here? Changing index params would mean rebuilding the whole index. Seems like we should either do that, or error if you try to change the params after the index has already been built.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for now we just don't allow index modification.
@@ -81,16 +94,31 @@ def get_or_create_collection( | |||
name: str, | |||
metadata: Optional[Dict] = None, | |||
embedding_function: Optional[Callable] = None, | |||
index_params: Optional[HnswIndexParams] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something we'll have to deal with alongside persisting embedding functions, but fine for now.
@atroyn any feedback on the concept of storing index params as metadata rather than adding a new column? Less complexity and risk around schema breakage. |
Also just for situational awareness: #214 contains a more comprehensive approach to migrations (e.g. https://github.com/chroma-core/chroma/blob/lukev/decoupled-architecture/migrations/sysdb/00002-segments.duckdb.sql), so the migration code in this PR will be short-lived regardless (assuming we accept the approach in 214) |
This functionality landed in #245 |
Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?